Transform a string to an array: with regex, split or explode: I need some help in a simple string!

good day dear community,

well i am in big big trouble - i need some regex to solve a problem! Can you help me a bit! That would be great! Well - i mused alot how to call the subject: Finally i came to: “Regex or explode to array: I need some help in a simple string!”

i have a spreadsheed in calc. with some records. There is a column that contains the following information

Ecole Saint-Exupery
Rue Saint-Malo 24
67544 Paris

Well i need to have those lines divided into at least three columns

name: Ecole Saint-Exupery
street: Rue Saint-Malo 24
postal code and town 67544 Paris

Or even better - i have divided the postal code and town into two seperate columns!? Question: is this possible? Can (or should) i do this in calc (open document-formate)? Do i need to have to use a regex and perl or am i able to solve this issues without an regex?

Note - finally i need to transfer the data into MySQL-database…

I look forward to a tipp…

greetings

BTW: you can see all the things in a real world-live-demo: KM-Bayern - Suche in der bayerischen Schuldatenbank - see the filed

Schulname
Straße
please Ort

These field contains three things - the name, the street and the Postal Code and the town! Question: can this be divided into parts!? If you copy and paste the information - and drop it to calc then you get all the information in only one cell. How to divide and seperate all those information into three cells or even four?

BTW - i tried to translate the information to hex-code - see the follwoing…:

Staatl. Realschule Grafenau
Rachelweg 20
94481 Grafenau

00000000: 53 74 61 61 74 6C 2E 20  52 65 61 6C 73 63 68 75
00000010: 6C 65 20 47 72 61 66 65  6E 61 75 20 0A 52 61 63
00000020: 68 65 6C 77 65 67 20 32  30 0A 39 34 34 38 31 20
00000030: 20 47 72 61 66 65 6E 61  75 20 20

Well you see true Hex-code: but i do not know if this helps here!??

Can you help me to solve the problem. Do i need to have a regex!?
Or do i benefit from transfering this dataset via explode() or split ()-Function to an array!?

Well i am not sure - can you give me some starting points!?

That would be great!

i love to hear from you

greetings…
db1:)

Many thanks in advance for any and all help!

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Calc has some test editing functions that should eliminate the need to
export the data, though I’d be curious why in the world you had the data
in a spreadsheet in the first place. Anyway, if your data are in A1 then
put the following three formulas in B1, C1, D1, E1, and F1 respectively:

=FIND(CHAR(10),A1:A1)
=FIND(CHAR(10),A1,B1+1)
=MID(A1,1,B1)
=MID(A1,B1,C1-B1)
=MID(A1,C1,99)

Of course, you could combine these somewhat to use two fewer columns, but
this should give you an idea of what is really going on. Hint: 10 in
decimal = 0A in hexadecimal.

Good luck.

On 02/13/2011 11:36 AM, dilbertone wrote:
>
> good day dear community,
>
> well i am in big big trouble - i need some regex to solve a problem!
> Can you help me a bit! That would be great! Well - i mused alot how to
> call the subject: Finally i came to: “Regex or explode to array: I need
> some help in a simple string!”
>
> i have a spreadsheed in calc. with some records. There is a column that
> contains the following information
>
> Ecole Saint-Exupery
> Rue Saint-Malo 24
> 67544 Paris
>
> Well i need to have those lines divided into at least three columns
>
> name: Ecole Saint-Exupery
> street: Rue Saint-Malo 24
> postal code and town 67544 Paris
>
> Or even better - i have divided the postal code and town into two
> seperate columns!? Question: is this possible? Can (or should) i do this
> in calc (open document-formate)? Do i need to have to use a regex and
> perl or am i able to solve this issues without an regex?
>
> Note - finally i need to transfer the data into MySQL-database…
>
> I look forward to a tipp…
>
> greetings
>
> BTW: you can see all the things in a real world-live-demo: ‘KM-Bayern -
> Suche in der bayerischen Schuldatenbank’
> (http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750) - see the
> filed
>
> Schulname
> Straße
> please Ort
>
> These field contains three things - the name, the street and the Postal
> Code and the town! Question: can this be divided into parts!? If you
> copy and paste the information - and drop it to calc then you get all
> the information in only one cell. How to divide and seperate all those
> information into three cells or even four?
>
> BTW - i tried to translate the information to hex-code - see the
> follwoing…:
>
> Staatl. Realschule Grafenau
> Rachelweg 20
> 94481 Grafenau
>
>
> PHP code:
> --------------------
> 00000000: 53 74 61 61 74 6C 2E 20 52 65 61 6C 73 63 68 75
> 00000010: 6C 65 20 47 72 61 66 65 6E 61 75 20 0A 52 61 63
> 00000020: 68 65 6C 77 65 67 20 32 30 0A 39 34 34 38 31 20
> 00000030: 20 47 72 61 66 65 6E 61 75 20 20
>
> --------------------
>
>
> Well you see true Hex-code: but i do not know if this helps here!??
>
> Can you help me to solve the problem. Do i need to have a regex!?
> Or do i benefit from transfering this dataset via explode() or split
> ()-Function to an array!?
>
> Well i am not sure - can you give me some starting points!?
>
> That would be great!
>
> i love to hear from you
>
> greetings…
> db1:)
>
>
> Many thanks in advance for any and all help!
>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJNWIaDAAoJEF+XTK08PnB5dwQP/12Y62xVR+wREStkyW/NnpVb
4+FfXyeZXJhF6ReEaiuR61wT/FnN/LOhkeU5LPIpGgqs8907G4hV0Q7fhkoVee7d
UHH6nAyqVgub3AjcZKw4mvJVvb0sc/E6PmuPnx5Q4TXRjbfJ6cGF0tj8825EQMAN
iSSx+f4eSe4UwlBoDDKO6UhsxXdnShLTRiUNcpfH3SJ89fm9sllQc2CeAnol0uqF
gAkOhtUgerNNmiFBRxkdYa7w3p2Id1/6LW+fF1PYued7134SMvWZHocgfQKgvtsi
XvMbA5Tg4LeWUTXNFilm9SFyso+zGtCt69mwiVx83g4mdeyBB8bPWb6iM1FB7c7d
cQ7adXlXYg8PNgx3ZNy1nh+Yaoyw3NzZaL9J1FLcjWgM8zJ69GxvVFGxE8HFEvn0
D2QMwv6RSd+ZGbsUDZnS2pLqOvAY4FgZAFX+K0nuG/c4mpa9Ic3p+vuHVytfuVmw
XXpfww+xbUs8M5CLwYWU0cfA7gACndilc/92lE5wIl6Igue6148kMgg/V4wpKzuF
py5wQqw2EuSyZ8odPfjzjo+KgUxqJQuWla7XrSm5DyM33Qrz10a71tQBixsUrZqt
hDm+ws/UkMSqBMEoN7uiaK5X6Qxbg2HQ9xuI40+4QFFRozcy9KNSDFUlk34j8JNz
hJdKEmpMUnBjBcEwwIrI
=Zy2V
-----END PGP SIGNATURE-----

Ab has a great idea.

Otherwise are you exporting the spreadsheet into comma separated formatted file?
If so could you post title and the first 4-8 lines from the CVS file?

hello dear ab - hello dear tararpharazon !

great to hear from you both!

ab i will try this out!


=FIND(CHAR(10),A1:A1)
=FIND(CHAR(10),A1,B1+1)
=MID(A1,1,B1)
=MID(A1,B1,C1-B1)
=MID(A1,C1,99)

that looks pretty impressive!

one question - # if i want to parse the data - or copy n paste the data from the site KM-Bayern - Suche in der bayerischen Schuldatenbank

how would you (or someone else) go round the calc-step Are you able to
put the data straigt into a mysql-database!? Well - that would go if we write a parser!

@tararpharazon: i love your idea - they sound great - i will try out this way too!

many many thanks -
i come back and report all my findings.

greetings
db 1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sure, you could, either with or without using a spreadsheet, pull out the
data (it’d take a couple minutes to do the same thing in Perl or some
other scripting language) and then it’d just be a matter of the SQL calls
to your favorite database to make the connection and insert the data. You
could use the spreadsheet to create the SQL and then copy/paste that into
a database client to insert the rows. The options are nearly endless and
depends mostly on your experience, if you are doing this one or over an
dover, your tools, etc.

Good luck.

On 02/15/2011 01:36 PM, dilbertone wrote:
>
> hello dear ab - hello dear tararpharazon !
>
> great to hear from you both!
>
>
> ab i will try this out!
>
>
> PHP code:
> --------------------
>
> =FIND(CHAR(10),A1:A1)
> =FIND(CHAR(10),A1,B1+1)
> =MID(A1,1,B1)
> =MID(A1,B1,C1-B1)
> =MID(A1,C1,99)
>
> --------------------
>
>
> that looks pretty impressive!
>
> one question - # if i want to parse the data - or copy n paste the data
> from the site ‘KM-Bayern - Suche in der bayerischen Schuldatenbank’
> (http://192.68.214.70/km/asps/schulsuche.asp?q=lindau&a=20)
>
> how would you (or someone else) go round the calc-step Are you able
> to
> put the data straigt into a mysql-database!? Well - that would go if
> we write a parser!
>
> @tararpharazon: i love your idea - they sound great - i will try out
> this way too!
>
> many many thanks -
> i come back and report all my findings.
>
> greetings
> db 1
>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJNWuePAAoJEF+XTK08PnB5U6cQAJP+ULBhoB+QQTijsFI7oeQZ
d500P/LHrA3ZkVNOOng0I9gYiyMakNdRkZZihuDzg+wY9i2uYXQZ0DOoKJeLCFbI
wxZkSajbie5NWx6XrP3C3AaSqNIcy5HoxhS6GmTxjrg3HcA7idfbxGAAYvHFGOvY
cVyYsq7Kj6eR054Sq8BSSXNXIPgOtO+XnwOqys3M6SRQHIkoaJTFvCT0xwzrwtZL
d9ZI8cl/LyhCy68fSJ4moE9lhaqREiS2tq+jLw4l6GvFBR2asZvdbaDOZcXjKh3E
7sjQqGTv6l5e5yF/KtWuHDZdsBNKtvCL2/JMq1OKOgXTKv/3BGWMeNGKPVoVH1qG
IH21QyiT3MRy5sYnVrewS7LNlz1ZBqJgapIgbyjDjHyiYA5CQEr6spLYZ8kYFq8M
CtoOmg64KQKnltMWEYicIVYTfq1sx5v9tuTDkZTfTtG/YLJfxkAr+ufga1C5jp4b
6W4xSC0q27TnxYy7aCZ4OxMgAzsu0a3PKbhIGL1bUhsqCyj11mvwITSjb0zDbkI7
yYuBDZ2lYYMDvQ6aGJJ7kFRjc7riAdiQcICR1pbnl5eEPGY9x6GIlLwcrbEUql0K
cmB5JKHRtTw+DFdi6o3iIKgxB+LuxuqNRzOcvPufPKADSx90ABbIC05nWXzkhATP
gI1eL9ikhxGyWcpPqRIu
=CQ+0
-----END PGP SIGNATURE-----

hi you both #
had no luck with exporting in cvs neither nor with the above mentioned code



Calc has some test editing functions that should eliminate the need to
export the data, though I'd be curious why in the world you had the data
in a spreadsheet in the first place. Anyway, if your data are in A1 then
put the following three formulas in B1, C1, D1, E1, and F1 respectively:

=FIND(CHAR(10),A1:A1)
=FIND(CHAR(10),A1,B1+1)
=MID(A1,1,B1)
=MID(A1,B1,C1-B1)
=MID(A1,C1,99)

  

hmm - probably i did not try out all the options - but both ways seemed to bee be limited… Tomorrow i will try out more ways to do the thing.

i come back and report all my findings.

if there are more hints - regarding the handling of the data there KM-Bayern - Suche in der bayerischen Schuldatenbank

i would be lucky

Hi
I just used the following;


wget http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=100&s=1750

Then this perl script should get it like you want;


#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $html = ('schulsuche.asp?q=a');
my $te = HTML::TableExtract->new;

$te->parse_file($html);

my ($table) = $te->tables;

for my $row ( $table->rows ) {
cleanup(@$row);
print "@$row
";
}

sub cleanup {
for ( @_ ) {
s/\s+//;
s/\xa0 ]+\z//;
s/\s+/ /g;
}
}

You can then use LWP to grab the pages your wanting to parse.


Cheers Malcolm °¿° (Linux Counter #276890)
SUSE Linux Enterprise Desktop 11 (x86_64) Kernel 2.6.32.27-0.2-default
up 4 days 19:34, 3 users, load average: 0.04, 0.03, 0.05
GPU GeForce 8600 GTS Silent - Driver Version: 260.19.36

Hi
Added the get bit… :wink:


#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $te = HTML::TableExtract->new;

use LWP::Simple;

getstore('http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=100&s=2750', 'temp.html') or die 'Unable to get page';

$te->parse_file('temp.html');

my ($table) = $te->tables;

for my $row ( $table->rows ) {
   cleanup(@$row);
   print "@$row
";
}

sub cleanup {
    for ( @_ ) {
        s/\s+/ /g;
    }
}

hi

many thanks i try it out .- i come back
later today!

hello dear malcomlewis :wink:

many many thanks for the great idea! that rocks!! Really!

Again - many thanks for the hints!

Well lwp is just great: It has many many powerful options: one question regarding the spider-logic: see the site here:

KM-Bayern - Suche in der bayerischen Schuldatenbank

it shows following:

Treffer 1 - 20 von insgesamt 6150

that means: hits 1 to 20 from a total = 6150

Well here a question: how can i force the script to fetch all (!) the sites - beginning from the first page and ending at the last one!?

**see the range: **

KM-Bayern - Suche in der bayerischen Schuldatenbank → the first 20 resluts:
KM-Bayern - Suche in der bayerischen Schuldatenbank → the last 20 resluts: … out of 6140

how to call this in the get-argument - how to search the whole range?

from… ?q=e&a=20&s=0
to … ?q=e&a=20&s=6140

well this is a question of ***string concatenation: ***

in php we can do [in a simmilar task] something like the following:


$number_array = array ("123", "43567", "9287","3323");
for($i=0; $i<$count($number_array); $i ++) {

$new_url = $orig_url . $number_array$i];

/* do something with the new url */ 


how to do it with Perl in the above mentioned example?


from.... ?q=e&a=20&s=0 
to  ... ?q=e&a=20&s=6140 

btw - again see the full code:



#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $te = HTML::TableExtract->new;

use LWP::Simple;

getstore('http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=100&s=2750', 'temp.html') or die 'Unable to get page';

$te->parse_file('temp.html');

my ($table) = $te->tables;

for my $row ( $table->rows ) {
   cleanup(@$row);
   print "@$row
";
}

sub cleanup {
    for ( @_ ) {
        s/\s+/ /g;
    }
}


and here see some results:

lfd. Nr. Schul- nummer Schulname Stra�e PLZ Ort Telefon Fax Schulart Webseite
2751 8787 Mittelschule Lindau (Bodensee) - Aeschach� Anheggerstr. 18 88131� Lindau�Aeschach 08382/944555 08382/944554 Volksschulen www.hs-aeschach.de
2752 8789 Volksschule Lindau (Bodensee) - Hoyren�(Grundschule) Hoyerbergstr. 33 88131� Lindau� 08382/944581 08382/944582 Volksschulen Volksschule Lindau (B) - Hoyren (Grundschule)
2753 8790 Volksschule Lindau (Bodensee) - Reutin-Zech�(Grundschule) Schulstr. 23 88131� Lindau� 08382/975261 08382/975262 Volksschulen
2754 8791 Mittelschule Lindau (Bodensee) - Reutin� Schulstr. 23 88131� Lindau�Reutin 08382/975264 08382/975265 Volksschulen www.hs-reutin.de
2755 8799 Volksschule Lindau (Bodensee) - Oberreitnau�(Grundschule) Hepachstr. 9 88131� Lindau�Oberreitnau 08382/944591 08382/944592 Volksschulen

well - i have to get all the pages - therefore in need to create an approbiate spider-logic that contains a way of string concatenation - in order to fetch all the pages…

i look forward to a hint…

many many thanks in advance!

db1:)

Hi
Looking now, should have something ready later on this evening or tomorrow…stay tuned :slight_smile:

Hi
Here you go, I haven’t added any comments about what the bits are doing, but you should be able to work most of it out :wink:

Modify the $suchbegriffe variable to change the data, I’ve also set the records to 50 at a time which is the max allowed for the website.


#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "b";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;

&workDir();
chdir $processdir;
&processURL();
print "
Press <enter> to continue
";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...
";
print "Processed data saved to webdata_for_$suchbegriffe\_$displaydate.txt
";
unlink 'processing.html';
die "
";

sub processURL() {
print "
Processing $url_to_process$suchbegriffe&a=$treffer&s=$range
";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';

   while( <tempfile.html> ) {
      open( FH, "$_" ) or die;
      while( <FH> ) {
         if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
            $total_records = $6;
            print "Total records to process is $total_records
";
            }
         }
         close FH;
   }
   unlink 'tempfile.html';
}

sub processData() {
   while ( $range <= $total_records) {
      getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      for my $row ( $table->rows ) {
         cleanup(@$row);
         print OUTFILE "@$row
";
      }
      $| = 1;  
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
      $te = HTML::TableExtract->new;
   }
}

sub cleanup() {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
   mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
   }
}


hi malcomlewis

[only a short message - since i am in a hurry right now.!!]

i am overwhelmed - many many thanks to you. You are very very supportive…
!!!
thx :slight_smile: alot
db1

Hello Malcolmlewis

all runs like a charme _ and is great!

one little thing left : in the German language we have special characters which are not recognized correctly … see the following lines - out of a result:

lfd. Nr. Schul- nummer Schulname Stra�e PLZ Ort Telefon Fax Schulart Webseite
1 0401 M�dchenrealschule Marienburg,�Abenberg, der Di�zese Eichst�tt Marienburg 1 91183� Abenberg� 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg�(Grundschule) G�ss�belstr. 2 91183� Abenberg� 09178/215 09178/905060 Volksschulen News & E-Mail bei t-online.de | Politik, Sport, Unterhaltung & Ratgeber
3 6913 Mittelschule Abenberg� G�ss�belstr. 2 91183� Abenberg� 09178/215 09178/905060 Volksschulen News & E-Mail bei t-online.de | Politik, Sport, Unterhaltung & Ratgeber
4 0402 Johann-Turmair-Realschule�Staatliche Realschule Abensberg Stadionstra�e 46 93326� Abensberg� 09443/9143-0,12,13 09443/914330 Realschulen RS Abensberg - startseite
5 3041 Cabrini-Schule Offenstetten, Priv. F�rderzentrum�F�rderschwerp. geist.Entwickl. d. Kath.Jugendf�rs. Am Schmiedweiher 8 93326� Abensberg�Offenstetten 09443/9188-3 09443/918855 Volksschulen zur sonderp�dog. F�rderung Cabrinischule Offenstetten - Förderzentrum mit Förderschwerpunkt geistige Entwicklung
6 3074 Private Berufsschule zur sonderp�d. F�rderung,�F�rderschwerpunkt Lernen, Abensberg Regensburger Stra�e 60 93326� Abensberg� 09443/709191 09443/709193 Berufsschulen zur sonderp�dog. F�rderung Home

in the following lines i add the correct characters:

lfd. Nr. Schul- nummer Schulname **Straße **PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg

see some of the corrections in bold…

Well how can we rewrite the regex to go round the issue with the special characters…?

any hint on this here … !?
db1

Hi
Add the following lines after the use POSIX line;


use locale;
$| = 1;
setlocale POSIX::LC_COLLATE, "de_DE";

hello Malcomlewis

hello - many thanks for the quick answer!

i do so!!

thx alot! greetings db1

Hi
It looks like it only works once… :frowning: Need to investigate more. If the script errors, just delete those three lines.

I’ve never used a different language in perl :frowning:

hello malcolmlewis

many thanks in advance!

in investigated here…



#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

use locale;
$| = 1;
setlocale POSIX::LC_COLLATE, "de_DE";

my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;

Bareword found where operator expected at perl_bayern_new.pl line 10, near “setlocale POSIX::LC_COLLATE”
(Do you need to predeclare setlocale?)
syntax error at perl_bayern_new.pl line 10, near “setlocale POSIX::LC_COLLATE”
BEGIN not safe after errors–compilation aborted at perl_bayern_new.pl line 41.
martin@suse-linux:~/perl> perl perl_bayern_new.pl
Bareword found where operator expected at perl_bayern_new.pl line 10, near “setlocale POSIX::LC_COLLATE”
(Do you need to predeclare setlocale?)
syntax error at perl_bayern_new.pl line 10, near “setlocale POSIX::LC_COLLATE”
BEGIN not safe after errors–compilation aborted at perl_bayern_new.pl line 41.

i continue the investigations here …

greetings
db1

Hi
Strange thing is, it worked for me…but just once, and now continues to work (It may break after a logout/login).

Now there is this perl module;
Lingua::de::ASCII - search.cpan.org

I will package up as an rpm and try out.

Hi
Well I can’t for the life of me get it to ‘NOT’ work now, every time I run it converts the characters properly…