Transform a string to an array: with regex, split or explode: I need some help in a simple string!

Hi
Looking this afternoon at it for you :slight_smile:

Hi Malcom,

many thanks for your answer!

this is very very great. It is a great pleasure for me to be here… Many many thanks for all you did!! Just overwhelming!!!

I look forward
Greetings :slight_smile: Dilbertone

Hi
I’ve cleaned it up a little;


#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

my $total_records = 0;
my $alpha = "x";
my $results = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $percent = 0;

workDir();
chdir $processdir;
processURL();
print "
Press <enter> to continue
";
<>;
my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
processData();
close $outfile;
print "Finished processing $total_records records...
";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt
";
unlink 'processing.html';

sub processURL() {
print "
Processing $url_to_process$alpha&a=$results&s=$range
";
getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';

   while( <tempfile.html> ) {
      open( FH, "$_" ) or die;
      while( <FH> ) {
         if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
            $total_records = $6;
            print "Total records to process is $total_records
";
            }
         }
         close FH;
   }
   unlink 'tempfile.html';
}

sub processData() {
   while ( $range <= $total_records) {
      my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
      getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      foreach my $ts ($te->table_states) {
         foreach my $row ($ts->rows) {
            cleanup(@$row);
	    # Add a table column delimiter in this case ||
            print $outfile join("||", @$row)."
";
            }
         }
      $| = 1;  
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
   }
}

sub cleanup() {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
   mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
   }
}

So you get output like;


1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule|| 
2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)|| 
3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 
4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de

With the || being the delimiter.

So the next step you want is into say a spreadsheet?

Hi
OK, for this to work you need to install perl-Spreadsheet-WriteExcel from here;
software.opensuse.org: Search Results

Just download the rpm and install manually;


sudo zypper in <name_of_rpm>

This will create a new xls file as well as a delimited file.


#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
use Spreadsheet::WriteExcel;

my $total_records = 0;
my $alpha = "x";
my $results = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;

workDir();
chdir $processdir;
processURL();
print "
Press <enter> to continue
";
<>;
my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
my $workbook = Spreadsheet::WriteExcel->new("webdata_for_$alpha\_$displaydate.xls");
my $worksheet = $workbook->add_worksheet();
open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
processData();
close $outfile;
print "Finished processing $total_records records...
";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt
";
unlink 'processing.html';

sub processURL() {
print "
Processing $url_to_process$alpha&a=$results&s=$range
";
getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';

   while( <tempfile.html> ) {
      open( FH, "$_" ) or die;
      while( <FH> ) {
         if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
            $total_records = $6;
            print "Total records to process is $total_records
";
            }
         }
         close FH;
   }
   unlink 'tempfile.html';
}

sub processData() {
my $dataloop = 0;
   while ( $range <= $total_records) {
      my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
      getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      foreach my $ts ($te->table_states) {
          foreach my $row ($ts->rows) {
            cleanup(@$row);
	    $worksheet->write($dataloop,0, $row, @{$row});
	    # Add a table column delimiter in this case ||
            print $outfile join("||", @$row)."
";
	    $dataloop = $dataloop + 1;
            }
         }
      $| = 1;  
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
   }
}

sub cleanup() {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
   mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
   }
}

Hi Malcom
many thanks - i will do this later tonight. I will add the perl-module tonight and will run the programme.

Many thankx
dilbertone :slight_smile:

not runned the perl-Spreadsheet-module yet.

are we able to separate postal code from town…!?

1||9752||Deutsche Schule Alamogordo USA Alamogorde - New Mexico || ||Deutschsprachige Auslandsschule||
2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET Alexandrien - Ägypten || ||Begegnungsschule (Auslandsschuldienst)||
3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503 Altötting ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de
4||0006||Max-Reger-Gymnasium Amberg Kaiser-Wilhelm-Ring 7 92224 Amberg ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de

similar like this:

my @cols = qw(
rownum
number
name
phone
type
website
);

my @fields = qw(
rownum
number
name
street
postal
town
phone
fax

guess so - guess that the perl - to - spreadsheet mod will do this either…

try it out later the day

hello dear Malcome,

this is great! I love it. One very very last question: how can i have a few more fileds (accordingliy to the tagtet site:
KM-Bayern - Suche in der bayerischen Schuldatenbank )

how to get some more fields - that are separated… note - especially the adress is important to have separated

    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website

see here the code:


#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

my $total_records = 0;
my $alpha = "x";
my $results = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $percent = 0;

workDir();
chdir $processdir;
processURL();
print "
Press <enter> to continue
";
<>;
my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
processData();
close $outfile;
print "Finished processing $total_records records...
";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt
";
unlink 'processing.html';

sub processURL() {
print "
Processing $url_to_process$alpha&a=$results&s=$range
";
getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';

   while( <tempfile.html> ) {
      open( FH, "$_" ) or die;
      while( <FH> ) {
         if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
            $total_records = $6;
            print "Total records to process is $total_records
";
            }
         }
         close FH;
   }
   unlink 'tempfile.html';
}

sub processData() {
   while ( $range <= $total_records) {
      my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
      getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      foreach my $ts ($te->table_states) {
         foreach my $row ($ts->rows) {
            cleanup(@$row);
	    # Add a table column delimiter in this case ||
            print $outfile join("||", @$row)."
";
            }
         }
      $| = 1;  
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
   }
}

sub cleanup() {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
   mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
   }
}



output:


1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule|| 
2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)|| 
3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 
4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de


With the || being the delimiter.

My problem is: i need to have more fields - i need to have the following divided:

well - how to add more fields?
This obviously has to be done in this line here, doesn t it!?

my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);

But how. I tried out several things - but i dont helped. I allways got bad results.

love to hear from you

best regards
martin aka db1

Hi
Can you try the xls ones as well? Here is an updated version to name the sheet;


#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
use Spreadsheet::WriteExcel;

my $total_records = 0;
my $alpha = "x";
my $results = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;

workDir();
chdir $processdir;
processURL();
print "
Press <enter> to continue
";
<>;
my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
my $workbook = Spreadsheet::WriteExcel->new("webdata_for_$alpha\_$displaydate.xls");
my $sheetname = "webdata_for_$alpha";
my $worksheet = $workbook->add_worksheet($sheetname);
open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
processData();
close $outfile;
print "Finished processing $total_records records...
";
print "
Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt and $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.xls
";
unlink 'processing.html';

sub processURL() {
print "
Processing $url_to_process$alpha&a=$results&s=$range
";
getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';

   while( <tempfile.html> ) {
      open( FH, "$_" ) or die;
      while( <FH> ) {
         if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
            $total_records = $6;
            print "
Total records to process is $total_records
";
            }
         }
         close FH;
   }
   unlink 'tempfile.html';
}

sub processData() {
my $dataloop = 0;
   while ( $range <= $total_records) {
      my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
      getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      foreach my $ts ($te->table_states) {
          foreach my $row ($ts->rows) {
            cleanup(@$row);
	    $worksheet->write($dataloop,0, $row, @{$row});
	    # Add a table column delimiter in this case ||
            print $outfile join("||", @$row)."
";
	    $dataloop = $dataloop + 1;
            }
         }
      $| = 1;  
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
   }
}

sub cleanup() {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
   mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
   }
}

I will look at the next bit soon :slight_smile:

Hello dear Malcom,

[just a quick note - i am not at home - i am downtown in a internet-cafe]

…many many thanks for the quick reply. Great to hear from you! I will run the xls-version at friday evening. Then i have much more time! I can do tests all the weekend.

Meanwhile many many thanks for all you did! You are great!

I come back on friday…

have a great time!!
Many many regards
dilbertone :slight_smile:

Hello dear Malcom,

i have downloaded the Spreadsheet::WriteExcel module - your script runs very very nicely!

Can you help me with the last bit: the separation of the data chunk (/name and adress)

My problem is: i need to have more fields - i need to have the following divided:
see the target-url: KM-Bayern - Suche in der bayerischen Schuldatenbank

here we have for eg. the** following record: **

1751 2736 Volksschule Gräfelfing (Grundschule) Schulstraße 2 82166 Gräfelfing 089/8543740 089/8549222 Volksschulen Grundschule Graefelfing

Well in this record i nave to have the data separated

the name: Volksschule Gräfelfing (Grundschule)
the adress: (street): Schulstraße 2
postal-code: 82166
**town: ** Gräfelfing
… and the rest… - but this is separated allready. All runs very very nice - we have only one bit left…!

Question: can this be done with some Perl-code!?
in other words: can we have more fields?

That would be great!! I have tried to figure it out how to do - but i was not able to solve it!?

love to hear from you

best regards
martin aka db1 :wink:

Hi
Can you confirm the postal code is always 5 digits long? Does it always
start with an 8?


Cheers Malcolm °¿° (Linux Counter #276890)
SUSE Linux Enterprise Desktop 11 (x86_64) Kernel 2.6.32.27-0.2-default
up 5 days 14:56, 6 users, load average: 0.05, 0.13, 0.22
GPU GeForce 8600 GTS Silent - Driver Version: 260.19.26

Hello Malcom :slight_smile:

thx for answering:

Hi. Can you confirm the postal code is always 5 digits long? Does it always
start with an 8?

The postal-code is allways 5 digits long. All German postal-codes does have 5 digits.

But the postal-codes of the bavarian-schools (and towns of course) does not start allways with an 8

Sometimes it starts with 9

see the following:

**Example1: **
Einhard-Mittelschule Euerdorf
Am Heiligenberg 1
97717 Euerdorf

**Example2: **
Kronberg-Gymnasium Aschaffenburg
Fasaneriestr. 33
63739 Aschaffenburg

but many many have 8 - in fact the most! ( Especially those are round Munich … )

So i guess that - in a statistical regard - we have the most postal codes beginning with

8
9
some have 6

guess that the do not beginn with any other number - but this is not tested yet!

i will have a closer look at this and respond later this day…

many many greetings from South Germany

dilbertone:)

Hi Malcom,

good day -

any progress in the issue!?

Look forward to hear from you…

all the best to you

db 1

Hi
Here you go…


#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
use Spreadsheet::WriteExcel;

# Global variables
my $total_records = 0;
my $alpha = "b";
my $results = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $dataloop = 0;
my $row;
my $row_idx = 0;
my $ifd;
my $school_number;
my $address;
my $postal_code;
my $city;
my $telephone_fax;
my $school;
my $website;
my $ws_data;
my $format;
my $new_address;

# Main program
workDir();
chdir $processdir;
processURL();
print "
Press <enter> to continue
";
<>;
my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
my $workbook = Spreadsheet::WriteExcel->new("webdata_for_$alpha\_$displaydate.xls");
my $sheetname = "webdata_for_$alpha";
my $worksheet = $workbook->add_worksheet($sheetname);
# Add some formatting
$worksheet->set_column('A:B', 6);
$worksheet->set_column('C:C', 100);
$worksheet->set_column('D:D', 6);
$worksheet->set_column('E:E', 20);
$worksheet->set_column('F:F', 35);
$worksheet->set_column('G:H', 45);
$format = $workbook->add_format();
$format->set_align('left');
open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
my $start = time();
processData();
my $finish = time();
close $outfile;
print "Finished processing $total_records records in ", ($finish - $start)," seconds...
";
print "
Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt and $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.xls
";
unlink 'processing.html';
#====================

sub processURL() {
print "
Processing $url_to_process$alpha&a=$results&s=$range
";
getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';

   while( <tempfile.html> ) {
      open( FH, "$_" ) or die;
      while( <FH> ) {
         if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
            $total_records = $6;
            print "
Total records to process is $total_records
";
            }
         }
         close FH;
   }
   unlink 'tempfile.html';
}

sub processData() {
$dataloop = 0;
   while ( $range <= $total_records) {
      my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
      getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      $row_idx = 0;
      foreach my $ts ($te->tables) {
          foreach my $row ($ts->rows) {
            cleanup(@$row);
	    $ifd = $ts->cell($row_idx,0);
            cleanup($ifd);
            $school_number = $ts->cell($row_idx,1);
            cleanup($school_number);
            $address = $ts->cell($row_idx,2);
            cleanup($address);
	    splitData();
            $telephone_fax = $ts->cell($row_idx,3);
            cleanup($telephone_fax);
            $school = $ts->cell($row_idx,4);
            cleanup($school);
            $website = $ts->cell($row_idx,5);
            cleanup($website);
	    my @ws_data = ($ifd, $school_number, $new_address, $postal_code, $city, $telephone_fax, $school, $website);
	    for (my $count = 0; $count < 8; $count++) {
 	       $worksheet->write($dataloop, $count, @ws_data$count], $format);
	    }
	    # Add a table column delimiter in this case ||
	    print $outfile join("||", $ifd, $school_number, $new_address, $postal_code, $city, $telephone_fax, $school, $website), "
";
	    $dataloop++;
	    $row_idx++;
            }
         }
      $| = 1;  
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
   }
}

sub cleanup() {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
   mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
   }
}
# Split address field into 3 variables
sub splitData() {
$address =~ /^(.*)(\d{5})(.*)/;
if ($2 ne "" or ( $2 and $3 ne "")) {
   $new_address =  $1;
   $postal_code = $2;
   $city = $3;
   }
   else {
   $new_address = $address;
   }
}

Hello Malcom, :slight_smile:

just a quick answer - since i am not at home at the moment. I am downtown and writing this from a
terminal at the library ]

many many thanks for all - that looks fantastic.

I am overwhelmed. Later tonight (when i am at home) i will run the script.

As allways - you rock !! I am happy to be at such a great place, you are so supportive and generous!
Thx for helping me so much!!

Have a great day!!
all the best to you
dilbertone :wink:

Hi
Your welcome. I do note we may still have to work on the street part, but see how that goes.

hello Malcom

i runned the script - all is very very nice! I am overwhelmed by your great work!

it goes very very well. - The street-part is nice! I think that we can leave this as is at the moment…!

Again many many thanks!!!

you rock!
greetings dilbertone

Hi
Great news :slight_smile: You can add to the format part for lots of spreadsheet features and probably drop the text file unless your debugging.

Hi

many thanks -

it is a very very great script. And i am diggign deeper into it at the weekend. It has 144 lines of great code. i will study these lines - (line per line)

Thx for your endless work! It rocks!!!
greetings from good old Europe :slight_smile:
Dilbertone