
Thread: from mechanize to csv-values with Text::CSV_XS

  1. #1

    from mechanize to csv-values with Text::CSV_XS

    Hello dear Perl fans,

    first of all, many thanks for the help. You have helped me a lot so far.

    I was running a script (see below) that gave back the following:


    linux-wyee:/home/martin/perl # perl kath_test_1.pl
    [see below]


    Loosdorf Ledochowskastra�e 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4


    linux-wyee:/home/martin/perl #
    The script below gives back a result like this one:
    Loosdorf
    Ledochowskastraße
    3382 Loostown
    Telefonnummer: 0002754 6257
    FAX-Nummer: 0002754 6257-4


    See more results:
    Marias Neustift Neustift 28 4443 Maria Neustift Telefonnummer: 007250/204 FAX-Nummer: 07250/204-4 E-Mail: prre.marianeustift@dioezese-linz.at
    Marias Puchheim Gmundner Stra�e 1b 4800 Attnang-Puchheim Telefonnummer: 007674/62334 FAX-Nummer: 07674/62334-4 E-Mail: prre.mariapuchheim@dioezese-linz.at
    Marias Scharten Scharten 1 4612 Scharten Telefonnummer: 007272/5210
    Marias Schmolln Maria Schmolln 2 5241 Maria Schmolln Telefonnummer: 007743/2209-12 FAX-Nummer: 07743/2209-17 E-Mail: prre.mariaschmolln@dioezese-linz.at
    Mattighofen R�merstra�e 12 5230 Mattighofen Telefonnummer: 007742/2273 0676/87765221 FAX-Nummer: 07742/2273-22 E-Mail: peipfarre.mattighofen@dioezese-linz.at
    Mauerkirchens Pfarrhofstra�e 4 5270 Mauerkirchen Telefonnummer: 007724/2262




    It does count up, and that is great!

    1. What I wanted is to force the script to run from 00000 to 10000.
    Note: the results should be stored in CSV format.

    For 1., I therefore changed $max_page_num to the maximum number and $page to the starting number. This still only prints the data to stdout (the console).




    Now I am trying to modify it. :-)

    I have to write the results out as CSV values.

    Usually this can be done with Text::CSV_XS (on which Class::CSV is based).
    Note: a friend also suggested using Text::CSV, which will load Text::CSV_XS if it is available.

    At the moment all the results are only printed to stdout (the console). I am sure that I can modify that. :-)


    I just installed Text::CSV_XS.
    I took it from here: Text::CSV_XS - search.cpan.org




    Now I am trying to figure out which attributes I should use.

    What do you suggest?
    How can I force the script to give back CSV?




    Here is the script, still without any CSV modules:

    kath_test_1.pl
    Code:
      #!/usr/bin/perl

      ## This is how I would go about doing what I understand you are trying to do
      ## EXAMPLE only

      use 5.014;
      use strict;
      use warnings;

      use WWW::Mechanize;
      use HTML::TokeParser;

      my $target_url   = 'http://katholisch.at/content/site/pfarrfinder/address/'; # base url
      my $start_page   = 4000;    # page start number
      my $format       = '.html'; # ending format
      my $max_page_num = 4100;    # max page number

      # one mech object for all requests; autocheck is off so that a
      # missing page does not abort the whole run
      my $mech = WWW::Mechanize->new( autocheck => 0 );
      $mech->agent_alias('Windows Mozilla');

      # loop through the pages, from the start number up to the max number
      for my $page ( $start_page .. $max_page_num ) {

          # this combines to make the url
          my $url = $target_url . $page . $format;

          # get the page and skip it if the request failed
          $mech->get($url);
          next unless $mech->success;

          # get the page content
          my $page_content = $mech->content();

          # filter the html
          my $html = HTML::TokeParser->new( \$page_content );

          # search and match: the text after each <strong> tag,
          # read up to the next <script> tag
          while ( my $tag = $html->get_tag('strong') ) {
              my $text = $html->get_trimmed_text('script');
              say $text;
          }
      }
    Question:

    How can I combine the Mechanize script with the part that takes care of the
    text-to-CSV transformation?
    dilbert ;-)
    Akoya P 6512 15" OpenSuse 13.1: AMD Athlon X2 P320, 2,10 GHz, 4 GB
    Samsunng q 210, 12,1" OpenSuse 13.1: Intel® Core™ 2 Duo Proz. P8400 2,26 GHz 1066 MHz FSB 3 MB
    Hewlett Packard Satelitte ATA TOSHIBA MK8026GA : 80.0 GB; 2 GB RAM: OS 13.1

  2. #2

    Re: from mechanize to csv-values with Text::CSV_XS

    By the way,

    there has to be some sanitizing as well: the ISO 8859 encoding has to be taken care of.


    Code:
      use Text::CSV::Encoded;
      my $csv = Text::CSV::Encoded->new ({
          encoding_in  => "iso-8859-1", # the encoding that comes into Perl
          encoding_out => "cp1252",     # the encoding that comes out of Perl
      });
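
    In case Text::CSV::Encoded is not at hand, the same kind of sanitizing can be sketched with the core Encode module. This is only a minimal sketch: it assumes the text arrives as raw ISO-8859-1 bytes, and the sample string is typed in by hand for illustration (0xDF is the "ß" in that encoding).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: raw ISO-8859-1 bytes as they might arrive from the server.
# \xDF is the sharp s in that encoding.
my $latin1_bytes = "Ledochowskastra\xDFe 4";

# Turn the bytes into a Perl character string.
my $street = decode('ISO-8859-1', $latin1_bytes);

# Either encode explicitly to UTF-8 bytes ...
my $utf8_bytes = encode('UTF-8', $street);

# ... or let an output layer do the encoding on every print.
binmode STDOUT, ':encoding(UTF-8)';
print $street, "\n";
```

    If the text is already a decoded character string, the decode step should be skipped and only the output layer is needed.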
    dilbert ;-)

  3. #3

    Re: from mechanize to csv-values with Text::CSV_XS

    Some more insights and ideas.

    We have the following options here:

    To print to a file instead of to the screen, we just have to change:

    [highlight="Perl"]say $text;[/highlight]

    to:

    [highlight="Perl"]print $OUT_FILE $text;[/highlight]

    Explanation: $OUT_FILE is a filehandle for the output file, which we have to open before entering the for loop.
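
    To make the filehandle part concrete, here is a minimal sketch using only core Perl. The file name addresses.txt is made up for this example, and the @texts list merely stands in for the $text values that the real loop produces.

```perl
use strict;
use warnings;

# Hypothetical output file name, chosen for this sketch only.
my $out_name = 'addresses.txt';

# Sample lines standing in for the $text values produced inside the loop.
my @texts = (
    'Marias Scharten Scharten 1 4612 Scharten Telefonnummer: 007272/5210',
    'Marias Neustift Neustift 28 4443 Maria Neustift Telefonnummer: 007250/204',
);

# Open the file once, before the loop; die with the OS error if that fails.
open my $OUT_FILE, '>', $out_name
    or die "Cannot open $out_name: $!";

for my $text (@texts) {
    # print does not add a newline the way say does, so append one.
    print {$OUT_FILE} $text, "\n";
}

close $OUT_FILE or die "Cannot close $out_name: $!";
```

    Note that print, unlike say, does not add the line ending by itself.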

    This would work for the code as it is, but it might be different if we use the Text::CSV module, which probably has dedicated functions or methods for printing CSV lines to a file. (To be frank, I don't use this module and don't know it, although I should probably change that, because I use CSV files from time to time.)
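
    As far as I can tell from its documentation, Text::CSV_XS does have such a method: print takes a filehandle and an array reference and writes one CSV row. A minimal sketch follows; the column names and the file name parishes.csv are my own assumptions, and the data row is the St. Stephan record typed in by hand, not scraped.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical output file name for this sketch.
my $csv_name = 'parishes.csv';

# binary => 1 allows non-ASCII bytes in fields;
# eol => "\n" makes print() terminate every row.
my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV_XS: " . Text::CSV_XS->error_diag;

# Column names are assumptions; the second row is typed in by hand.
my @rows = (
    [ 'name', 'street', 'city', 'phone', 'fax', 'email' ],
    [ 'Dom- und Metropolitanpfarre St. Stephan', 'Stephansplatz 3', '1010 Wien',
      '515 52-3530', '515 52-3720', 'dompfarre-st.stephan@edw.or.at' ],
);

open my $fh, '>', $csv_name or die "Cannot open $csv_name: $!";

# One call per record; quoting and escaping are handled by the module.
$csv->print($fh, $_) for @rows;

close $fh or die "Cannot close $csv_name: $!";
```

    So the comma separates the fields of one address, and each address becomes one line (record) of the file.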

    Let me try to describe in more detail what the output file should look like. Do I want the comma to separate the fields of the addresses, or the records?


    If we take this for example: katholisch.at

    we have the following dataset:

    Dom- und Metropolitanpfarre St. Stephan
    Stephansplatz 3
    1010 Wien
    Telefonnummer: 515 52-3530
    FAX-Nummer: 515 52-3720
    E-Mail: dompfarre-st.stephan@edw.or.at
    Web: Domkirche St. Stephan - Der Wiener Stephansdom
    I want each dataset separated into these parts. In other words,
    if I get a result that delimits and separates the fields of a record that is currently given like this

    Loosdorf Ledochowskastra�e 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4

    I would be very, very happy. Note: there is also an encoding issue. See the Ledochowskastra�e: there is an "ß" in it, so we have to take care of the
    ISO 8859 encoding, don't we?
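
    For the splitting itself, here is a rough sketch of what I have in mind. It only cuts a flattened line at the labels that show up in the sample output (Telefonnummer, FAX-Nummer, E-Mail, Web) and picks a four-digit postcode plus town from the end of the address part. The function name split_record and the hash keys are made up for this example, and separating the parish name from the street is not attempted, because the flattened line gives no reliable boundary between them.

```perl
use strict;
use warnings;

# Labels seen in the sample output; extend the list if other labels turn up.
my $label = qr/(?:Telefonnummer|FAX-Nummer|E-Mail|Web)/;

# Hypothetical helper: split one flattened record into a hash of fields.
sub split_record {
    my ($line) = @_;
    my %rec;

    # Everything before the first label is the name-and-address part.
    my ($address, $rest) = $line =~ /\A(.*?)\s*($label:.*)?\z/s;
    $rec{address} = $address;

    # Assumption: a four-digit postcode followed by the town ends the address.
    if ($address =~ /\b(\d{4})\s+(\D+)\z/) {
        @rec{qw(postcode town)} = ($1, $2);
    }

    # Each label's value runs up to the next label or the end of the line.
    $rest //= '';
    while ($rest =~ /($label):\s*(.*?)\s*(?=$label:|\z)/gs) {
        $rec{$1} = $2;
    }
    return \%rec;
}

my $rec = split_record(
    'Marias Neustift Neustift 28 4443 Maria Neustift Telefonnummer: 007250/204 '
  . 'FAX-Nummer: 07250/204-4 E-Mail: prre.marianeustift@dioezese-linz.at'
);

# Show the fields that were found, one per line.
print "$_ = $rec->{$_}\n" for sort keys %$rec;
```

    The values in the hash could then be handed to a CSV writer in a fixed column order.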

    I would love it if you could give some hints and a helping hand. That would be very supportive.
    Note: this is a great chance for me to learn a lot about Perl and about the options and power of Mechanize.

    I look forward to hearing from you.

    Many many greetings
    dilbert ;-)
