Running XPather against HTML: finding and identifying the nodes

I have a parsing job to do and need good tools for it.

I like the idea of using HTML::TokeParser::Simple and DBI to do the parsing and to store the results.

I have a little experience with HTML::TokeParser::Simple, but this task goes over my head. Note: I have also had a look at the XPath approach, which seems to be an appropriate way too. But at the moment I have trouble working out the corresponding XPath expressions that need to be filled into the Perl program.

This is what I have:

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

# use the real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

# Print each field on its own line; skip any field whose XPath matched nothing.
for my $node ($name, $type, $adress, $adress_two, $telephone, $fax, $internet,
              $officer, $employees, $offices, $worker, $country, $the_council) {
    print $node->as_text, "\n" if defined $node;
}

Question: is this all right? By the way, here is one of the example sites:

KULTUSPORTAL-BW.DE - Schuladressen

In the grey-shaded block you can see the wanted information: 17 lines. Note that I have 5000 different HTML files, all structured in the very same way!
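Because all the files share one structure, it may be easier to read every label/value row of the block than to write one XPath per field. Here is a minimal sketch of that idea with HTML::TreeBuilder::XPath. The sample markup, the two-cell row layout and the helper name extract_fields are assumptions for illustration, not taken from the real pages.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample standing in for one of the 5000 files.
my $html = <<'HTML';
<html><body><table>
<tr><td>Name:</td><td>Example School</td></tr>
<tr><td>Telefon:</td><td>0123 456789</td></tr>
</table></body></html>
HTML

# Return a hash ref of label => value for every row with at least two cells.
sub extract_fields {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($content);

    my %field;
    for my $row ( $tree->findnodes('//tr[count(td) >= 2]') ) {
        # First cell is taken as the label, second as the value.
        my ($label, $value) = map { $_->as_trimmed_text } $row->findnodes('./td');
        $label =~ s/:\s*\z//;    # drop the trailing colon from the label
        $field{$label} = $value;
    }
    $tree->delete;               # free the tree's memory
    return \%field;
}

my $fields = extract_fields($html);
print "$_ = $fields->{$_}\n" for sort keys %$fields;
```

For the real files the same function could be called once per file, with the XPath narrowed to the grey block once its markup is known.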

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.
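For the storage half of such a template, a rough sketch with DBI could look like the one below. It assumes SQLite through DBD::SQLite and an in-memory database so that it runs on its own; the table name school_field, the file name and the helper store_fields are made up for illustration, and the parsed fields are hard-coded instead of coming from a parser.

```perl
use strict;
use warnings;
use DBI;

# Assumption: SQLite via DBD::SQLite; swap the DSN for the real database.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One row per label/value pair, tagged with the file it came from.
$dbh->do(<<'SQL');
CREATE TABLE school_field (
    source_file TEXT NOT NULL,
    label       TEXT NOT NULL,
    value       TEXT
)
SQL

my $insert = $dbh->prepare(
    'INSERT INTO school_field (source_file, label, value) VALUES (?, ?, ?)');

# Store all fields of one parsed file.
sub store_fields {
    my ($file, $fields) = @_;
    $insert->execute($file, $_, $fields->{$_}) for sort keys %$fields;
    return;
}

# Hypothetical result of parsing one file.
store_fields('school_0001.html',
             { Name => 'Example School', Fax => '0123 456780' });

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM school_field');
print "stored $count rows\n";
```

Placeholders in the prepared statement keep quoting problems away from the data, which matters with 5000 files of free text.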

That would be great!!

I would love to hear from you.

dilbert:)

Oh, sorry, I have to correct the paths. They are wrong!

Sorry! I should have tested the code before posting it here. I will try to correct the paths…

dilbert

Hello dear folks, good evening!

This is solved! I have a solution with HTML::TableExtract.

I also read the documentation for HTML::TableExtract, which might help here. HTML::TableExtract extracts specific tables from HTML source code, and it does that really well.

By the way, I want to do this with a table on this site: SCHULE SUCHEN EINGANG

Note: tick all the checkboxes at the bottom of the page. You then get a result page with more than 6400 schools. On the right of that page, click "Weitere Informationen anzeigen" ("show further information") to get the detailed information for a school:

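9 (or 10) lines:
Schuldaten (school data)
Schulnummer: (school number)
Amtliche Bezeichnung: (official name)
Strasse: (street)
Plz und Ort: (postal code and town)
Telefon: (telephone)
Fax:
E-Mail-Adresse: (e-mail address)
Schuldaten ändern (change school data; is this UTF-8 encoded or what?)
Schülergesamtzahl (total number of pupils; is this UTF-8 encoded or what?)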

Question: can HTML::TableExtract be applied here, to the result page with more than 6400 schools (see above)?
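As a rough sketch of what applying HTML::TableExtract to one detail page could look like: the markup below is a made-up stand-in for the real page, and the helper name table_to_hash is hypothetical. With no constraints passed to the constructor, every table is extracted and the first one is used.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample of a detail page; the real markup may differ.
my $html = <<'HTML';
<html><body><table>
<tr><td>Schulnummer:</td><td>12345</td></tr>
<tr><td>Telefon:</td><td>0123 456789</td></tr>
</table></body></html>
HTML

# Turn the rows of the first table into a hash ref of label => value.
sub table_to_hash {
    my ($content) = @_;
    my $te = HTML::TableExtract->new;    # no constraints: all tables match
    $te->parse($content);
    $te->eof;

    my ($table) = $te->tables or return {};
    my %field;
    for my $row ( $table->rows ) {
        # Empty cells come back undefined, so default them to ''.
        my ($label, $value) = map { defined $_ ? $_ : '' } @$row[0, 1];
        for ($label, $value) { s/\A\s+//; s/\s+\z//; }
        $label =~ s/:\z//;               # drop the trailing colon
        next unless length $label;
        $field{$label} = $value;
    }
    return \%field;
}

my $school = table_to_hash($html);
print "$_ = $school->{$_}\n" for sort keys %$school;
```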

Love to hear from you

Here is what I have so far:

I make use of HTML::TableExtract:


#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new( attribs => {
    # attributes (name => 'value' pairs) identifying the wanted table go here
});

$te->parse_file('myFile.html');
my ($table) = $te->tables;

for my $row ( $table->rows ) {
    cleanup(@$row);
    print "@$row\n";
}

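# Normalise whitespace in place; @_ aliases the caller's cells.
sub cleanup {
    for ( @_ ) {
        next unless defined;    # empty cells come back as undef
        s/\A\s+//;              # strip leading whitespace
        s/[\xa0 ]+\z//;         # strip trailing spaces and non-breaking spaces
        s/\s+/ /g;              # collapse inner whitespace runs
    }
}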


I need some help with the attributes!
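To make clear what I mean by attributes: the attribs option takes name => 'value' pairs that the wanted table tag must carry. A minimal sketch, assuming made-up markup in which the data table has class="data" (the real class or other attributes have to be read from the page source):

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical page with two tables; only the second carries the data.
my $html = <<'HTML';
<html><body>
<table class="nav"><tr><td>Home</td></tr></table>
<table class="data" border="1">
<tr><td>Telefon:</td><td>0123 456789</td></tr>
</table>
</body></html>
HTML

# Only tables whose attributes match all given pairs are extracted.
my $te = HTML::TableExtract->new( attribs => { class => 'data' } );
$te->parse($html);
$te->eof;

my @tables = $te->tables;
my @rows   = map { $_->rows } @tables;

print scalar(@tables), " matching table(s)\n";
print join(' | ', map { defined $_ ? $_ : '' } @$_), "\n" for @rows;
```

If the wanted table has no usable attributes, the depth and count options of HTML::TableExtract select a table by position instead.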

Any and all help will greatly be appreciated.

regards
dilbert
