Hello community!
I am getting started with Perl programming and want to learn by doing.
I am currently working on a small solution. I have tried various tutorials (examples of WWW::Mechanize that I found on CPAN); not all of them work, and some of them are broken.
Now I am trying to tackle a real-world task.
Especially interesting for me as a PHP/Perl-beginner is this site in Switzerland:
see this link - click on it and see more
The site has a dataset of 2700 foundations. All the data is free to use, with no copyright limitations on it.
**What we have so far:** The harvesting task should be no problem if I use WWW::Mechanize, particularly for doing the form-based search and selecting the individual entries.
I guess the algorithm would basically be two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.
The outer loop would use the select() and submit_form() functions on the second search form on the page. Can we use DOM processing here?
And how can we get the selection values?
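To make the question about the selection values concrete, here is a minimal sketch of what I have in mind. It uses HTML::Form (the module WWW::Mechanize uses for its forms) on a made-up form; the field name `canton`, the option values, and the example.com URL are assumptions of mine, not taken from the real site.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for the second search form on the page.
my $html = <<'HTML';
<form action="/search" method="get">
  <select name="canton">
    <option value="">all</option>
    <option value="ZH">Zurich</option>
    <option value="BE">Bern</option>
  </select>
  <input type="submit" value="Search">
</form>
HTML

# In list context parse() returns every form found in the document.
my @forms  = HTML::Form->parse( $html, 'http://example.com/' );
my $select = $forms[0]->find_input('canton')
    or die "select field not found\n";

# possible_values() lists the option values, value_names() their labels.
my @values = $select->possible_values;
my @labels = $select->value_names;

print join( ', ', map { "'$_'" } @values ), "\n";
```

With a live page the form object would presumably come from `$mech->form_number(2)` instead of `HTML::Form->parse`, and the rest should stay the same.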
The inner loop through the results would use the follow_link() function to get to the actual entries, using the following call:
$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/evs2000.*\?Id=\d+$/, n => $result_nbr);
This would forward our mechanized browser to the entry page. Basically, the URL regex looks for links that match the webgrab_path ... Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which of the matching results it should follow next.
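To check my understanding of the pattern, here is a small offline test of the same regex against a few URLs. The URLs are invented for illustration; only the `webgrab_path=http://evs2000 ... ?Id=<digits>` shape comes from the call above.

```perl
use strict;
use warnings;

# Same pattern as in the follow_link() call, written with qr{} so the
# slashes need no escaping.
my $entry_re = qr{webgrab_path=http://evs2000.*\?Id=\d+$};

# Made-up URLs, only to show what the pattern accepts and rejects.
my @urls = (
    'http://example.org/show?webgrab_path=http://evs2000.example.org/detail.asp?Id=4711',
    'http://example.org/show?webgrab_path=http://evs2000.example.org/list.asp?page=2',
    'http://example.org/about.html',
);

# Keep only the URLs that look like links to a database entry.
my @entry_links = grep { $_ =~ $entry_re } @urls;
print "$_\n" for @entry_links;
```

Only the first URL ends in `?Id=` followed by digits, so it is the only one kept. The `n => $result_nbr` argument then picks the n-th link among those that match.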
If there are several result pages, we could use the same trick to traverse them.
For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it gives us powerful DOM selection methods using XPath.
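Here is a rough sketch of how I imagine the extraction step, assuming an entry page lays out its data as a table of label/value rows. The markup, the `entry` class, and the field names are hypothetical; the real page will need different XPath expressions.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical entry page; the real markup will differ.
my $html = <<'HTML';
<html><body>
<table class="entry">
  <tr><th>Name</th><td>Example Foundation</td></tr>
  <tr><th>Town</th><td>Bern</td></tr>
</table>
</body></html>
HTML

# recover => 1 lets the parser cope with sloppy real-world HTML.
my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Turn each table row into a label => value pair.
my %entry;
for my $row ( $dom->findnodes('//table[@class="entry"]//tr') ) {
    my $label = $row->findvalue('normalize-space(th)');
    my $value = $row->findvalue('normalize-space(td)');
    $entry{$label} = $value if length $label;
}

print "$_: $entry{$_}\n" for sort keys %entry;
```

In the real script the string passed to `load_html` would be `$mech->content` of the entry page rather than a heredoc.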
The actual looping through the pages should be doable in a few lines of Perl: 20 lines at most, likely fewer.
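This is how I picture the two nested loops, as an untested sketch. The field name `category` and the start URL passed on the command line are placeholders; I also assume that the search form is the second form on the page and that every option value can be submitted as it is. Paging through several result pages is not handled here.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $FORM_NUMBER = 2;            # the second search form on the page
my $SELECT_NAME = 'category';   # hypothetical name of the <select> field
my $ENTRY_RE    = qr{webgrab_path=http://evs2000.*\?Id=\d+$};

sub harvest {
    my ($start_url) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($start_url);

    # Read the option values once, before navigating away from the form.
    my $form   = $mech->form_number($FORM_NUMBER);
    my $select = $form->find_input($SELECT_NAME)
        or die "no field named '$SELECT_NAME' in form $FORM_NUMBER\n";
    my @values = $select->possible_values;

    for my $value (@values) {                  # outer loop: one search per option
        next unless defined $value && length $value;
        $mech->get($start_url);
        $mech->submit_form(
            form_number => $FORM_NUMBER,
            fields      => { $SELECT_NAME => $value },
        );

        # inner loop: collect the entry links first, then visit each one
        my @entries = $mech->find_all_links( url_regex => $ENTRY_RE );
        for my $link (@entries) {
            $mech->get( $link->url_abs );
            printf "%s\t%s\n", $value, $mech->title // '(no title)';
            $mech->back;                       # return to the result page
        }
    }
    return;
}

if (@ARGV) { harvest( $ARGV[0] ) }
else       { warn "usage: $0 START_URL\n" }
```

I collect the links with find_all_links() instead of calling follow_link() with an increasing `n`, because then the inner loop does not depend on which page is current; the `printf` line is where the entry-page processing would go.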
But wait: the processing of the entry pages will then be the most complex part of the script.
Approaches: In principle we could implement the same algorithm with a single while loop if we use the back() function smartly.
Can you give me a hint for the beginning, that is, for processing the entry pages in Perl with WWW::Mechanize?
I look forward to hearing from you.
Regards,
db1