Python parser runs great - now applying it to a new target [howTo]

Hello dear openSUSE fans, good evening!

Can this be applied to another target?


import urllib
import urlparse
import re

# Python 2 script: list the CPAN authors whose names start with "W"
url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
    # turn the relative author link into an absolute URL
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'name':name, 'cname':capname }

    # fetch the author page and look for a mailto link
    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

Note: it runs very well. Now I want to apply it to a new target.

BTW, I can learn a lot with this.
Let us say on this Swiss site:

educa.ch - Katalog nach Kanton und Stufe

But how should we fetch the pages? That is the problem.
I treat this like a toolbox or a cookbook: I want to learn while applying the
code.

What is necessary to apply the example to the new target?

BTW, should I fetch the pages and load them into an array, or should I loop over the

educa.ch
educa.ch
educa.ch

Looking forward to hearing from you.
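On the array-versus-loop question, here is a rough sketch of the looping variant, written for Python 3 rather than the Python 2 used elsewhere in this thread. The function name `iter_emails`, the injected `fetch` callable and the `example.invalid` pages are all made up for illustration, and the in-memory dictionary only stands in for real downloads so that the sketch runs offline.

```python
import re

def iter_emails(urls, fetch):
    """Yield (url, email-or-None) pairs, handling one page at a time."""
    for url in urls:
        html = fetch(url)  # only the current page is held in memory
        match = re.search(r'mailto:([^"]+)', html)
        yield url, (match.group(1) if match else None)

# Hypothetical stand-in for real downloads (assumption, not real pages).
FAKE_PAGES = {
    "http://example.invalid/1.asp": '<a href="mailto:a@example.invalid">mail</a>',
    "http://example.invalid/2.asp": "<p>no address here</p>",
}

def fake_fetch(url):
    return FAKE_PAGES[url]

# Collecting into a list is still possible when everything is needed at once.
results = list(iter_emails(sorted(FAKE_PAGES), fake_fetch))
print(results)
```

Looping keeps memory use flat and lets each record be printed or stored as soon as it is found. Loading all pages into a list first mostly makes sense when the pages have to be compared with each other afterwards.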

Maybe this will help you:

import urllib
import urlparse
import re

# base URL used to resolve the relative detail links
url = "http://www.educa.ch/dyn/"
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()
# each hit yields the school name and the relative link inside window.open('...')
for capname, lk in re.findall(r'<a name="\d+"></a><br><img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\(\'(\d+\.asp\?id=\d+)\'', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'cname':capname }

    # fetch the detail page and look for a mailto link
    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

PS: I didn’t try this; you’ll possibly need to debug the regexp.
PPS: I didn’t find how to turn on Python formatting on the page, or how to remove the smiley…
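Since the regexp is untested, one way to debug it without hitting the site is to run it against a small hand-written fragment first. The sketch below is Python 3, and the `SAMPLE` markup is invented to match what the pattern expects, so the real educa.ch page may well look different. The `re.DOTALL` flag is also an addition of this sketch, so that `.*?` can cross line breaks.

```python
import re
from urllib.parse import urljoin

BASE = "http://www.educa.ch/dyn/"

# Made-up fragment (assumption): one listing entry in the shape the regex expects.
SAMPLE = (
    '<a name="1"></a><br><img src="dot.gif" height="8">Schule Beispiel<br>\n'
    '<a href="#1" onclick="javascript: window.open(\'12345.asp?id=42\', \'w\')">Details</a>'
)

# Listing pattern with brackets and escapes written out in full.
LISTING = re.compile(
    r'<a name="\d+"></a><br><img [^>]+>([^<]+).*?'
    r'<a href="#\d+" onclick="javascript: window.open\(\'(\d+\.asp\?id=\d+)\'',
    re.DOTALL,
)

# Pair each captured name with the absolute URL of its detail page.
rows = [(name.strip(), urljoin(BASE, link)) for name, link in LISTING.findall(SAMPLE)]
print(rows)
```

If `rows` comes back empty on the real page source, shortening the pattern piece by piece usually shows which part of the markup differs.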

Hello, and many thanks for the help!

It is great to see such a supportive site as this one!

I was told to use the Python Regex Tester re-try
to simplify and confirm the regex.

The regex can now be simplified to: r'height="8">([^<]+)<br.*?window.open\(\'([^\']+)'

and now it works.

Also, the second regex is invalid; if we only pick up the email address, it becomes:
memail = re.search('mailto:([^"]+)', phtml)

I will come back and report more.

Until soon!
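To confirm both simplified patterns side by side, here is a small Python 3 check. The two HTML strings are invented test data, not copies of the real pages, and `re.DOTALL` is added as a precaution for entries that span several lines.

```python
import re

# Simplified listing pattern: name after the 8px image, link inside window.open('...')
ENTRY = re.compile(r'height="8">([^<]+)<br.*?window.open\(\'([^\']+)', re.DOTALL)
# Simplified email pattern: everything after "mailto:" up to the closing quote
EMAIL = re.compile(r'mailto:([^"]+)')

# Invented samples (assumption): shaped like the markup the patterns target.
listing = (
    '<img src="dot.gif" height="8">Schule Beispiel<br>'
    '<a href="#1" onclick="javascript: window.open(\'12345.asp?id=42\', \'w\')">Details</a>'
)
detail = '<a href="mailto:info@example.invalid">E-Mail</a>'

entries = ENTRY.findall(listing)
email = EMAIL.search(detail)
print(entries, email.group(1) if email else None)
```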