HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files...

Good evening!

I have 5000 files which have to be parsed in order to strip off the HTML.
The good thing is: in each HTML file I need only one line of text. The following is of interest:

In line 999 I have the following results (see the chunk below).

Well - how do I do the parser job: can I tell the HTML parser that I only need line 999? Note: the data should be stored in a database:

</p><h1>dataset 1:</h1>

 <table border="0" bgcolor="#EFEFEF"  leftmargin="15" topmargin="5"><tr>  
<td><strong>name:</strong> </td>  <td width=500> myname one         </td></tr><tr>  
<td><strong>type:</strong> </td>  <td width=500>      	type_one  (04313488)        </td></tr><tr>
<td><strong>aresss:</strong> </td><td>Friedrichstr. 70, 73430 Madrid</td></tr><tr>  
<td><strong>adresse_two:</strong> </td>  <td>      	no_value        </td></tr><tr>  
<td><strong>telefone:</strong> </td>  <td>      	0000736111/680040        </td></tr><tr>  
<td><strong>Fax:</strong> </td>  <td>      	0000736111/680040        </td></tr><tr>  
<td><strong>E-Mail:</strong> </td>  <td>      	Keine Angabe        </td></tr><tr>      
<td><strong>Internet:</strong> </td><td><a href="" target="_blank"></a><br></td></tr><tr> <td><strong>the office:</strong> </td>   
<td><a href="http://www.mysite_two" target="_blank">mysite_two </a><br></td></tr><tr> 
<td><strong>:</strong> </td><td> no_value </td></tr><tr> 
<td><strong>officer:</strong> </td>  <td> no_value        </td>  </td></tr><tr>
<td><strong>employees:</strong> </td>  <td> 259        </td></tr><tr>  
<td><strong>offices:</strong> </td>  <td>     8        </td></tr><tr>  
<td><strong>worker:</strong> </td>  <td>     no_value        </td></tr><tr>  
<td><strong>country:</strong> </td>  <td>    contryname        </td></tr><tr>  
<td><strong>the_council:</strong> </td>  <td> 

Well - the question is: is it possible to search the 5000 files with the condition that only line 999 is of interest? In other words, can I tell the HTML parser that it has to look at (and extract) exactly line 999?

Look forward to any and all ideas.


Didn’t you have a Perl script to read the HTML? Program it to pick out the line you want.

Alternatively, use sed to get the line:

sed -n 999p filename
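For the Perl route, here is a minimal sketch of "pick out the line you want", using only core Perl. The sub name nth_line and the script name in the usage comment are made up for illustration; the line number 999 comes from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return line number $wanted (1-based) of $path without the trailing newline,
# or undef if the file has fewer lines than that.
sub nth_line {
    my ($path, $wanted) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        # $. holds the current line number of the handle last read from
        if ($. == $wanted) {
            close $fh;
            chomp $line;
            return $line;
        }
    }
    close $fh;
    return;
}

# Usage (hypothetical script name): perl nth_line.pl file1.html file2.html ...
for my $file (@ARGV) {
    my $line = nth_line($file, 999);
    print "$file: ", (defined $line ? $line : '(fewer than 999 lines)'), "\n";
}
```

Reading stops as soon as the wanted line is found, so the rest of each file is never touched.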

Hello Ken_yap

I already have a little experience with HTML::TokeParser, but my knowledge is not enough.

The sed command is not bad. Thx. BTW - if I want to store the results of all the parsed files in a DB, can this be done easily?
Well, see the chunk above.
I guess that I have to replace the HTML tags with CSV separators - is this doable?

Write a Perl program to do that.
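As a rough sketch of what such a program could look like for the chunk posted above: the code below pulls label/value pairs out of the `<td><strong>label:</strong></td><td>value</td>` pattern with a regular expression and prints them as CSV. It is core Perl only, and the sub names (fragment_to_pairs, csv_field, csv_row) are made up for illustration. The regex is an assumption that every row really follows that pattern; for less regular HTML, HTML::TokeParser would be the sturdier choice, and the Text::CSV module from CPAN handles CSV quoting more thoroughly than the small helper here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn one HTML fragment into a list of [label, value] pairs.
# Assumes rows shaped like: <td><strong>label:</strong></td><td ...>value</td>
sub fragment_to_pairs {
    my ($html) = @_;
    my @pairs;
    while ($html =~ m{<td>\s*<strong>([^<]*?):</strong>\s*</td>\s*<td[^>]*>(.*?)</td>}gis) {
        my ($label, $value) = ($1, $2);
        $value =~ s/<[^>]+>//g;      # drop nested tags such as <a> and <br>
        $value =~ s/\s+/ /g;         # collapse runs of whitespace
        $value =~ s/^\s+|\s+$//g;    # trim both ends
        push @pairs, [$label, $value];
    }
    return @pairs;
}

# Quote a field only if it contains a comma, a double quote or a line break.
sub csv_field {
    my ($text) = @_;
    return $text unless $text =~ /[",\r\n]/;
    (my $escaped = $text) =~ s/"/""/g;
    return qq{"$escaped"};
}

sub csv_row { return join ',', map { csv_field($_) } @_ }

# Two rows taken from the chunk in the question
my $sample = '<td><strong>name:</strong> </td>  <td width=500> myname one         </td></tr><tr>'
           . '<td><strong>aresss:</strong> </td><td>Friedrichstr. 70, 73430 Madrid</td></tr><tr>';

my @pairs = fragment_to_pairs($sample);
print csv_row(map { $_->[0] } @pairs), "\n";   # name,aresss
print csv_row(map { $_->[1] } @pairs), "\n";   # myname one,"Friedrichstr. 70, 73430 Madrid"
```

The address contains a comma, which is why the quoting helper is needed at all: a plain join(',') would split that value into two columns.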

hello Ken YAP

Well, we can do it like this:

  1. I copy all the files to some working directory on openSUSE.

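  2. I run perl -i.old -ne 'print if $. == 999; close ARGV if eof' *html in that directory. That extracts line 999 from each file. (The close ARGV if eof part resets the line counter $. for every file; without it the numbering runs on across files and only the first file would match.)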

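  3. As the files now contain only an HTML fragment, we make it valid HTML again like this: perl -i.frag -ne 'chomp; print "<html><body>$_</body></html>\n"' *html (with a different backup suffix, so that the .old originals from step 2 are not overwritten).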

  4. Now we have a collection of HTML files consisting only of the previous line 999, which we can parse further with whatever tool we want. Well - I will try to get things done with HTML::TokeParser; I guess that I can do it with this.

Or I take simple :: csv or some other CSV module.

I will come back and report all my findings.

Do it all in one script from input files to CSV output instead of creating lots of intermediate files.
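A minimal sketch of such a one-pass script, assuming the rows on line 999 follow the `<strong>label:</strong></td><td>value</td>` pattern from the chunk in the question. It uses core Perl only; the names files_to_csv, row_values and csv_field and the script name in the usage comment are hypothetical. Nothing is modified in place and no intermediate files are written: each input file yields one CSV row on standard output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $LINE_NO = 999;   # the line of interest, as stated in the question

# Quote a field only if it contains a comma, a double quote or a line break.
sub csv_field {
    my ($text) = @_;
    return $text unless $text =~ /[",\r\n]/;
    (my $escaped = $text) =~ s/"/""/g;
    return qq{"$escaped"};
}

# Collect the cell values that follow each <strong>label:</strong> cell.
sub row_values {
    my ($html) = @_;
    my @values;
    while ($html =~ m{<strong>[^<]*</strong>\s*</td>\s*<td[^>]*>(.*?)</td>}gis) {
        my $value = $1;
        $value =~ s/<[^>]+>//g;      # drop nested tags
        $value =~ s/\s+/ /g;         # collapse whitespace
        $value =~ s/^\s+|\s+$//g;    # trim both ends
        push @values, $value;
    }
    return @values;
}

# Write one CSV row (file name, then the values) per input file to $out.
sub files_to_csv {
    my ($out, $line_no, @files) = @_;
    for my $file (@files) {
        my $in;
        unless (open $in, '<', $file) {
            warn "Skipping $file: $!\n";
            next;
        }
        my $wanted;
        while (my $line = <$in>) {
            next unless $. == $line_no;
            $wanted = $line;
            last;                    # no need to read the rest of the file
        }
        close $in;
        unless (defined $wanted) {
            warn "$file has fewer than $line_no lines\n";
            next;
        }
        print {$out} join(',', map { csv_field($_) } $file, row_values($wanted)), "\n";
    }
    return;
}

# Usage (hypothetical script name): perl html2csv.pl *.html > result.csv
files_to_csv(\*STDOUT, $LINE_NO, @ARGV) if @ARGV;
```

Most databases can bulk-import a CSV file like this; writing rows straight into a table with DBI would be the other option, which is not shown here.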