combining two parts of a script - how to invoke a variable between them

hello everybody!

This is just a basic script that I am trying to modify for my needs and experiment with. I want to parse some data. The whole script has three parts.

Note: This thread builds on some great help from a very kind and supportive user, mmarif4u. See the earlier thread "database connection - how to store the datas of a parser in a MySQL_DB?". @mmarif4u, many many thanks to you!!

the three parts are:

  1. Fetching
  2. parsing
  3. storing

I want to put everything into one script. Two parts are already put together, and there all seems to be clear…
So this thread asks about combining two parts of a script: how to pass a variable between them.

**What has happened until now:**
First I need a connection to a database, let's say MySQL. The suggestion is to use mysqli instead of mysql.
Well, okay: I save this as db.php


<?php
$host="localhost"; //database hostname
$username="******"; //database username
$password="******"; //database password
$database="******"; //database name
?>

Now I take a new script and save it as config.php

<?php
require_once('db.php'); //call db.php
$connect=mysqli_connect($host,$username,$password); //connect to mysql through mysqli
if(!$connect){
die("Cannot connect to host, please try later."); //throw error if any problem
}
else
{
$select_db=mysqli_select_db($connect,$database); //select database (the connection is the first argument)
if(!$select_db){
die("Site Database is down at the moment, Please check later. We will be back shortly."); // error if cannot connect to database or db does not exist
}
}
?>

Now I have to take care of the script that fetches the files. (Note: this is very basic, it is only a proof of concept.
In the real situation I will use cURL, since cURL is much nicer, more elegant, faster etc.)

<?php
require_once('config.php'); // call config.php for db connection
$content = file_get_contents("**<-here the path to the file goes in-> Position XY! an URL is here **");

var_dump($content);

$pattern = '/<td>(.*?)<\/td>/si';
preg_match_all($pattern,$content,$matches);

foreach ($matches[1] as $match) {
    $match = strip_tags($match);
    $match = trim($match);
    var_dump($match);
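    // $connect comes from config.php; escaping guards against quotes in the parsed data
    $sql = mysqli_query($connect, "insert into tablename(contents) values ('" . mysqli_real_escape_string($connect, $match) . "')");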
}

?> 

Note: This is just a basic script, which you can modify to your taste and play with.

Question: If I have stored the URLs that I want to parse in a local file, how do I "call" them in the script?
How do I read the file with the URLs (there are more than 2500 URLs that have to be parsed) at the
following position: $content = file_get_contents("**<-here the path to the file goes in-> Position XY! an URL is here **");

The file with the URLs is stored in the same folder as the scripts!

Many thanks for all hints and for a starting point!

If I have to write more, if you need more info, or if I have to be more concrete, just let me know!

I would love to hear from you!
db1:)

As far as I understand, you are getting the first column of a table and then storing the content in a database. Am I right? The part I don't quite understand is what exactly you are going to do with this content. I guess they are URLs. Are you trying to somehow index the content of those pages into a database?

Hi dilbertone again :slight_smile: [just read your mail; posting here so that other users can benefit from it]

Let's go through this step by step.
1st, you want to combine all the scripts? Does that mean you want to merge the database connection script into your main script? If so, I suggest leaving it separate; that will be easier for you later, when you have many files all calling the same connection script.

2nd, you want to include the URLs from a file? Or from a database?
Let's say a file… Now it totally depends on how the data is stored in the file. I will assume that each line has one URL. There are many ways to read data from a file. We will go with the file() function ATM.
So let the fun begin:

<?php
$filename = "/dir/url.txt";
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
echo $line."<br>";
}
?>

NOTE: save the above code as test.php or file.php for testing first.

You can read more here: PHP: file - Manual
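If the list contains blank lines or trailing spaces, a slightly more defensive variant might help. This is only a sketch: the function name read_urls() is made up for illustration, and it assumes one URL per line as above.

```php
<?php
// Hypothetical helper (not part of the original script): returns the
// non-empty, trimmed lines of a text file as a list of URLs.
function read_urls(string $filename): array
{
    if (!is_readable($filename)) {
        throw new RuntimeException("Cannot read URL list: $filename");
    }

    // file() returns one array element per line; the flags drop the
    // trailing newline of each line and skip completely empty lines.
    $lines = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read URL list: $filename");
    }

    $urls = [];
    foreach ($lines as $line) {
        $line = trim($line);      // remove stray spaces around the URL
        if ($line !== '') {       // skip lines that contained only whitespace
            $urls[] = $line;
        }
    }
    return $urls;
}
```

In the main script, something like `foreach (read_urls(__DIR__ . '/url.txt') as $url) { ... }` could then take the place of the hard-coded "Position XY".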

@yasar11732: I really don't think so; he does not have any select query that reads a table or a column of a table. He is inserting the contents of the URLs into a table… For what? The OP knows that best.

Try the above code first, and let us know if that worked. We will then integrate it into your main script.

Cheers

Hello dear mmarif4u, hello yasar11732 good evening! :slight_smile:

Great to hear from you!!!

Thx @mmarif4u, you're right: I want to fetch the URLs in order to feed them into the parser script. I agree with your tip and work as you advised. I run all the code in separate files.

Today I created a url.txt (and a url2.txt), where I put the URLs in.

The URLs are stored in the file url.txt. Note that I have put the file into the same directory as file.php. Below is a short part of the URLs (some of the many) that I have to parse…


http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=165463
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=165372
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=165645
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167484
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167496
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=166054
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167654
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=166832
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=195431
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167320 

…below the output that I get **if I run the file…:**

<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=165463
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=165372
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=165645
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167484
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167496
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=166054
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167654
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=166832
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=195431
<br>http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=167320 
<br>http://www.schulministerium.nrw.de
<br>

Note: as for the parser, see one of the pages. I have a Perl parser that does the job at the moment; it extracts the text of the table. Note: on each page I have about 10 or 11 labels and values… some of the many pages have 15 to 17 labels and values…

See one set of information from one page (those are the values I want to get):

Schuldaten
Schule hat Schulbetrieb
Schulnummer 178494
Amtliche Bezeichnung Kerschensteiner-Berufskolleg- -…
Strasse Kükenshove 1
Plz und Ort 33617 Bielefeld
Telefon 0521 1442880
Fax 0521 1442881
E-Mail-Adresse 178@schule.nrw.de
Internet Kerschensteiner Berufskolleg: Startseite
Schülergesamtzahl 535

See the code that does the job. Well, it would be great to have a PHP solution…

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_ergebnis_tab_info' },
);

$te->parse_file('t.html');


foreach my $table ( $te->tables ) {  
	foreach my $row ($table->rows) { 
		print "   ", join(',', @$row), "\n";
	} 
}


This gives back output like the following, which is sanitized and cleaned up a bit:
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Schuldaten,
Schule hat Schulbetrieb
Schulnummer,143960
Amtliche Bezeichnung,Franziskusschule Kath. Hauptschule Ahaus - Sekundarstufe I -
Strasse,Hof zum Ahaus 6
Plz und Ort,48683 Ahaus
Telefon,02561 4291990
Fax,02561 42919920
E-Mail-Adresse,1439@schule.nrw.de
Internet,http://www.frankusschule.de
,Schule in öffentlicher Trägerschaft
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Schülergesamtzahl,648

Well, it would be a great pleasure if we could create the parser in PHP! I tried it with
PHP Simple HTML DOM Parser, but I had no luck with the CSS selectors, since the pages have invalid markup.

Do you have an idea how we can create a PHP parser!?
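One possible starting point, offered only as a sketch: PHP's built-in DOM extension (DOMDocument with DOMXPath) is fairly tolerant of invalid markup. The function name extract_table_rows() is made up for illustration; the class name bp_ergebnis_tab_info is the one used in the Perl script above. I have not tried this against the real pages, and pages in a non-UTF-8 encoding may need extra handling before loadHTML().

```php
<?php
// Hypothetical helper: returns the rows of every table carrying the given
// CSS class, each row as a list of trimmed cell texts.
function extract_table_rows(string $html, string $class): array
{
    if (trim($html) === '') {
        return [];                       // loadHTML() rejects empty input
    }

    // Collect libxml warnings about broken markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the class attribute.
    // $class is assumed to be a plain class name without quotes.
    $query = sprintf(
        "//table[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]//tr",
        $class
    );

    $rows = [];
    foreach ($xpath->query($query) as $tr) {
        $cells = [];
        foreach ($xpath->query('./td|./th', $tr) as $cell) {
            // Collapse runs of whitespace and trim the cell text.
            $cells[] = trim(preg_replace('/\s+/u', ' ', $cell->textContent));
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

Each returned row would be a label/value pair such as `['Schulnummer', '178494']`, which could then be stored column by column instead of as one flat list of cells.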

Above all, this is a great learning opportunity and a very great lesson for me. I am happy to be on this forum. This place is great!

And now I am curious to hear from you

regards
dilbertone :slight_smile:

Hi all, it's me again,

just to clarify some things that might get confusing in this thread.

Btw, the following URL:

 http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=165463

refers to another domain, which I also want to parse.

The parser code below (!) refers to these URLs:


http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=82624&lschb=
http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=82065&lschb=
http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=88079&lschb=

Note: the code (below) parses the table of the following site very well

http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=88079&lschb=


<?php
require_once('config.php'); // call config.php for db connection
$content = file_get_contents("**<-here the path to the file goes in-> Position XY! an URL is here **");

var_dump($content);

$pattern = '/<td>(.*?)<\/td>/si';
preg_match_all($pattern,$content,$matches);

foreach ($matches[1] as $match) {
    $match = strip_tags($match);
    $match = trim($match);
    var_dump($match);
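    // $connect comes from config.php; escaping guards against quotes in the parsed data
    $sql = mysqli_query($connect, "insert into tablename(contents) values ('" . mysqli_real_escape_string($connect, $match) . "')");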
}
?> 

Well, above all, you see that I need to parse sites with tables… And I want to store the data locally in a MySQL database. Many thanks for the help here!!

cheers
dilbertone :slight_smile:

Hi dilbertone,

I have rewritten your code, and I hope it will give the outcome you expect from it.

<?php
require_once('config.php'); // call config.php for db connection
$filename = "url.txt"; // the text file that contains the URLs, one per line
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
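	$line = trim($line);
	if ($line === '') { continue; } // skip blank lines in url.txt
	$content = file_get_contents($line);
	if ($content === false) { continue; } // skip URLs that could not be fetched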
	//echo ($content)."<br>";
	$pattern = '/<td>(.*?)<\/td>/si';
	preg_match_all($pattern,$content,$matches);

	foreach ($matches[1] as $match) {
		$match = strip_tags($match);
		$match = trim($match);
		//var_dump($match);
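		// $connect comes from config.php; escaping guards against quotes in the parsed data
		$sql = mysqli_query($connect, "insert into tablename(contents) values ('" . mysqli_real_escape_string($connect, $match) . "')");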
		//echo $match;
	} 
}
?>
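As a possible refinement of the insert step, here is a sketch that uses a mysqli prepared statement instead of building the SQL string by hand, so quotes in the parsed text cannot break the query. The function names clean_cell() and store_cells() are made up for illustration; tablename and contents are the placeholder names from the script above, and $connect is assumed to be the connection from config.php.

```php
<?php
// Hypothetical helper: same cleanup as in the loop above.
function clean_cell(string $html): string
{
    return trim(strip_tags($html));
}

// Hypothetical helper: inserts every non-empty cell through one prepared
// statement and returns how many rows were stored.
function store_cells(mysqli $connect, array $cells): int
{
    $stmt = mysqli_prepare($connect, 'INSERT INTO tablename (contents) VALUES (?)');
    if ($stmt === false) {
        throw new RuntimeException('Prepare failed: ' . mysqli_error($connect));
    }

    $stored = 0;
    foreach ($cells as $cell) {
        $value = clean_cell($cell);
        if ($value === '') {
            continue;                            // nothing worth storing
        }
        // 's' marks the parameter as a string; mysqli handles the quoting.
        mysqli_stmt_bind_param($stmt, 's', $value);
        if (mysqli_stmt_execute($stmt)) {
            $stored++;
        }
    }
    mysqli_stmt_close($stmt);
    return $stored;
}
```

Inside the outer loop, a call like `store_cells($connect, $matches[1]);` could then replace the inner foreach.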

Hope this helps!

Good luck

Hello again,

many many thanks for everything, all looks great!

Well, I tried to run the script that makes use of MySQL (!)

Note: everything should run on XAMPP on Linux, but I cannot connect to the database!!

I will try it again later this weekend.

I will come back and report all the findings

cheers

You are most welcome!
Let us know, if you need more help.