Path names in a Perl script

dilbertone · October 2, 2010, 12:50am

Hi all - hello Community,

I am new to Linux and new to Perl too. I am trying to get this Perl script up and running. I have installed openSUSE 11.3.

What is wanted: I have a bunch of HTML files stored in a folder.
With the Perl script (see below) I want to parse the HTML files.

I have stored the script in the following place:

**Basisfolder [german!!] > user > perl >**

My question is how to name the paths:

a. to the folder that contains the HTML files that need to be parsed (I named this folder html.files)
b. to the file that has to be created.

I suggest that this file is also located in the same directory: **Basisfolder [german!!] > user > perl >**

I guess that this makes it easy…

Please bear with me for the noob questions. If I have to explain more, please let me know!

I would love to hear from you. Many thanks in advance for any and all help.

dilbertone!

see here the code…


#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use HTML::TokeParser;

# my $file = 'school.html';

my @html_files = File::Find::Rule->file->name( '*.html.files' )->in( $html_dir );
my $p = HTML::TokeParser->new($file) or die "Can't open: $!";

my %school;
while (my $tag = $p->get_tag('div', '/html')) {
        # first move to the right div that contains the information
        last if $tag->[0] eq '/html';
        next unless exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'inhalt_large';
        
        $p->get_tag('h1');
        $school{'location'} = $p->get_text('/h1');
        
        while (my $tag = $p->get_tag('div')) {
                last if exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'fusszeile';
                
                # get the school name from the heading
                next unless exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'fm_linkeSpalte';
                $p->get_tag('h2');
                $school{'name'} = $p->get_text('/h2');
                
                # verify format for school type
                $tag = $p->get_tag('span');
                unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'schulart_text') {
                        warn "unexpected format: parsing stopped";
                        last;
                }
                $school{'type'} = $p->get_text('/span');
                
                # verify format for address
                $tag = $p->get_tag('p');
                unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'einzel_text') {
                        warn "unexpected format: parsing stopped";
                        last;
                }
                $school{'address'} = clean_address($p->get_text('/p'));
                
                # find the description
                $tag = $p->get_tag('p');
                $school{'description'} = $p->get_text('/p');
        }
}

print qq/$school{'name'}\n/;
print qq/$school{'location'}\n/;
print qq/$school{'type'}\n/;

foreach (@{$school{'address'}}) {
        print "$_\n";
}

print qq/\nDescription: $school{'description'}\n/;

sub clean_address {
        my $text = shift;
        my @lines = split "\n", $text;
        foreach (@lines) {
                s/^\s+//;
                s/\s+$//;
        }
        return \@lines;
}

Love to hear from you!

udaman · October 2, 2010, 1:30am

I’m not sure exactly what you’re asking in your question, but I’ll give an answer based on what I think your question might be.

If you want to parse a set of filenames into an array, open a directory and read the contents of that directory into the array. In Linux, a folder is called a directory.

opendir (THISDIR, $HOME) or warn "Could not open the dir ".$HOME.": $!";
@allfiles = grep !/^\.\.?$/, readdir THISDIR;
closedir THISDIR;

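opendir opens a handle on the directory named in $HOME, readdir returns the entries of that directory, and the grep drops the "." and ".." entries before the rest is stored in the array @allfiles. You can then print the array to STDOUT to check the result.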

$HOME can be declared at the top of your program like this:

my $HOME ="/home/username/htmlfiles";

#or whatever your path actually is.
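The same idea can be sketched with a lexical directory handle and `die` instead of `warn`. With `warn` the script keeps running after a failed `opendir`, and `readdir` then complains about an invalid dirhandle; with `die` it stops at the first problem. This is only a sketch under assumptions: the temporary directory, the sample file names and the helper name `list_dir` are made up so that the example runs anywhere. In a real script `$dir` would hold your own path.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption: a throwaway directory with made-up files stands in for the real one.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.html b.html notes.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}

# Returns the sorted entries of a directory, without "." and "..".
sub list_dir {
    my ($path) = @_;
    opendir my $dh, $path or die "Could not open the dir $path: $!";
    my @entries = grep { !/^\.\.?$/ } readdir $dh;
    closedir $dh;
    return sort @entries;
}

my @allfiles = list_dir($dir);
print "$_\n" for @allfiles;
```

Note that `readdir` returns bare names, not full paths, so a later `open` needs `"$dir/$name"` rather than `$name` alone.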

I hope this helps.

dilbertone · October 2, 2010, 11:03am

hello Udaman - many thanks for the quick reply.

My question is about I/O handling: I have to find the right path names, ones that match the Linux conventions…

I took your example and made some slight corrections…

i wrote this

perl_script_three.pl

#!/usr/bin/perl

use strict;

use warnings;

use diagnostics;

use File::Find::Rule;


my $HOME ="home/usr/perl/html.files";
opendir (THISDIR, $HOME) or warn "Could not open the dir ".$HOME.": $!";
@allfiles = grep !/^\.\.?$/, readdir THISDIR;
closedir THISDIR;

response:

suse-linux:/usr/perl # perl perl_script_three.pl
Global symbol "@allfiles" requires explicit package name at perl_script_three.pl line 10.
Execution of perl_script_three.pl aborted due to compilation errors (#1)
(F) You've said "use strict" or "use strict vars", which indicates
that all variables must either be lexically scoped (using "my" or "state"),
declared beforehand using "our", or explicitly qualified to say
which package the global variable is in (using "::").

Uncaught exception from user code:
Global symbol "@allfiles" requires explicit package name at perl_script_three.pl line 10.
Execution of perl_script_three.pl aborted due to compilation errors.
at perl_script_three.pl line 12
suse-linux:/usr/perl #

i am not sure - have i done something wrong!?

Any and all help is greatly appreciated

dilbertone:)

ken_yap · October 2, 2010, 11:29am

When you use strict; you must declare every variable before use instead of relying on Perl to create it on first use, which could hide errors in the program. The quickest fix is to add my in front of the first @allfiles, i.e.

my @allfiles = grep !/^\.\.?$/, readdir THISDIR;

udaman · October 2, 2010, 1:52pm

ken_yap is right about adding the 'my' in front of the array. I usually declare all my variables, including arrays, at the beginning of the program, so when I cut and pasted that snippet of code the declaration wasn't included. Just as you declared 'my $HOME', you should declare 'my @allfiles', but only the first time it's used.

Are you trying to split out the path of a given file? Say a file lives in "/home/usr/perl/html.files/file1.html". You can split that path on the last slash (/) and drop the file name, which leaves the path to its directory. Or find a Perl module that does the work for you.
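One module that does this kind of splitting is File::Basename, which ships with Perl. A minimal sketch, using the example path from above (the file does not need to exist for this to work):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# The example path mentioned above; it is only treated as a string here.
my $path = '/home/usr/perl/html.files/file1.html';

# fileparse returns the file name, the directory part (with trailing slash)
# and the suffix that matched the given pattern.
my ($name, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);

print "directory: $dir\n";
print "name:      $name\n";
print "suffix:    $suffix\n";
```

The suffix pattern is only applied to the file name, so the dot in the directory name html.files does not confuse it.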

dilbertone · October 2, 2010, 5:14pm

Hello again you both - many thanks for the great and supportive help!

I have reworked the two scripts that I am writing in order to find the right paths, but I have had no luck. The scripts are **placed in**

home > usr > perl

i have the two scripts
a. perl_script_two.pl
b. perl_script_three.pl

The directory with the 20,000 HTML files is there as well. See further above.

I found out that I made some mistakes when describing the HTML files. Note: there are more than 20,000 HTML files in the directory, which I renamed from html.files to htmlfiles.
Important: the files themselves are all named according to the following scheme:

einzelergebnis1…
einzelergebnis2…
einzelergebnis3a…
einzelergebnis3b…
einzelergebnis3d…

Script two takes this into account; there I name the files accordingly:

my @files = File::Find::Rule->file()
                 ->name('einzelergebnis*.html')

So here we go: I start the scripts in the console as follows and get the results shown below.

suse-linux:/usr/perl # perl perl_script_two.pl


#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Find::Rule;
my @files = File::Find::Rule->file()
                 ->name('einzelergebnis*.html')
                 ->in( '/home/usr/perl/htmlfiles' );
foreach my $file(@files) {
        print $file, "\n";
}

**Results:**
Can't stat /home/usr/perl/htmlfiles: No such file or directory
at /usr/lib/perl5/site_perl/5.12.1/File/Find/Rule.pm line 594

perl_script_three.pl

suse-linux:/usr/perl # perl perl_script_three.pl


#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Find::Rule;

my $HOME ="/home/usr/perl/htmlfiles";
opendir (THISDIR, $HOME) or warn "Could not open the dir ".$HOME.": $!";
my @allfiles = grep !/^\.\.?$/, readdir THISDIR;
closedir THISDIR;

Results:
Could not open the dir /home/usr/perl/htmlfiles: No such file or directory at perl_script_three.pl line 9.
readdir() attempted on invalid dirhandle THISDIR at perl_script_three.pl line
10 (#1)
(W io) The dirhandle you're reading from is either closed or not really
a dirhandle. Check your control flow.
closedir() attempted on invalid dirhandle THISDIR at perl_script_three.pl line
11 (#2)
(W io) The dirhandle you tried to close is either closed or not really
a dirhandle. Check your control flow.

so i am a bit clueless -

Perhaps i should take a more simple script for these preliminary tests…

Look forward to hear from you!

regards
dilbertone:)

udaman · October 2, 2010, 9:05pm

Did you read the error message? In both cases, the error message is saying it can’t find the directory. Sounds like it doesn’t exist. Check that /home/usr/perl/htmlfiles/ is exactly as you think it is. Post here the output of

ls -d /home/usr/perl/htmlfiles
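The same check can be made from inside Perl with the file test operators `-e` (exists) and `-d` (is a directory). A minimal sketch; `describe_path` is a name made up for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reports whether a path exists and whether it is a directory.
sub describe_path {
    my ($path) = @_;
    return "$path is a directory"                if -d $path;
    return "$path exists but is not a directory" if -e $path;
    return "$path does not exist";
}

# "/" exists on every Linux system; the second path is the one from this thread.
print describe_path($_), "\n" for '/', '/home/usr/perl/htmlfiles';
```

Running such a test before `opendir` or `File::Find::Rule` makes it obvious whether the problem is the path or something later in the script.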

dilbertone · October 2, 2010, 9:30pm

Hello udaman! good evening!

Many thanks to you! I did as you advised. I will post those results later this evening.

In the meantime, here are some first results I have so far.

You remember the script that I introduced further above (see also below).

I tried changing the "in" argument so that it searches the current directory (the one I start the script from), assuming the files are somewhere below it: ->in( '.' );

That means I changed from …


#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Find::Rule;
my @files = File::Find::Rule->file()
                 ->name('einzelergebnis*.html')
                 ->in( '/home/usr/perl/htmlfiles' );
foreach my $file(@files) {
        print $file, "\n";
}

**to this**


#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Find::Rule;
my @files = File::Find::Rule->file()
                 ->name('einzelergebnis*.html')
                 ->in( '.' );
foreach my $file(@files) {
        print $file, "\n";
}

**and then I got the following output:**

htmlfiles/einzelergebnis80b5.html
htmlfiles/einzelergebnisa0ef.html
htmlfiles/einzelergebnis1b42.html
htmlfiles/einzelergebnis5960.html
htmlfiles/einzelergebnise523.html
htmlfiles/einzelergebnis2c7e.html
htmlfiles/einzelergebnisdf57.html
htmlfiles/einzelergebnis2b53-2.html
htmlfiles/einzelergebnisb1c0-2.html
…and 22 thousand lines further…

This seems to be the **starting point!** Now I can continue figuring out how to configure the parser script (see below). With the I/O handling and the path names sorted out in general, the parser script has to be configured next. Everything that follows is about this HTML parser script.

This means I have

a. to define the path in **$file** (the file, including its directory), and furthermore
b. to define a path in $html_dir

In other words, I need to define the paths to

a. the directory that contains the files that need to be parsed (see above).
b. the file that has to be created.

The first task can be solved with what I learned from the preliminary tests above.
That means I have to look for the files in the **directory called "htmlfiles"**.

Does that mean I have to change the following line?

 my $file = 'school.html';

**BTW**: what does the array @html_files do?


#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $file = 'school.html';
my $p = HTML::TokeParser->new($file) or die "Can't open: $!";

my %school;
while (my $tag = $p->get_tag('div', '/html')) {
	# first move to the right div that contains the information
	last if $tag->[0] eq '/html';
	next unless exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'inhalt_large';
	
	$p->get_tag('h1');
	$school{'location'} = $p->get_text('/h1');
	
	while (my $tag = $p->get_tag('div')) {
		last if exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'fusszeile';
		
		# get the school name from the heading
		next unless exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'fm_linkeSpalte';
		$p->get_tag('h2');
		$school{'name'} = $p->get_text('/h2');
		
		# verify format for school type
		$tag = $p->get_tag('span');
		unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'schulart_text') {
			warn "unexpected format: parsing stopped";
			last;
		}
		$school{'type'} = $p->get_text('/span');
		
		# verify format for address
		$tag = $p->get_tag('p');
		unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'einzel_text') {
			warn "unexpected format: parsing stopped";
			last;
		}
		$school{'address'} = clean_address($p->get_text('/p'));
		
		# find the description
		$tag = $p->get_tag('p');
		$school{'description'} = $p->get_text('/p');
	}
}

print qq/$school{'name'}\n/;
print qq/$school{'location'}\n/;
print qq/$school{'type'}\n/;

foreach (@{$school{'address'}}) {
	print "$_\n";
}

print qq/\nDescription: $school{'description'}\n/;

sub clean_address {
	my $text = shift;
	my @lines = split "\n", $text;
	foreach (@lines) {
		s/^\s+//;
		s/\s+$//;
	}
	return \@lines;
}
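As a rough sketch of how the two paths could fit together: @html_files would hold one path per HTML file found in $html_dir, and each entry takes the place of $file in turn, while a second path names the file that is written. Everything specific here is an assumption for illustration: the temporary directory stands in for htmlfiles, `schools.txt` is a made-up output name, and `parse_one` is a hypothetical stand-in for the HTML::TokeParser code above. The core `glob` function is used in place of File::Find::Rule so that the sketch needs no extra modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption: a temporary directory with two made-up files stands in for
# the real htmlfiles directory, so the sketch runs anywhere.
my $html_dir = tempdir( CLEANUP => 1 );
for my $name (qw(einzelergebnis1.html einzelergebnis2.html)) {
    open my $fh, '>', "$html_dir/$name" or die "Cannot create $name: $!";
    print {$fh} "<html><h1>$name</h1></html>\n";
    close $fh;
}
my $out_file = "$html_dir/schools.txt";    # hypothetical output file name

# Hypothetical stand-in for the parser code above: it takes one path
# (what the script calls $file) and returns something about that file.
sub parse_one {
    my ($file) = @_;
    return -s $file;    # here: just the file size in bytes
}

# @html_files holds one path per matching file in $html_dir.
my @html_files = sort glob("$html_dir/einzelergebnis*.html");

open my $out, '>', $out_file or die "Cannot write $out_file: $!";
for my $file (@html_files) {
    my $result = parse_one($file);
    print {$out} "$file\t$result\n";
}
close $out or die "Cannot close $out_file: $!";

print scalar(@html_files), " files processed\n";
```

In the real script the body of `parse_one` would be the while loop from above, with `HTML::TokeParser->new($file)` called once per file.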

i look forward to any and all help! I really appreciate a helping hand here… Many many thanks for all you did so far! This is a great place for knowledge sharing!!

metabo

dilbertone · October 2, 2010, 10:01pm

Hello Udaman

regarding your question - here an answer:

suse-linux:/usr/perl # cd /usr^C
suse-linux:/usr/perl # ls -d /home/usr/perl/htmlfiles
ls: cannot access /home/usr/perl/htmlfiles: No such file or directory
suse-linux:/usr/perl #

The same goes for this command:
ls -al /home/usr/perl/htmlfiles/

Udaman, what does this mean for the task of configuring the HTML parser script (see above)?

I need to define the path in $file (the file, including its directory) and furthermore a path in $html_dir.

It is a bit confusing. Have I done something wrong? Why do I get such results? I do not understand this.

regards dilbert

udaman · October 2, 2010, 11:08pm

The answer is very simple. The directory that you run the script in, ".", is not the same directory that your script is looking in, "/home/usr/perl/htmlfiles". That directory doesn't exist; that's what the error message is telling you. Either create the directory and move the files there, or use the path that the files are actually in.

When you are in the directory that the script is in, run "pwd", and that will give you the correct path; according to your output, the files are in the htmlfiles directory below it. Use that path in your script. Before you continue with your Perl class, you should take a class in basic Unix/Linux commands.
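Perl can report the same information itself. The sketch below uses the core modules Cwd and FindBin: `getcwd()` is the counterpart of `pwd`, and `$Bin` is the directory the running script is stored in. The variable name `$html_dir` and the assumption that htmlfiles sits next to the script are taken from this thread, not from any fixed convention.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use FindBin qw($Bin);
use File::Spec;

# getcwd() is the Perl counterpart of the shell's pwd.
my $cwd = getcwd();

# $Bin is the directory the running script lives in, regardless of where
# it was started from. "htmlfiles" is the directory name used in this thread.
my $html_dir = File::Spec->catdir($Bin, 'htmlfiles');

print "current directory: $cwd\n";
print "script directory:  $Bin\n";
print "html directory:    $html_dir\n";
print "(html directory not found)\n" unless -d $html_dir;
```

Building the path from `$Bin` means the script finds its files even when it is started from a different directory.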

dilbertone · October 2, 2010, 11:35pm

Hello Udaman, good evening!

Now it is clear: I misunderstood the German word Basisordner.

I thought the Basisordner in openSUSE was exactly the home directory.

That is not true: the Basisordner is not "/home" but "/".

Accordingly, I leave /home out of the path in the script.

Then we have:

suse-linux:/usr/perl # ls -al /home/usr/perl/htmlfiles/

results:

-rwxrwxrwx 1 root root 16855 Sep 22 02:37 einzelergebnisedf8.html
-rwxrwxrwx 1 root root 16893 Sep 22 04:27 einzelergebnisedfe.html
-rwxrwxrwx 1 root root 17035 Sep 22 02:55 einzelergebnisee02.html
-rwxrwxrwx 1 root root 16926 Sep 22 03:38 einzelergebnisee05-2.html
-rwxrwxrwx 1 root root 17042 Sep 22 01:03 einzelergebnisee05.html
-rwxrwxrwx 1 root root 16986 Sep 22 03:10 einzelergebnisee06.html
-rwxrwxrwx 1 root root 17784 Sep 22 03:43 einzelergebnisee08-2.html
-rwxrwxrwx 1 root root 17016 Sep 21 23:55 einzelergebnisee08.html
-rwxrwxrwx 1 root root 17456 Sep 22 00:08 einzelergebnisee0c.html
-rwxrwxrwx 1 root root 17176 Sep 22 03:36 einzelergebnisee15.html
-rwxrwxrwx 1 root root 17568 Sep 22 03:45 einzelergebnisee16.html
-rwxrwxrwx 1 root root 17216 Sep 21 23:56 einzelergebnisee18.html
-rwxrwxrwx 1 root root 17011 Sep 22 04:21 einzelergebnisee1b.html
-rwxrwxrwx 1 root root 16898 Sep 22 01:02 einzelergebnisee24.html
-rwxrwxrwx 1 root root 16992 Sep 22 04:32 einzelergebnisee29.html
-rwxrwxrwx 1 root root 16898 Sep 22 04:13 einzelergebnisee2d.html
-rwxrwxrwx 1 root root 17051 Sep 22 03:14 einzelergebnisee31.html
-rwxrwxrwx 1 root root 16922 Sep 22 04:22 einzelergebnisee35.html
-rwxrwxrwx 1 root root 17104 Sep 22 00:42 einzelergebnisee3d.html
-rwxrwxrwx 1 root root 17113 Sep 22 03:03 einzelergebnisee3e.html
-rwxrwxrwx 1 root root 16961 Sep 22 04:29 einzelergebnisee3f.html
-rwxrwxrwx 1 root root 17040 Sep 22 03:40 einzelergebnisee45.html
-rwxrwxrwx 1 root root 17027 Sep 22 00:03 einzelergebnisee4c.html
-rwxrwxrwx 1 root root 16850 Sep 22 02:56 einzelergebnisee4f-2.html
-rwxrwxrwx 1 root root 17053 Sep 22 03:55 einzelergebnisee4f-3.html
-rwxrwxrwx 1 root root 17159 Sep 22 00:56 einzelergebnisee4f.html
-rwxrwxrwx 1 root root 19650 Sep 21 23:49 einzelergebnisee55.html

and so forth ----… more than 20 000 lines…

suse-linux:/usr/perl # cd usr

now we are a step ahead. That is great!

dilbertone

dilbertone · October 2, 2010, 11:48pm

sorry for the typo:

This is correct:

suse-linux:/usr/perl # ls -al /usr/perl/htmlfiles/

That is the right one. I posted a wrong path in the posting above!

sorry again…
dilbertone

Fruchtratte · October 5, 2010, 3:59pm

You probably know the difference between absolute and relative path names.

On Windows an absolute path name would be something like this:
C:\user\perl\file.txt
Or a relative one:
..\anotherfolder\myfile.zip

Linux has two differences. First, you separate folders with / and not \.
Second, there is nothing like C:, D: and so on.
Instead, every absolute path starts with a /

So, if you want to specify absolute paths you use something like this:
/user/perl/file.txt
Or relative ones like:
../anotherfolder/myfile.zip
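In a Perl script the core module File::Spec can tell the two kinds apart and turn a relative path into an absolute one. A small sketch; the paths are modeled on the examples above and do not need to exist, and `$base` is an assumed starting directory:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Assumed starting directory that relative paths are resolved against.
my $base = '/user/perl';

for my $path ('/user/perl/file.txt', '../anotherfolder/myfile.zip') {
    my $kind = File::Spec->file_name_is_absolute($path) ? 'absolute' : 'relative';

    # rel2abs prepends $base to a relative path and leaves absolute paths alone.
    my $abs = File::Spec->rel2abs($path, $base);

    print "$path is $kind, resolved: $abs\n";
}
```

If the second argument is left out, `rel2abs` resolves against the current working directory, which is why a relative path in a script depends on where the script is started.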

Hope that helps.