WWW::Mechanize::Firefox: debugging attempt needs your help & ideas

Hello dear Perl fans,

I am running the script below, which is written to take screenshots of websites.
I also have MozRepl up and running.

Here is the file with some of the requested URLs. Note that this is only a short snippet; the real list is much longer and contains more than 3500 lines, one URL per line.


http://www.unifr.ch/sfm
http://www.zug.phz.ch
http://www.schwyz.phz.ch
http://www.luzern.phz.ch
http://www.schwyz.phz.ch
http://www.phvs.ch
http://www.phtg.ch
http://www.phsg.ch
http://www.phsh.ch
http://www.phr.ch
http://www.hepfr.ch/
http://www.phbern.ch
http://www.ph-solothurn.ch
http://www.pfh-gr.ch
http://www.ma-shp.luzern.phz.ch
http://www.heilpaedagogik.phbern.ch/

What is strange is the output, see below.
Question: should I change the script?

Why do I get this output from the following little script?



#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = new WWW::Mechanize::Firefox();

open(INPUT, "<urls.txt") or die $!;

while (<INPUT>) {
        chomp;
        print "$_\n";
        $mech->get($_);
        my $png = $mech->content_as_png();
        my $name = "$_";
        $name =~s/^www\.//;
        $name .= ".png";
        open(OUTPUT, ">$name");
        print OUTPUT $png;
        sleep (5);
}



Here is the rather overwhelming output. To be frank, I never thought I would get such funny output, so I have to debug the whole code. See below:

http://www.unifr.ch/sfm
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 2.
http://www.zug.phz.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 3.
http://www.schwyz.phz.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 4.
http://www.luzern.phz.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 5.
http://www.schwyz.phz.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 6.
http://www.phvs.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 7.
http://www.phtg.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 8.
http://www.phsg.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 9.
http://www.phsh.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 10.
http://www.phr.ch
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 11.
http://www.hepfr.ch/
print() on closed filehandle OUTPUT at test_3.pl line 20, <INPUT> line 12.
http://www.phbern.ch                                                                                                                                                                            

Some musings: firstly, I do not think this is a very serious error. I have to debug it, and then it will work better.
Secondly, I first thought that the script might “overload the machine”.
Now I am not so sure about that: the symptoms do look strange, but I guess it is not necessary to conclude that the machine is overloaded.

Thirdly, I am thinking about which steps have to be taken to find out whether the problem is related to WWW::Mechanize::Firefox at all.
This leads me to the question of what the Perl warning means, and to the idea of using the diagnostics pragma to get more explanation. What do you think?


    print() on unopened filehandle FH at -e line 1 (#2) (W unopened) An I/O operation was attempted on a filehandle that was never initialized.

Well, we need to do an open(), a sysopen(), or a socket() call, or call a constructor from the FileHandle package.

Alternatively, looking up print() on closed filehandle OUTPUT gives lots of answers, which tell us that we did not use autodie and did not check the return value of open.
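
A minimal sketch of why every open() probably fails here, assuming the file name is still the full URL plus ".png". The name then starts with "http://", so the system looks for a directory called "http:" that does not exist. The helper name try_create is made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: try to create a file and return the error text,
# or undef if the open succeeded.
sub try_create {
    my ($name) = @_;
    if (open(my $fh, '>', $name)) {
        close($fh);
        unlink $name;    # remove the empty test file again
        return;
    }
    return "$!";
}

# The name the original script builds: the URL with ".png" appended.
my $name = 'http://www.unifr.ch/sfm' . '.png';
my $err  = try_create($name);

print defined $err
    ? "cannot create '$name': $err\n"
    : "created '$name'\n";
```

If this reports an error, the problem is the file name and not WWW::Mechanize::Firefox.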

I have to debug it and find out where the error comes into play.

I have serious trouble with this stuff,

but after some musings I think it is worth taking a closer look at all the things that can be tested.

What do you think about the idea of always testing that the file is open before using it?
That means we should also get into the habit of using the three-argument open():



open my $fh, '>', $name or die "Can't open file $name : $!";
print $fh $stuff;



I guess that we can, or should, work around this without using die(),
but then we would need some method of our own to tell us which files could not be created. In
our case, it looks like all of them, at least all that are shown above.
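
One way this could look, as a rough sketch only: instead of die(), remember every file that could not be written and report the whole list at the end. The helper save_or_record and the placeholder data are assumptions for illustration; in the real script the data would be the PNG returned by content_as_png().

```perl
use strict;
use warnings;

# Hypothetical helper: write $data to $name in binary mode.
# Returns 1 on success; on failure it stores the reason in %$failed and returns 0.
sub save_or_record {
    my ($name, $data, $failed) = @_;
    my $fh;
    unless (open($fh, '>', $name)) {
        $failed->{$name} = "$!";
        return 0;
    }
    binmode($fh);
    print {$fh} $data;
    unless (close($fh)) {
        $failed->{$name} = "$!";
        return 0;
    }
    return 1;
}

my %failed;

# Placeholder data and a deliberately bad name, just to show the bookkeeping.
save_or_record('http://www.unifr.ch/sfm.png', 'fake image bytes', \%failed);

# After the loop: list everything that could not be created.
warn "could not create $_: $failed{$_}\n" for sort keys %failed;
```

The loop keeps running for the remaining URLs, and the failures are still visible afterwards.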

What do you think?

Me thinketh that there’s something wrong in the print statement on line 20. But I guess you already thought so too.

I don’t program in perl, but I do in PHP, so it doesn’t look abracadabra to me. A trick I use in PHP if I run into trouble like this, is to check the values all along the way. Meaning, that I make the script echo the name, lineno and value of every variable.
Could it be that “my $png = $mech->content_as_png();” needs to be “my $png = $mech->content_as_png($_);”?
Don’t you have to close the filehandle after each write?

Merely suggesting.

Hello Knurpht, good day.

Many, many thanks. To me Perl looks a bit like abracadabra too :wink: I tried to use wget, but that does not work for me, since I need some rendering functions.

**What is needed:** I have a list of 2,500 URLs, one on each line, saved in a file. I want a script (see below) to open the file, read a line, and then retrieve the website. So far so good; I think I will try something like this:


use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open(INPUT, "urls.txt") or die "Can't open file: $!";

while (<INPUT>) {
        chomp;
        $mech->get($_);
        my $png = $mech->content_as_png();
}

close(INPUT);

exit;


Well, I assume that content_as_png() returns the given tab or the current page rendered as a PNG image. All parameters are optional.
$tab defaults to the current tab. If the coordinates are given, that rectangle will be cut out. The coordinates should be a hash with the four usual entries: left, top, width, height. Hmmm, this is specific to WWW::Mechanize::Firefox.

I am also looking for **a PHP approach:**

I found one here: phpThumb() - The PHP thumbnail generator

phpThumb() uses the GD library to create thumbnails from images (JPEG, PNG, GIF, BMP, etc) on the fly. The output size is configurable (can be larger or smaller than the source), and the source may be the entire image or only a portion of the original image. True color and resampling is used if GD v2.0+ is available, otherwise paletted-color and nearest-neighbour resizing is used. ImageMagick is used wherever possible for speed. Basic functionality is available even if GD functions are not installed (as long as ImageMagick is installed). One demo file uses portions of Javascript API by James Austin.

System Requirements:
Your website hosting provider must support:

PHP (v4.0.6 is bare minimum; v4.3.3 recommended; v5.0.0+ adds some additional filtering capabilities).
PHP GD library, ideally the bundled version that comes with PHP v4.3.0 or higher. Partially optional if ImageMagick is available.
ImageMagick. Partially optional if PHP-GD is available.

What do you think about this approach?

Well, I think PHP can’t do this on its own, as it does not include an HTML rendering library. We can find an external method of capturing the screenshots and communicate with that method using PHP, though.

First we’ll need a system set up to take screenshots.

Knurpht, just look into IECapt (IECapt - A Internet Explorer Web Page Rendering Capture Utility), CutyCapt (CutyCapt - A Qt WebKit Web Page Rendering Capture Utility) or khtml2png (khtml2png - Make screenshots from webpages), and let us think about configuring one of those on a system.

Then we need to set up a PHP script that will exec() the screenshot-taking application and return the data to the browser.

For example:


    <?php
    $in_url = 'http://' . $_REQUEST['url']; // !!INSECURE!! In production, make sure to sanitize this input!
    $filename = '/var/cutycapt/images/' . $_REQUEST['url'] . '.png'; // Will probably need to normalize filename too, this is just an illustration
    
    // First check the file does not exist, if it does exist skip generation and reuse the file
    // This is a super simple caching system that will help to reduce the resource requirements
    if(!file_exists($filename)) {
      // Pass the full URL (with scheme) and quote both arguments for the shell
      exec('/usr/local/bin/CutyCapt --url=' . escapeshellarg($in_url) . ' --out=' . escapeshellarg($filename));
    }

    // Second check if the file exists, either from a previous run or from the above generation routine
    if(file_exists($filename)) {
      header('Content-type: image/png');
      print file_get_contents($filename);
    } else {
      header('Status: 500 Internal Server Error');
    }
    ?>

We can then call the script in the following way:

    http://localhost/screenshot.php?url=www.google.com

Building the screenshots is going to be CPU intensive, so some friends strongly recommend building in some kind of file caching (i.e. save the output and check whether we already have a screenshot somewhere), perhaps even a queuing system, so that our screenshot server does not get overwhelmed.
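
The caching idea also fits the Perl script: before fetching a URL, check whether its PNG is already on disk, so that an interrupted run over the long list can be restarted cheaply. A minimal sketch, where the helper needs_capture and the two file names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a screenshot still has to be taken,
# i.e. the target file is missing or empty, and 0 otherwise.
sub needs_capture {
    my ($file) = @_;
    my $size = -s $file;    # undef or 0 when the file is missing or empty
    return $size ? 0 : 1;
}

for my $file ('unifr.ch_sfm.png', 'phbern.ch.png') {
    if (needs_capture($file)) {
        # The real script would call get() and content_as_png() here.
        print "would capture $file\n";
    }
    else {
        print "skipping $file, already captured\n";
    }
}
```

Treating an empty file as "not done" means that a screenshot which failed halfway is taken again on the next run.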

Knurpht, a good solution can be done in either Perl or PHP.

What do you think? I am open to both languages.

If we need to print out binary data (such as an image file),
we have to set binary mode on the filehandle explicitly.

We also have to close a filehandle when we do not need it anymore.
Besides this, we can use ‘or die’ on open.

By the way, we need good file names. Since I have a huge list of URLs, I get a huge list of output files.
Therefore I need good file names. Can we reflect those needs in the programme?


#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = new WWW::Mechanize::Firefox();

open(INPUT, "<urls.txt") or die $!;

while (<INPUT>) {
        chomp;
        next unless $_ =~ m/http/i;    # skip lines that do not contain a URL
        print "$_\n";
        $mech->get($_);
        my $png = $mech->content_as_png();
        my $name = "$_";
        $name =~s#http://##is;
        $name =~s#/##gis;
        $name =~s#\s+\z##is;
        $name =~s#\A\s+##is;
        $name =~s/^www\.//;
        $name .= ".png";
        open(my $out, ">",$name) or die $!;
        binmode($out);
        print $out $png;
        close($out);
        sleep (5);
}
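
The file name part can be tried out on its own, without Firefox. This is only a sketch with a made-up helper name, url_to_filename. Unlike the script above, it replaces slashes and other unsafe characters with an underscore instead of deleting them, which is an assumption about what makes a "good" name:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a URL into a flat, file-system-safe PNG name.
sub url_to_filename {
    my ($url) = @_;
    my $name = $url;
    $name =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
    $name =~ s{^https?://}{}i;            # drop the scheme
    $name =~ s{^www\.}{}i;                # drop a leading "www."
    $name =~ s{/+$}{};                    # drop trailing slashes
    $name =~ s{[^A-Za-z0-9._-]+}{_}g;     # replace anything unsafe with "_"
    return "$name.png";
}

print url_to_filename($_), "\n"
    for 'http://www.unifr.ch/sfm', 'http://www.hepfr.ch/';
```

Note that a URL which appears twice in the list maps to the same name, so the second screenshot overwrites the first.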

what do you think!?

Knurpht wrote:
> A trick I use in PHP if I run into trouble like this,
> is to check the values all along the way. Meaning, that I make the
> script echo the name, lineno and value of every variable.

I do a lot of that in perl too. But there’s also another trick that’s
valuable in this kind of situation. Instead of:

./my-script.pl

use the debugger:

perl -d my-script.pl

and the how-to documentation starts at
http://perldoc.perl.org/perldebtut.html
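
For the print-the-values trick quoted above, a small Perl sketch could look like this. The helper name trace is made up; it only formats a label, a value and the line number of the caller, and warn sends the result to STDERR:

```perl
use strict;
use warnings;

# Hypothetical helper: format a variable name, its value and the caller's line number.
sub trace {
    my ($label, $value) = @_;
    my (undef, undef, $line) = caller;
    $value = 'undef' unless defined $value;
    return "line $line: $label = '$value'";
}

my $url  = 'http://www.unifr.ch/sfm';
my $name = $url . '.png';

warn trace('url',  $url),  "\n";
warn trace('name', $name), "\n";    # shows that the name still contains "http://" and "/"
```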

Hello dear djh-novell, hello Knurpht,

many, many thanks for all the ideas and suggestions. I will take care to use all your advice. I will do some tests later tonight and come back to report all the findings.

You are great and very, very supportive :wink: