OCR engine Tesseract: how to automate text recognition on a large number of files

Hello community,

I have a large number of files that I want to parse. Here is an example of what they look like:

http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Image
http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Html

Well, I guess that using Image::OCR::Tesseract could be interesting. I think I can parse these with Tesseract (Image::OCR::Tesseract - search.cpan.org).


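    use strict;
    use warnings;
    use Image::OCR::Tesseract 'get_ocr';

    # Path to a single image to run through OCR
    my $image = './hi.jpg';

    # get_ocr returns the recognized text as a string
    my $text = get_ocr($image);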

What do you think?

I would write a small bash script.
It shouldn't be more than 3 or 4 lines if you have all images in one folder.
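
Something along these lines, as a rough sketch rather than a tested script. It assumes the images are .jpg files in a single folder and that the tesseract command-line tool is installed and on the PATH. The function name ocr_folder and the OCR_CMD variable are names made up for illustration; the only real tool involved is tesseract itself, which takes an image and an output base name and writes BASE.txt. The core is just the loop, the rest is a bit of safety:

```bash
#!/usr/bin/env bash
# Sketch: run OCR on every .jpg in a folder, writing one .txt file per image.
# ocr_folder and OCR_CMD are hypothetical names; OCR_CMD defaults to tesseract.

ocr_folder() {
    local dir="${1:-.}"                 # folder with the images (default: current dir)
    local ocr="${OCR_CMD:-tesseract}"   # command to run, overridable for dry runs
    local img
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue       # glob matched nothing: skip
        # tesseract appends ".txt" to the output base name by itself
        "$ocr" "$img" "${img%.jpg}" || echo "failed: $img" >&2
    done
}
```

Call it as `ocr_folder /path/to/images`; each image then gets a .txt file with the same name next to it. If the documents are not in English, tesseract's `-l` option selects the language data (for example `-l deu`), provided that language pack is installed.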

Hello Fruchtratte,

Many thanks for the quick reply. Indeed, I have all the files in one folder. Tesseract is supposed to be one of the three most powerful OCR engines. I am a bit unfamiliar with TA, but I will try to write the script.

BTW, which one should I take: the Google OCR Tesseract or the Perl one (Image::OCR::Tesseract - search.cpan.org)?

Note: the Google one should fit into openSUSE 11.3 with ease, at least I guess so!

Love to hear from you.
DB1