OCR scan to text - initial help please.

Hi all, I have read the thread on the TW page but it suggests I start a new thread for Leap, so here I am. I am running 15.3 and have been able to install tesseract-ocr from the 15.3 repo without any issues. There are also various Python offerings which refer to a Google GUI, but I am happy with the CLI for now.

My question is how do I get the output from the scanner into tesseract?

I am using an ancient HP AiO for scanning which works fine, but I use it with Skanlite and only for the occasional scan, not in a production environment. Please could I have some help connecting the scanner to tesseract. I am scanning a consignment schedule of saplings for planting – small text and all in Latin, as they use correct botanical species and variety names.

Any and all help gratefully received.

Budge.

I blogged about the setup for conducting OCR (with gimagereader) on an image here: https://forums.opensuse.org/entry.php/77-openSUSE-12-1-to-openSUSE-Leap-15-3-with-gImageReader-and-Tesseract

I have been using gimagereader since openSUSE-12.1 up to and including LEAP-15.3.

What I nominally do is scan a document with an app as simple as xsane and save the output as a JPEG. Then I import the JPEG into gimagereader, which conducts the OCR to text.

The alignment/quality of the scan makes a big difference in the success … as does a decent spell check.

Still, it's not a 100% automatic process, and I do have to spend some time correcting words where the OCR process was not perfect. Sometimes I will paste the output text into a word processing program to take advantage of a superior spell check.

Please be aware that the “golden rule” for “tesseract” used to be – “use .TIFF as the input file” –

  • This has been changed to “anything readable by Leptonica is supported” …

Be that as it may, for “tesseract” the best OCR performance is probably (still) achieved with .TIFF input files – with this caveat for the case that “tesseract” shall output .PDF files (see “what's the best image input type for tesseract?” on Stack Overflow):

  • Use small .PNG input files to produce page-by-page PDF output, which will then have to be merged – a rough sketch of that work-flow follows below.
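
As a rough, untested sketch of that page-by-page approach (assuming the page images are named page-01.png, page-02.png, … and that the poppler tool “pdfunite” is available for the merge step):

for f in page-*.png; do tesseract "$f" "${f%.png}" pdf; done   # one searchable PDF per page
pdfunite page-*.pdf schedule.pdf                               # merge into a single PDF

Newer versions of “tesseract” can also take a plain text file listing the image names and write one multi-page PDF directly, which avoids the merge step altogether.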

So, to answer your question – assuming that your main purpose is to recover the text from the physical documents –

  • Scan to .TIFF files and then, as the next step in the work-flow, feed the .TIFF files to “tesseract” – a minimal command-line sketch follows after this list.
  • Verify that the content of the text files output by “tesseract” is the same as the text of the original physical documents.
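
For example, a minimal command-line version of that work-flow might look like this (only a sketch – the file name is a placeholder, and the exact --resolution and other options depend on the scanner's SANE back-end):

scanimage --format=tiff --resolution 300 > page-01.tiff   # scan one page to a .TIFF file
tesseract page-01.tiff page-01                            # writes the recognised text to page-01.txt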

For this sort of work, it's not usual to pipe the scanner's output directly to “tesseract” because the text output of “tesseract” has to be very carefully checked for errors – it's easier to check whether the scanned .TIFF files are acceptable than to scan again and then check the “tesseract” output …

  • There are many war stories of companies and government departments scanning physical documents to computer files and then destroying the physical documents before the content of the computer files has been checked for correctness.

If I do not already have a suitable image, I create one with at least 300 dpi resolution, though some text works better with 600 dpi.
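
If you are unsure of the resolution of an existing image, ImageMagick (assuming it is installed) can report it, for example:

identify -units PixelsPerInch -format "%x x %y dpi\n" image.png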

Assuming it is called image.png, I open a terminal, go to the folder and enter

tesseract image.png stdout

I then copy the resulting text from the screen into a suitable document to save it. (You can send it to a file, but as I often copy text from a series of images, I prefer just to copy the text from the screen directly into the file.)
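
As a variation on the same command – again only a sketch, since the exact package names for the language data differ – you can let tesseract write the text file itself, and tell it which language model to use (the Latin trained data, “lat”, has to be installed separately as a tesseract language/traineddata package):

tesseract image.png result          # writes result.txt
tesseract image.png stdout -l lat   # use the Latin ("lat") trained data, if installed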

Where a library allows photocopying of a text, I have found that a smartphone image has sufficient resolution to provide a good source image.