OCR program?

Is there a good OCR program that works well in openSUSE?

This may help
https://forums.opensuse.org/blogs/oldcpu/opensuse-12-1-ocr-gimagereader-tesseract-77/

After much testing, I ended up using online OCR services, such as Free online OCR.

At the time it worked much better than any standalone app; I don’t know whether things have improved since.

I would strongly recommend Cuneiform for OCR.

There are two options:

  1. OCRFeeder with the Cuneiform engine for Linux (click on “show unstable packages”)
  2. The original Cuneiform OCR program for Windows, running under Wine. This produces EXCELLENT results and runs quite well with Wine. It looks like it’s still available here: Index of ftp://mrclon.lianet.ru/Soft/CuneiForm

Thanks! I’d prefer not to put WINE on my system. So I’ll look into the first option.

I am NON-technical :open_mouth:

I can see openSUSE packages for gocr and ocrad. Do they achieve similar results?

Paul


linux-xfp4:~ # zypper se ocr
Loading repository data...
Reading installed packages...

S | Name        | Summary                                                           | Type   
--+-------------+-------------------------------------------------------------------+--------
  | gocr        | Optical Character Recognition Program                             | package
  | gocr-gui    | Optical Character Recognition Program - Basic Graphical Interface | package
  | ocrad       | Optical Character Recognition Program                             | package
  | ocrad-devel | Development files for GNU ocrad                                   | package
linux-xfp4:~ # zypper info gocr
Loading repository data...
Reading installed packages...


Information for package gocr:

Repository: openSUSE-12.1-Oss
Name: gocr
Version: 0.49-3.1.2
Arch: x86_64
Vendor: openSUSE
Installed: No
Status: not installed
Installed Size: 904.0 KiB
Summary: Optical Character Recognition Program
Description: 
GOCR is an optical character recognition program. It reads images in
many formats and outputs a text file. It is also able to recognize
and translate barcodes.
linux-xfp4:~ # zypper info ocrad
Loading repository data...
Reading installed packages...


Information for package ocrad:

Repository: openSUSE-12.1-Oss
Name: ocrad
Version: 0.21-12.1.2
Arch: x86_64
Vendor: openSUSE
Installed: No
Status: not installed
Installed Size: 290.0 KiB
Summary: Optical Character Recognition Program
Description: 
GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature
extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm
(color) formats and produces text in byte (8-bit) or UTF-8 formats.
Also includes a layout analyser able to separate the columns or blocks of text
normally found on printed pages.
Ocrad can be used as a stand-alone console application, or as a backend to
other programs.
linux-xfp4:~ # 


Not in my experience. Go to software.opensuse.org and search for tesseract; you will find several ‘unstable’ versions. I installed the one in the LazyKent repository, which should also install yagf. If it does not, install yagf as well.

You then need the relevant traineddata; in my case it was eng.traineddata. I Googled for the latest 3.02 version, downloaded it and unzipped it. Among the unzipped files you will find a folder called ‘tessdata’; simply copy the files in this folder into your /usr/share/tessdata directory (using su - to acquire root privileges).

(Some of the traineddata packages are in software.opensuse.org but I couldn’t find the one I needed.)
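For anyone who prefers a terminal, the copy step described above could look roughly like this. It is only a sketch: install_traineddata is a made-up helper name (it is not part of tesseract), the destination defaults to the /usr/share/tessdata directory mentioned above, and the source folder is whatever path you unzipped the download to.

```shell
#!/bin/sh
# Sketch: copy Tesseract language data into the tessdata directory.
# install_traineddata is a hypothetical helper, not shipped with tesseract.
install_traineddata() {
    td_src=$1                            # the unzipped 'tessdata' folder
    td_dest=${2:-/usr/share/tessdata}    # default target, as named in the post
    if [ ! -d "$td_src" ]; then
        echo "source folder not found: $td_src" >&2
        return 1
    fi
    mkdir -p "$td_dest" || return 1
    # Copy every file from the unzipped folder into the target directory.
    cp "$td_src"/* "$td_dest"/
}
```

After becoming root with su -, calling it with the path to the unzipped ‘tessdata’ folder would do the copy, and ls /usr/share/tessdata should then list eng.traineddata.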

When you open yagf for the first time, change the OCR engine setting to tesseract.

I got results many times better from tesseract than I have ever got from gocr or ocrad. I haven’t tried cuneiform, the other engine supported by yagf.

(Cue some comments from someone with experience of both tesseract and cuneiform.)

Thought I’d try to answer my own question, so I scanned the same page at 300 dpi greyscale and tried gocr, cuneiform and tesseract.

gocr found all the text but much of its output was incomprehensible; cuneiform read the middle of the page very well but failed to read any of the top and bottom paragraphs. Tesseract read the whole page with relatively few errors, most of them obvious substitutions.
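If someone wants to repeat this kind of comparison from the command line rather than through a GUI, a rough sketch might look like the following. The run_engines name, the file names and the eng language code are placeholders of mine, and the options shown (-i/-o for gocr, -l/-f/-o for cuneiform, image then output base then -l for tesseract) are as I understand each tool’s command-line help, so check them against the versions you actually have installed. Each engine also accepts a different set of image formats, so the scan may need converting first.

```shell
#!/bin/sh
# Sketch: run whichever of the three OCR engines are installed on one page.
# run_engines is a hypothetical helper; paths and language are placeholders.
run_engines() {
    ocr_img=$1    # scanned page
    ocr_out=$2    # directory that receives one text file per engine
    mkdir -p "$ocr_out" || return 1
    for ocr_engine in gocr cuneiform tesseract; do
        if ! command -v "$ocr_engine" >/dev/null 2>&1; then
            echo "$ocr_engine: not installed, skipped"
            continue
        fi
        ocr_status=0
        case $ocr_engine in
            gocr)
                gocr -i "$ocr_img" -o "$ocr_out/gocr.txt" || ocr_status=$? ;;
            cuneiform)
                cuneiform -l eng -f text -o "$ocr_out/cuneiform.txt" "$ocr_img" || ocr_status=$? ;;
            tesseract)
                # tesseract appends .txt to the output base name itself
                tesseract "$ocr_img" "$ocr_out/tesseract" -l eng || ocr_status=$? ;;
        esac
        echo "$ocr_engine: exit status $ocr_status"
    done
}
```

Something like run_engines page.tif results would then leave the text files side by side in results/ for reading or diffing. It only runs the engines; judging which output is better is still down to reading the results.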

Thanks for doing that test! Is there an official release of tesseract or only unstable versions?