OCR and Linux

I am currently using gImageReader as a front end to the OCR program Tesseract on openSUSE-11.2. Here is the story of how I came to install it.

As part of my efforts to learn to speak French, I purchased some “Lectures CLE” beginners books in French as reading material. Unfortunately I pulled an oldcpu ‘faux pas’: even those beginner books (with a limited vocabulary of 700 to 1500 MOTS (words) each) were a bit too advanced for my pathetic French language skills. So I decided to use Google (or Babelfish) translation to help me better understand the content. But retyping the text was a bit too much.

Hence I decided to try scanning and using an OCR to create an electronic version that I could copy and paste into Google translation (it turns out it does a better translation job than Babelfish).

Note this is all done on my openSUSE-11.2 installation, but I assume this would be similar on an openSUSE-11.3 install.

xsane to scan. To scan I use the program “xsane” and I save each scanned page as a .jpg file (at 600 dpi) [and also initially as a .pbm file]. I found that at 600 dpi I get superior OCR results (more on that shortly) than I do at 300 dpi. My scanner is an HP-All-In-One Premium pro, and I scan over a wireless network with Linux.

gImageReader. To do the OCR reading of these jpeg files, I use the program gImageReader, which is a graphical front end to tesseract-ocr. Selecting gImageReader was not intuitively obvious, as I first tried other OCR programs as noted in this article: Linux OCR Software Comparison [splitbrain.org]. However the long and short of it was that I could not get them all to run, or if I could get them to run, they did not work out well.

Tesseract. I first tried tesseract-ocr from an rpm available on the build service, but while I managed to install it, I could not get it to work from the command line. There is no “man tesseract” entry and I was too lazy to do many trial and error iterations to get a command line version functioning.
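For anyone in the same position: tesseract 2.x is normally invoked with an input image, an output base name, and a language code, and as far as I know it expects TIFF input rather than jpeg. A minimal sketch, assuming ImageMagick's "convert" is installed; the file name page-01.jpg and the helper function names are hypothetical and not from the post above:

```shell
#!/bin/sh
# Sketch only: file names are hypothetical, and using ImageMagick's
# "convert" for the jpeg-to-TIFF step is an assumption.

# Output base name for a scan: page-01.jpg -> page-01
# (tesseract appends .txt to the base name itself)
base_name() {
    printf '%s\n' "${1%.*}"
}

# OCR one scanned image with a tesseract language code (e.g. fra, deu, eng).
# Usage: ocr_page IMAGE LANGUAGE
ocr_page() {
    image=$1
    lang=$2
    base=$(base_name "$image")
    # Convert to TIFF first, then run tesseract, which writes $base.txt
    convert "$image" "$base.tif" &&
        tesseract "$base.tif" "$base" -l "$lang"
}
```

With the French language data installed, "ocr_page page-01.jpg fra" should leave the recognized text in page-01.txt.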

gocr. I then tried gocr, installing various versions of it. I tried the version on the build service, but I kept getting a Segmentation fault error when I tried to run it. I tried to build it myself (from tarball) but again, I could not get gocr to work properly.

I noted then that my efforts to run Tesseract had failed, and I was thinking maybe the Segmentation fault I encountered with GOCR was because I had bad command line arguments. Maybe I needed a GUI front end? So I went looking for a Tesseract/GOCR front end.

ocrgui. I tried the program ocrgui, which is an OCR front end written in C using the GLib and GTK+ frameworks. It supports both Tesseract and GOCR, and it includes spell checking using Hunspell, an open source spell checker. So I installed Hunspell, and then tried to get ocrgui to work. However it also would not work, constantly giving me an error whenever I went to run it. I tried rebuilding from tarball, tried rebuilding from an rpm source file (of an rpm on the ocrgui site), and even tried the rpm on the ocrgui site. No joy. In all cases I encountered the same error when I tried to run the program. Surfing on the error, I noted others had encountered the same in the past, with no posted solution.

ocrad. So then I tried ocrad, and I installed it from an rpm on the build service. If I saved my scanned pages as .pbm files, I was then able to get ocrad to convert them to text via its OCR algorithms with the command:

 ocrad scanned-document.pbm

where ‘scanned-document’ is the saved name of the scanned document (in .pbm format).
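To run the same command over a whole book of scanned pages, a small loop helps. This is only a sketch: it assumes all pages sit in the current directory as .pbm files, and the function names are hypothetical. ocrad writes its text to standard output, so each page is redirected into a matching .txt file:

```shell
#!/bin/sh
# Text file name for a scanned page: page-01.pbm -> page-01.txt
txt_name() {
    printf '%s\n' "${1%.pbm}.txt"
}

# Run ocrad on every .pbm file in the current directory.
ocr_all_pbm() {
    for page in ./*.pbm; do
        # If there are no .pbm files the pattern stays unexpanded: skip it
        [ -e "$page" ] || continue
        ocrad "$page" > "$(txt_name "$page")"
    done
}
```

Calling "ocr_all_pbm" in the scan directory then produces one text file per page.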

So finally some success, but the ocrad I was using was clearly for English and was missing most of the French accents. Also, it was text/command line based, and I was curious to see if I could get a GUI version working. At this point I had to take a break, and when I came back, rather than pursue an OCRAD front end (which may have led me to kooka, which is probably where I should have gone in the 1st place since I have a KDE-4 desktop), I decided to look for a front end for Tesseract. This was because Tesseract had good reviews in the Linux OCR comparison article that I mentioned. (Still, I may eventually try kooka as it is a more all-in-one solution, including scanning.)

gImageReader installation. So while looking for a front end for Tesseract I stumbled across reference to gImageReader which is a simple PyGtk (python gtk) front-end to tesseract. Installation of gImageReader (with French, English and German dictionaries) was initially tricky for me. I could find no prepackaged rpm. I did find the tarball and also a Fedora rpm and a .deb package. There was no .src rpm. I did not like the idea of installing the Fedora rpm, nor converting the .deb to an rpm with Alien, so I decided to build from the tarball. After a couple of iterations to get what I had thought was all the dependencies, I did a successful build. Unfortunately I could NOT get checkinstall to build an rpm, so I had to do an installation with “make install” (and no rpm).

But gImageReader would not run. It kept crashing as soon as I tried to run it. In the end I came to the realization that I was missing python-enchant, and so I searched for and found python-enchant on the build service. While looking at the repository with python-enchant

http://download.opensuse.org/repositories/openSUSE:/11.2:/Contrib/standard/x86_64/

I noted it had its own packaged versions of tesseract, and also the language files for tesseract for French (tesseract-data-fra-2.04-1.1.x86_64.rpm) and German (tesseract-data-deu-2.04-1.1.x86_64.rpm) so I installed those [there is also Italian, Dutch and Spanish languages for the tesseract OCR program on that site].

This time it worked.

gImageReader as a Tesseract front end does a better job than ocrad.

Here is a screen shot:
http://www.imagebam.com/image/a953c4103433962

Note while I get an error that the French dictionary is not installed, I do appear to have the French OCR language installed.

Are you aware that Google Docs offers an OCR service? You just upload an image of text (higher resolution gives better results of course).

Google Docs OCR

No, I was not aware of it. I took a look at that page, surfed it a bit, and came away puzzled as to how to use it (probably puzzled because I know nothing about Google Docs). Fortunately gImageReader is working for me for now. I may try Kooka later.

A lot of work from your side. The result, as you show it, is not bad, but the à is still made into an a.

Actually, how come Kooka does not support tesseract? Wouldn’t it be logical to build a GUI that is totally modular, so that different engines can simply be swapped in? Just a stupid question from a non-programmer. (A bit like having Kopete run with only one protocol instead of having it as it is: one piece of software for many protocols.)

Most of the time the " à " IS made into an " a " but the reliability is not 100%.

To the best of my knowledge kooka does NOT support tesseract.

I just finished installing and trying kooka. As near as I can determine there is no KDE4 version, so it was an older KDE3 version that is on the unstable KDE build service. I could not find a French language pack for it. So while kooka is kind of neat in that it has the scanning function built in, its OCR failed miserably with the French language. I could not see an easy way to have kooka include the French language in the OCR functionality (nor the German language).

I don’t know. Maybe. I’m just a new user to this, and not a programmer.

And I have a specific immediate task of going through these books I purchased to help me learn the French language.

There is a service menu for KDE4 and Tesseract:
http://kde-apps.org/content/show.php/OCR+using+Tesseract?content=121289

Works fine for me.

I thought that, as of mid 2007, Kooka was dead, or at least stagnant.

There is more on gimagereader in this thread: How does one meet a python2-devel dependency requirement on openSUSE-11.3

I replaced my openSUSE-11.2 installation with openSUSE-11.3, and then I struggled building gimagereader. malcolmlewis kindly built the package for me, and it works well !

I also installed ispell-french and ispell-german (with the appropriate French and German language dictionaries), while it appears the ispell-american language dictionary was already in place. I may change ispell-american to ispell-british, as I work in Europe, and being Irish-Canadian myself, arguably ispell-british is closer to the spelling I should use. In any case, the spell check works so as to help me correct the OCR output.

It’s working very nicely for me now! I thus recommend gImageReader as packaged by malcolmlewis. His repository is here for 11.3 (and the application on his repository is called python-gimagereader):

http://download.opensuse.org/repositories/home:/malcolmlewis:/Python/openSUSE_11.3/ 

Note one first needs tesseract and python-enchant installed. I installed tesseract and python-enchant from this repository:

http://download.opensuse.org/repositories/openSUSE:/11.3:/Contrib/standard/

Here are a couple of pix/screenshots of gimagereader in action illustrating the spell check:

http://www.imagebam.com/image/83420c114207270

http://www.imagebam.com/image/ca166e114207307
Follow the links for a larger view.

I should mention that I have the gimagereader and tesseract combination working with version 2.04-1.2 of tesseract. I have not been able to get this working with version 3 of tesseract.

On 2011-01-08 11:36, oldcpu wrote:
> Here are a couple of pix/screenshots of gimagereader in action
> illustrating the spell check:
>
> ‘[image: http://thumbnails31.imagebam.com/11421/83420c114207270.jpg]’
> (http://www.imagebam.com/image/83420c114207270)

Isn’t the resolution a bit low? I mention because I see a “:” confused with
a “z” in the first sentence. Even with that, I think it works quite well.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

I don’t think it was the resolution. More likely the OCR recognition settings. I’m only now starting to learn that it can be tuned a bit. There are different spell check and translation libs (such as aspell, ispell, gtkspell, tesseract-data-fr, tesseract-data-de) and I think those make a difference as to the options available.

I suspect if I had this running with the newer tesseract version 3 it would have better recognition, but thus far I can only get version 2 running, mainly I think because I don’t have tesseract-data-<languages> built for version 3 for English, German, or French.

I also installed gimagereader on my 32-bit openSUSE-11.3 LXDE PC (an old athlon-1100 w/1GB RAM). It was a bit difficult, because although it installed ok, when I tried to run I kept getting a “NO GTK found” error, and I could not for a while figure out what was causing that, as I had all sorts of GTK apps/libs installed. In the end, after an install of python-opengl, python-wxGTK-examples, python-poppler, and libwxsvg0 it worked. I don’t know which of those allowed it to work.

I note my 64-bit openSUSE-11.3 KDE-4.4.4 PC does not have “libwxsvg0”, nor “python-wxGTK-examples”, nor “python-opengl”. But gimagereader works on my 64-bit openSUSE-11.3 KDE-4.4.4 PC. That leaves “python-poppler” as an app that helped this work, but I find that a bit difficult to believe. Still, I do note in the tarball for gimagereader that “python-poppler” is a gimagereader dependency requirement to build from the tarball, so maybe it is a requirement to run also.

If it was ‘python-poppler’ that was initially stopping gimagereader from running on my 32-bit LXDE desktop, the error message “No GTK found” was not very descriptive !
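One way to narrow down this kind of failure is to ask Python directly which modules it can import, rather than relying on the application's error message. A rough sketch follows; the module names gtk (python-gtk), enchant (python-enchant) and poppler (python-poppler) are assumptions and are not taken from the gimagereader documentation, and the function names are hypothetical:

```shell
#!/bin/sh
# Python interpreter to test; override with PYTHON=/path/to/python if needed.
PYTHON=${PYTHON:-python}

# Succeeds if the named Python module can be imported.
have_module() {
    "$PYTHON" -c "import $1" >/dev/null 2>&1
}

# Print one status line per module name given as an argument.
report_modules() {
    for module in "$@"; do
        if have_module "$module"; then
            echo "$module: ok"
        else
            echo "$module: MISSING"
        fi
    done
}
```

Running "report_modules gtk enchant poppler" on the machine in question would list any module that fails to import as MISSING.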

Anyway, gimagereader works now on LXDE on this old PC and that’s great to see. My thanks again to malcolmlewis for packaging gimagereader.

I made a mistake in the quoted post. It was an athlon-1100 and NOT an athlon-2800.

I installed gimagereader on a 3rd PC, this being my 32-bit openSUSE-11.3 KDE-4.4.4 athlon-2800 PC. This install went smoother, because I had a better idea as to what was needed.

When I started, I noted that the following applications (that I think might be needed) were already installed from the OSS repository:

  • python
  • python-gtk
  • python-imaging
  • wxGTK
  • wxGTKcompat
  • wxGTKgl
  • poppler-data (not sure this was necessary)
  • poppler-tools (not sure this was necessary)
  • gtkspell
  • ispell
  • ispell-american
  • aspell-en
  • hunspell-tools

I also installed from OSS the following:

  • python-imaging-sane
  • gtkspell-lang
  • ispell-french
  • ispell-german
  • aspell-fr
  • aspell-de

I also added the repository:

http://download.opensuse.org/repositories/openSUSE:/11.3:/Contrib/standard/

from which I installed:

  • tesseract
  • tesseract-data-fra
  • tesseract-data-deu
  • python-enchant

and I added malcolmlewis’s python repos:

http://download.opensuse.org/repositories/home:/malcolmlewis:/Python/openSUSE_11.3/ 

from which I installed:

  • python-poppler
  • python-gimagereader

As per my practice, after installing from the two new repositories, I then removed the repositories from zypper/yast.
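The add, install, remove cycle described above can be sketched with zypper as follows. The repository aliases (tmp-contrib, tmp-python) and the function name are hypothetical. The script only prints the commands so they can be reviewed first; the commands themselves need to be run as root:

```shell
#!/bin/sh
CONTRIB=http://download.opensuse.org/repositories/openSUSE:/11.3:/Contrib/standard/
PYREPO=http://download.opensuse.org/repositories/home:/malcolmlewis:/Python/openSUSE_11.3/

# Print the zypper commands for: add repository, install packages, remove repository.
# Usage: repo_cycle URL ALIAS PACKAGE...
repo_cycle() {
    url=$1
    alias_name=$2
    shift 2
    echo "zypper addrepo $url $alias_name"
    echo "zypper install $*"
    echo "zypper removerepo $alias_name"
}

repo_cycle "$CONTRIB" tmp-contrib tesseract tesseract-data-fra tesseract-data-deu python-enchant
repo_cycle "$PYREPO" tmp-python python-poppler python-gimagereader
```

Removing each repository straight after the install keeps later updates from pulling in other packages from it, which matches the practice described above.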

It’s possible I installed more apps than needed, but gimagereader does work well now on this 32-bit openSUSE-11.3 KDE-4.4.4 PC.