I currently use gImageReader as a front end to the OCR program Tesseract on openSUSE-11.2. Here is the story of how I came to install it.
As part of my efforts to learn to speak French, I purchased some “Lectures CLE” beginner books in French as reading material. Unfortunately I pulled an oldcpu ‘faux pas’: even those beginner books (with a limited vocabulary of 700 to 1500 MOTS (words) each) were a bit too advanced for my pathetic French language skills. So I decided to use Google (or Babelfish) translation to help me better understand the content. But retyping the text was a bit too much.
Hence I decided to try scanning the pages and using OCR to create an electronic version that I could copy and paste into Google translation (it turns out it does a better translation job than Babelfish).
Note this is all done on my openSUSE-11.2 installation, but I assume this would be similar on an openSUSE-11.3 install.
xsane to scan. To scan I use the program “xsane” and I save each scanned page as a .jpg file (at 600 dpi) [and also, initially, as a .pbm file]. I found that at 600 dpi I get superior OCR results (more on that shortly) compared to 300 dpi. My scanner is an HP-All-In-One Premium pro, and I scan over a wireless network with Linux.
gImageReader. To do the OCR reading of these jpeg files, I use the program gImageReader, which is a graphical front end to tesseract-ocr. Selecting gImageReader was not intuitively obvious, as I first tried other OCR programs noted in this article: Linux OCR Software Comparison [splitbrain.org]. However, the long and short of it was that I could not get them all to run, and those I could get to run did not work out well.
Tesseract. I first tried tesseract-ocr from an rpm available on the build service, but while I managed to install it, I could not get it to work from the command line. There is no “man tesseract” entry and I was too lazy to do many trial and error iterations to get a command line version functioning.
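For anyone who wants to retry the command line route, here is a minimal sketch of the usual tesseract invocation, wrapped in Python so it can be reused per page. The file names and the helper names (tesseract_command, run_tesseract) are hypothetical, and the "fra" default assumes the French language data is installed. As far as I know the 2.x releases expected an uncompressed TIFF as input rather than a JPEG, which is worth checking if the command fails.

```python
import shutil
import subprocess


def tesseract_command(image, output_base, lang="fra"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


def run_tesseract(image, output_base, lang="fra"):
    """Run tesseract if it is on PATH and return the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    # tesseract appends ".txt" to the output base name itself
    return output_base + ".txt"
```

With these assumptions, run_tesseract("page01.tif", "page01") would leave the recognised text in page01.txt.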
gocr. I then tried gocr, installing various versions of it. I tried the version on the build service, but I kept getting a Segmentation fault error when I tried to run it. I tried to build it myself (from tarball) but again, I could not get gocr to work properly.
I noted then that my efforts to run Tesseract had failed, and I was thinking maybe the Segmentation fault I encountered with GOCR was because I had bad command line arguments. Maybe I needed a GUI front end? So I went looking for a Tesseract/GOCR front end.
ocrgui. I tried the program ocrgui, which is an OCR front end written in C using the GLib and GTK+ frameworks. It supports both Tesseract and GOCR, and it includes spell checking using Hunspell, an open source spell checker. So I installed Hunspell, and then tried to get ocrgui to work. However it also would not work, constantly giving me an error whenever I went to run it. I tried rebuilding from tarball, tried rebuilding from an rpm source file (of an rpm on the ocrgui site), and even tried the rpm on the ocrgui site. No joy. In all cases I encountered the same error when I tried to run the program. Searching on the error, I noted others had encountered the same in the past, with no posted solution.
ocrad. So then I tried ocrad, and I installed it from an rpm on the build service. If I saved my scanned pages as .pbm files, I was then able to get ocrad to convert them to text via its OCR algorithms with the command:
where ‘scanned-document’ is the saved name of the scanned document (in .pbm format).
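For reference, a typical ocrad invocation can be sketched as below. This is an assumption about the general form, not necessarily the exact command I used; the helper names are hypothetical, and the output file simply reuses the scan's name with a .txt extension.

```python
import shutil
import subprocess
from pathlib import Path


def ocrad_command(pbm_file):
    """Build the argument list for: ocrad -o <name>.txt <name>.pbm."""
    pbm_file = Path(pbm_file)
    return ["ocrad", "-o", str(pbm_file.with_suffix(".txt")), str(pbm_file)]


def run_ocrad(pbm_file):
    """Run ocrad on one scanned page if the program is on PATH."""
    if shutil.which("ocrad") is None:
        raise RuntimeError("ocrad not found on PATH")
    subprocess.run(ocrad_command(pbm_file), check=True)
```

Calling run_ocrad("scanned-document.pbm") would then write the recognised text to scanned-document.txt in the same directory.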
So finally some success, but the ocrad I was using was clearly for English and was missing most of the French accents. Also, it was text/command line based, and I was curious to see if I could get a GUI version working. At this point I had to take a break, and when I came back, rather than pursue an OCRAD front end (which may have led me to kooka, which is probably where I should have gone in the 1st place since I have a KDE-4 desktop), I decided to look for a front end for Tesseract. This was because Tesseract had good reviews in the Linux OCR comparison article that I mentioned. … (still I may eventually try kooka as it is a more all in one solution, including scanning).
gImageReader installation. So while looking for a front end for Tesseract I stumbled across a reference to gImageReader, which is a simple PyGtk (python gtk) front end to tesseract. Installation of gImageReader (with French, English and German dictionaries) was initially tricky for me. I could find no prepackaged rpm for openSUSE. I did find the tarball and also a Fedora rpm and a .deb package. There was no .src rpm. I did not like the idea of installing the Fedora rpm, nor converting the .deb to an rpm with Alien, so I decided to build from the tarball. After a couple of iterations to get what I thought were all the dependencies, I did a successful build. Unfortunately I could NOT get checkinstall to build an rpm, so I had to do an installation with “make install” (and no rpm).
But gImageReader would not run. It kept crashing as soon as I tried to run it. In the end I came to the realization that I was missing python-enchant, and so I searched for and found python-enchant on the build service. While looking at the repository with python-enchant, I noted it had its own packaged versions of tesseract, and also the language files for tesseract for French (tesseract-data-fra-2.04-1.1.x86_64.rpm) and German (tesseract-data-deu-2.04-1.1.x86_64.rpm), so I installed those [there are also Italian, Dutch and Spanish language files for the tesseract OCR program on that site].
This time it worked.
gImageReader as a Tesseract front end does a better job than ocrad.
Here is a screen shot:
Note that while I get an error that the French dictionary is not installed, I do appear to have the French OCR language installed.
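The dictionary error presumably concerns the spell checking dictionary used through python-enchant, which is separate from the tesseract language data. A small sketch for checking that side on its own follows; the function name is hypothetical and the "fr_FR" tag is an assumption about how the French dictionary would be named on a given system.

```python
def spelling_dictionary_available(tag="fr_FR"):
    """Return True or False if enchant can answer, or None if python-enchant is missing."""
    try:
        import enchant
    except ImportError:
        # python-enchant (or the enchant library it wraps) is not installed
        return None
    return enchant.dict_exists(tag)
```

A result of False would suggest that a French spelling dictionary package still needs to be installed, even though the OCR itself already handles French.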