
Thread: OCR and Linux

  1. #1
    Join Date
    Mar 2008
    Location
    Europe
    Posts
    25,870
    Blog Entries
    30

    OCR and Linux

    I currently use gImageReader as a front end to the OCR program Tesseract on openSUSE-11.2. Here is the story of how I came to install it.

    ..............

    As part of my efforts to learn to speak French, I purchased some "Lectures CLE" beginners' books in French as reading material. Unfortunately I pulled an oldcpu 'faux pas': even those beginner books (with a limited vocabulary of 700 to 1500 MOTS (words) each) were a bit too advanced for my pathetic French language skills. So I decided to use Google (or Babelfish) translation to help me better understand the content. But retyping the text was a bit too much.

    Hence I decided to try scanning and using an OCR to create an electronic version that I could copy and paste into Google translation (it turns out it does a better translation job than Babelfish).

    Note this is all done on my openSUSE-11.2 installation, but I assume this would be similar on an openSUSE-11.3 install.

    xsane to scan. To scan I use the program "xsane" and I save each scanned page as a .jpg file (at 600 dpi) [and also, initially, as a .pbm file]. I found that at 600 dpi I get superior OCR results (more on that shortly) than I do at 300 dpi. My scanner is an HP-All-In-One Premium pro, and I scan over a wireless network with Linux.
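
    For anyone who prefers the command line to the xsane GUI, the same 600 dpi scan can be sketched with scanimage, the SANE command line tool. This is only a sketch under assumptions: the --resolution and --mode options are provided by the scanner backend and their accepted values differ between devices, and the page-NNN.pnm naming is made up for illustration.

```shell
#!/bin/sh
# Sketch: scan pages from the command line at 600 dpi with scanimage (SANE).
# Assumption: the scanner backend accepts --resolution and --mode Gray;
# check `scanimage --help` for the options your device really offers.

# Build a zero-padded file name such as page-007.pnm (hypothetical naming).
page_name() {
    printf 'page-%03d.pnm' "$1"
}

# Scan one page into the file named by page_name.
scan_page() {
    out=$(page_name "$1")
    scanimage --resolution 600 --mode Gray --format=pnm > "$out"
}

# Example: scan_page 1   # writes page-001.pnm in the current directory
```

    Nothing is scanned until scan_page is called, so the functions can be pasted into a shell and tried one page at a time.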

    gImageReader. To do the OCR reading of these jpeg files, I use the program gImageReader, which is a graphical front end to tesseract-ocr. Selecting gImageReader was not intuitively obvious, as I first tried other OCR programs as noted in this article: Linux OCR Software Comparison [splitbrain.org]. However the long and short of it was that I could not get them all to run, or if I could get them to run, they did not work out well.

    Tesseract. I first tried tesseract-ocr from an rpm available on the build service, but while I managed to install it, I could not get it to work from the command line. There is no "man tesseract" entry and I was too lazy to do many trial and error iterations to get a command line version functioning.
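
    For reference, the usual command line form is "tesseract image outputbase -l lang", and tesseract appends ".txt" to the output base itself. The sketch below wraps that form; the function names are hypothetical, and it assumes the French language data is installed. As far as I know the 2.x releases only read TIFF input, so a JPEG scan would need converting to TIFF first.

```shell
#!/bin/sh
# Sketch: run tesseract on one scanned page.
# Usage form: tesseract <image> <output-base> -l <lang>
# tesseract adds the ".txt" extension to the output base on its own.

# Derive the output base from the image name: page1.tif -> page1
output_base() {
    printf '%s' "${1%.*}"
}

# OCR one image; the language code defaults to "fra" (French).
ocr_page() {
    image=$1
    lang=${2:-fra}
    tesseract "$image" "$(output_base "$image")" -l "$lang"
}

# Example: ocr_page page1.tif fra   # recognized text lands in page1.txt
```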

    gocr. I then tried gocr, installing various versions of it. I tried the version on the build service, but I kept getting a Segmentation fault error when I tried to run it. I tried to build it myself (from the tarball) but again, I could not get gocr to work properly.

    I noted then that my efforts to run Tesseract had failed, and I was thinking maybe the Segmentation fault I encountered with GOCR was because I had bad command line arguments. Maybe I needed a GUI front end? So I went looking for a Tesseract/GOCR front end.

    ocrgui. I tried the program ocrgui, an OCR front end written in C using the GLib and GTK+ frameworks; it supports both Tesseract and GOCR. It includes spell checking using Hunspell, an open source spell checker. So I installed Hunspell, and then tried to get ocrgui to work. However it also would not work, constantly giving me an error whenever I went to run it. I tried rebuilding from the tarball, tried rebuilding from an rpm source file (of an rpm on the ocrgui site), and even tried the rpm on the ocrgui site. No luck. In all cases I encountered the same error when I tried to run the program. Searching on the error, I noted others had encountered the same in the past, with no posted solution.

    ocrad. So then I tried ocrad, and I installed it from an rpm on the build service. If I saved my scanned pages as .pbm files, I was then able to get ocrad to convert them to text via its OCR algorithms with the command:
    Code:
     ocrad scanned-document.pbm
    where 'scanned-document' is the saved name of the scanned document (in .pbm format).
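
    With a whole book to get through, the single command above can be wrapped in a loop. A minimal sketch, assuming all pages sit in one directory as .pbm files; the function names are hypothetical. ocrad prints the recognized text on standard output, so each page is redirected into its own .txt file.

```shell
#!/bin/sh
# Sketch: run ocrad over every .pbm page in a directory, one .txt per page.

# Map an input name to its text file: page1.pbm -> page1.txt
text_name() {
    printf '%s.txt' "${1%.pbm}"
}

# OCR every .pbm file in the given directory (default: current directory).
ocr_all_pbm() {
    dir=${1:-.}
    for page in "$dir"/*.pbm; do
        [ -e "$page" ] || continue   # no matches: the glob stays literal
        ocrad "$page" > "$(text_name "$page")"
    done
}

# Example: ocr_all_pbm ~/scans
```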

    So finally some success, but the ocrad I was using was clearly for English and was missing most of the French accents. Also, it was text/command line based, and I was curious to see if I could get a GUI version working. At this point I had to take a break, and when I came back, rather than pursue an OCRAD front end (which may have led me to kooka, which is probably where I should have gone in the first place since I have a KDE-4 desktop), I decided to look for a front end for Tesseract. This was because Tesseract had good reviews in the Linux OCR comparison article that I mentioned. ... (Still, I may eventually try kooka as it is a more all-in-one solution, including scanning.)

    gImageReader installation. So while looking for a front end for Tesseract I stumbled across a reference to gImageReader, which is a simple PyGtk (python gtk) front end to tesseract. Installation of gImageReader (with French, English and German dictionaries) was initially tricky for me. I could find no prepackaged rpm. I did find the tarball and also a Fedora rpm and a .deb package. There was no .src rpm. I did not like the idea of installing the Fedora rpm, nor converting the .deb to an rpm with Alien, so I decided to build from the tarball. After a couple of iterations to get what I had thought were all the dependencies, I did a successful build. Unfortunately I could NOT get checkinstall to build an rpm, so I had to do an installation with "make install" (and no rpm).

    But gImageReader would not run. It kept crashing as soon as I tried to run it. In the end I came to the realization that I was missing python-enchant, and so I searched for and found python-enchant on the build service. While looking at the repository with python-enchant
    Code:
    http://download.opensuse.org/repositories/openSUSE:/11.2:/Contrib/standard/x86_64/
    I noted it had its own packaged versions of tesseract, and also the language files for tesseract for French (tesseract-data-fra-2.04-1.1.x86_64.rpm) and German (tesseract-data-deu-2.04-1.1.x86_64.rpm), so I installed those [there are also Italian, Dutch and Spanish language files for the tesseract OCR program on that site].

    This time it worked.
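
    A crash at startup like the one described above can often be narrowed down by checking whether the Python module is importable at all. A minimal sketch, assuming python-enchant provides a module named "enchant" and that the interpreter gImageReader runs under is called "python"; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: check whether a Python interpreter can import a given module.
# Assumption: python-enchant installs a module named "enchant".

# Succeeds (exit status 0) only if "$1 -c 'import $2'" works.
has_module() {
    "$1" -c "import $2" >/dev/null 2>&1
}

# Example:
#   if has_module python enchant; then
#       echo "enchant found"
#   else
#       echo "enchant missing: install python-enchant"
#   fi
```

    On an rpm-based system, "rpm -q python-enchant tesseract" gives a similar answer at the package level.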

    gImageReader as a Tesseract front end does a better job than ocrad.

    Here is a screen shot:


    Note while I get an error that the French dictionary is not installed, I do appear to have the French OCR language installed.

  2. #2
    Join Date
    Jun 2008
    Location
    UTC+10
    Posts
    9,941
    Blog Entries
    4

    Re: OCR and Linux

    Are you aware that Google Docs offers an OCR service? You just upload an image of text (higher resolution gives better results of course).

    Google Docs OCR

  3. #3
    Join Date
    Mar 2008
    Location
    Europe
    Posts
    25,870
    Blog Entries
    30

    Re: OCR and Linux

    Quote Originally Posted by ken_yap
    Are you aware that Google Docs offers an OCR service? You just upload an image of text (higher resolution gives better results of course).

    Google Docs OCR
    No, I was not aware of it. I took a look at that page, surfed it a bit, and came away puzzled as to how to use it (probably puzzled because I know nothing about Google Docs). ... Fortunately gImageReader is working for me for now. I may try Kooka later.

  4. #4
    Join Date
    Jun 2008
    Location
    Netherlands
    Posts
    16,107

    Re: OCR and Linux

    A lot of work from your side. The result, as you show it, is not bad, but the à is still made into an a.
    Henk van Velden

  5. #5
    Join Date
    Jan 2009
    Location
    Somewhere in Fictionland
    Posts
    1,645

    Re: OCR and Linux

    Actually, how come Kooka does not support tesseract? Wouldn't it be logical to build a GUI that is totally modular, so that one can just swap in different engines? Just a stupid question from a non-programmer. (A bit like having Kopete run with only one protocol instead of having it as it is: one piece of software for many protocols.)

  6. #6
    Join Date
    Mar 2008
    Location
    Europe
    Posts
    25,870
    Blog Entries
    30

    Re: OCR and Linux

    Quote Originally Posted by hcvv
    A lot of work from your side. The result, as you show it, is not bad, but the à is still made into an a.
    Most of the time the "à" IS made into an "a", but the reliability is not 100%.

  7. #7
    Join Date
    Mar 2008
    Location
    Europe
    Posts
    25,870
    Blog Entries
    30

    Re: OCR and Linux

    Quote Originally Posted by stakanov
    Actually, how come Kooka does not support tesseract?
    To the best of my knowledge kooka does NOT support tesseract.

    I just finished installing and trying kooka. As near as I can determine there is no KDE4 version, so it was an older KDE3 version from the unstable KDE build service that I used. I could not find a French language pack for it. So while kooka is kind of neat in that it has the scanning function built in, its OCR failed miserably with the French language. I could not see an easy way to have kooka include the French language in the OCR functionality (nor the German language).

    Quote Originally Posted by stakanov
    Wouldn't it be logical to build a GUI that is totally modular, so that one can just swap in different engines? Just a stupid question from a non-programmer. (A bit like having Kopete run with only one protocol instead of having it as it is: one piece of software for many protocols.)
    I don't know. Maybe. I'm just a new user to this, and not a programmer.

    And I have a specific immediate task of going through these books I purchased to help me learn the French language.

  8. #8
    Join Date
    Oct 2008
    Location
    near Munich
    Posts
    514

    Re: OCR and Linux

    There is a service menu for KDE4 and Tesseract:
    http://kde-apps.org/content/show.php...content=121289

    Works fine for me.

  9. #9
    Join Date
    Oct 2008
    Location
    Brisbane
    Posts
    299

    Re: OCR and Linux

    I thought that, as of mid 2007, Kooka was dead, or at least stagnant.

  10. #10
    Join Date
    Mar 2008
    Location
    Europe
    Posts
    25,870
    Blog Entries
    30

    Re: OCR and Linux

    There is more on gimagereader in this thread: How does one meet a python2-devel dependency requirement on openSUSE-11.3

    I replaced my openSUSE-11.2 installation with openSUSE-11.3, and then I struggled with building gimagereader. malcolmlewis kindly built the package for me, and it works well!

    I also installed ispell-french and ispell-german (with the appropriate French and German language dictionaries), while it appears the ispell-american language dictionary was already in place. I may change ispell-american to ispell-british, as I work in Europe and am Irish-Canadian, so arguably ispell-british is closer to the spelling I should use. In any case, the spell check works and helps me correct the OCR output.

    It's working very nicely for me now! I thus recommend gImageReader as packaged by malcolmlewis. His repository for 11.3 is here (the application on his repository is called python-gimagereader):
    Code:
    http://download.opensuse.org/repositories/home:/malcolmlewis:/Python/openSUSE_11.3/
    Note one first needs tesseract and python-enchant installed. I installed tesseract and python-enchant from this repository:
    Code:
    http://download.opensuse.org/repositories/openSUSE:/11.3:/Contrib/standard/
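
    The two repositories above could be added and the packages installed with zypper roughly as follows. This is a sketch, not a tested recipe: the repository aliases "contrib" and "mlewis-python" are made-up names, the package names are the ones mentioned in this post, and the commands are only printed so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch: print the zypper steps for the two repositories named above.
# The aliases "contrib" and "mlewis-python" are hypothetical.

CONTRIB_URL='http://download.opensuse.org/repositories/openSUSE:/11.3:/Contrib/standard/'
PYTHON_URL='http://download.opensuse.org/repositories/home:/malcolmlewis:/Python/openSUSE_11.3/'

# Print the commands instead of running them, so they can be checked first.
print_install_steps() {
    echo "zypper addrepo $CONTRIB_URL contrib"
    echo "zypper addrepo $PYTHON_URL mlewis-python"
    echo "zypper refresh"
    echo "zypper install tesseract python-enchant python-gimagereader"
}

# Example: print_install_steps        # review the output
#          print_install_steps | sh   # then run it as root
```

    Language data for French or German would be an additional package from the Contrib repository, if it is provided there for 11.3.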
