OCR supporting Greek and able to embed text as layer in pdf?

Nikos78 · March 19, 2012, 8:54am

Hello community,

I am searching for Optical Character Recognition (OCR) software that supports the Greek language and is able to embed the result (preferably after some manual corrections) into the original pdf as a text layer.

I already found tesseract (which I use with this service menu: http://kde-apps.org/), and I was positively surprised by its performance. Unfortunately, it cannot import the result as a text layer in the original pdf.

CuneiForm, on the other hand, seems to support embedding the text in the pdf file (with some script, the name of which I cannot recall), but unfortunately it does not support Greek.

Any ideas?

Nikos78 · March 19, 2012, 12:47pm

OK, after a bit more searching at freecode, I found this: PDF OCR X. Of course, it does not meet my needs, because:

- it does not seem to run on Linux
- it is closed source.

However, it uses tesseract as its backend (tesseract is, after all, licensed under the Apache 2.0 license) and is written in Java, which should mean that what I am searching for probably exists.

Nikos78 · March 20, 2012, 7:57am

Well, here is one more indication that tesseract should be able to be used for embedding the text file back to the pdf as a layer.

> since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis.

(from Wikipedia). Maybe someone knows how to use hocr2pdf to do this (or at least where to find and install hocr2pdf)?

Nikos78 · March 20, 2012, 9:54am

One more potentially useful application I found for what I am trying to do seems to be pdfbeads. I have not tried that yet, but I will report back as soon as I do.

Nikos78 · March 20, 2012, 11:41am

OK, I have installed pdfbeads (it is a ruby gem). However, I have not figured out yet how to make tesseract export an hocr file. Any clues?

Nikos78 · March 20, 2012, 11:57am

So, the tesseract version I have installed (3.00, from here: openSUSE:/Factory:/Contrib/openSUSE_12.1/) does not support exporting as hocr. I will have to try compiling it from source.

robin_listas · March 20, 2012, 12:13pm

On 2012-03-20 11:46, Nikos78 wrote:
>
> OK, I have installed pdfbeads (it is a ruby gem). However, I have not
> figured out yet how to make tesseract export an hocr file. Any clues?

I see your several posts. I have nothing to offer, sorry, I just post to
say that some are reading them

I have tried OCR in the past, went away very discouraged. I think you are
having better luck than me.

–
Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Nikos78 · March 20, 2012, 2:09pm

> just post to
> say that some are reading them

Thank you, I was about to start my next post with “Dear Diary”. ;)

Anyway, I tried compiling from source. I wanted to create an rpm so that it is easier to uninstall or update if needed. The source archive contains a .spec file, but it seems to be for version 3.0.0. Hence, I seem to have reached a dead end until somebody packages tesseract 3.0.1 (or 3.0.0 with hocr functionality) for openSUSE. (I obviously lack the skills for doing this myself.)

robin_listas · March 20, 2012, 2:43pm

On 2012-03-20 14:16, Nikos78 wrote:
>
>> just post to
>> say that some are reading them

> thank you, I was about to start my next post with “Dear Diary”.;).
>
> Anyway, I tried compiling from source. I wanted to create an rpm so it
> is easier to uninstall or update if needed. The source archive contains
> a .spec file, but it seems to be for the 3.0.0 version. Hence, I seem to
> have reached a dead-end until somebody packages tesseract 3.0.1 (or
> 3.0.0 with hocr functionality) for OpenSuse. (I obviously lack the
> skills for doing this myself).

There is a possibility to create the rpm with “checkinstall”. The idea is
to replace the “make install” phase with “checkinstall”, which captures
data from the make install process and creates an rpm on the fly.

It has two bugs, though: one is that it fails to create the destination
directories. The trick is to do a “make install” first followed by a
“checkinstall”.

The other is that it may create a faulty .spec file, with a “requires”
field that contains only a comma (,). The trick then is to edit
/etc/checkinstallrc and change “REVIEW_SPEC” to 1. In mid run it pops up
with /the/ editor and you can remove the faulty line.

The editor it uses is “vi”. You can change that by changing the EDITOR
environment variable (for root).

But I would not do all this unless you know that you want to keep that
program

The purists will say that you should instead learn to create spec files and
use the buildservice. To each his own, I use checkinstall

–
Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)
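For anyone following along, here is a rough Python sketch of the two workarounds described above. It is not part of checkinstall itself. It assumes that /etc/checkinstallrc uses plain shell-style assignments such as `REVIEW_SPEC=0`, and that the faulty line in the generated spec is a `Requires:` line holding nothing but a comma. The function names are made up for illustration.

```python
import re

# The order suggested in the post: a plain "make install" first (so the
# destination directories exist), then "checkinstall" to build the rpm.
STEPS = (["make", "install"], ["checkinstall"])


def enable_review_spec(rc_text):
    """Return checkinstallrc text with REVIEW_SPEC set to 1.

    Assumption: the setting is a shell-style assignment on its own line.
    If no such line exists, one is appended at the end.
    """
    pattern = re.compile(r"^[ \t]*REVIEW_SPEC=.*$", re.MULTILINE)
    if pattern.search(rc_text):
        return pattern.sub("REVIEW_SPEC=1", rc_text)
    sep = "" if rc_text.endswith("\n") or not rc_text else "\n"
    return rc_text + sep + "REVIEW_SPEC=1\n"


def drop_empty_requires(spec_text):
    """Remove 'Requires:' lines whose value is empty or only commas.

    This mirrors the manual fix: deleting the faulty line in the editor
    that pops up when REVIEW_SPEC is enabled.
    """
    kept = [
        line
        for line in spec_text.splitlines()
        if not re.fullmatch(r"\s*Requires\s*:[\s,]*", line, re.IGNORECASE)
    ]
    return "\n".join(kept) + ("\n" if spec_text.endswith("\n") else "")
```

In practice the spec edit is done by hand in the editor checkinstall opens; the second function only shows which line is meant.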

Nikos78 · March 24, 2012, 2:53pm

I tried the above procedure, but to no avail (thanks anyway though, since I used checkinstall to install another package).
The good news is, I managed to install tesseract 3.0.1 from here. And it does output hocr.
So I am back on track for putting the hocr back to the pdf.

Of course, any idea in this direction is highly appreciated.
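For reference, a minimal sketch of how the hocr export can be driven from Python once a tesseract build with hocr support is installed. It assumes the Greek language data is installed under tesseract's usual code `ell`, and that the `hocr` config file ships with the package. The file names are hypothetical. Depending on the tesseract version, the result is written as `<output_base>.html` or `<output_base>.hocr`.

```python
import shutil
import subprocess


def tesseract_hocr_command(image, output_base, lang="ell"):
    """Build the argument list for a tesseract run that emits hocr.

    "ell" is tesseract's language code for modern Greek; the trailing
    "hocr" names the config file that switches the output to hocr.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "hocr"]


def run_ocr(image, output_base, lang="ell"):
    """Run tesseract if it is on PATH and return the command used."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract not found on PATH")
    cmd = tesseract_hocr_command(image, output_base, lang)
    subprocess.run(cmd, check=True)
    return cmd
```

Calling `run_ocr("page-001.tif", "page-001")` would correspond to typing `tesseract page-001.tif page-001 -l ell hocr` in a shell.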

robin_listas · March 24, 2012, 6:03pm

On 2012-03-24 14:56, Nikos78 wrote:
>
> I tried the above procedure, but to no avail (thanks anyway though,
> since I used checkinstall to install another package).

Welcome.

Checkinstall is not popular with devs, they do not want to solve its bugs.
But many find it useful.

> The good news is, I managed to install tesseract 3.0.1 from ‘here’
> (http://tinyurl.com/7jdlpot). And it does output hocr.
> So I am back on track for putting the hocr back to the pdf.

Good

> Of course, any idea in this direction is highly appreciated.

No, sorry.

–
Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Nikos78 · March 26, 2012, 3:29pm

Wanted to give once again a bit of feedback.
I successfully installed exactimage which contains hocr2pdf.
It seems to be exactly what I was searching for, but unfortunately it apparently does not handle UTF-8 encoded html (hocr) files: it produces a pdf with a text layer as I wanted, but the text in that layer is garbage and not the output of tesseract.
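A sketch of the hocr2pdf step, for readers who want to reproduce it. It assumes hocr2pdf takes the page image with `-i`, the output pdf with `-o`, and reads the hocr on standard input, as the ExactImage documentation describes. The helper names and the non-ASCII check are illustrative only; the check does not fix the garbled text layer, it just flags input that is likely to trigger it.

```python
import subprocess


def hocr2pdf_command(image, pdf):
    """Build the hocr2pdf argument list: -i is the page image, -o the pdf."""
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def has_utf8_non_ascii(hocr_bytes):
    """True if the hocr decodes as UTF-8 and contains non-ASCII text.

    Greek output from tesseract falls into this category, which is the
    case reported above as producing a garbled text layer.
    """
    try:
        text = hocr_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(ord(ch) > 127 for ch in text)


def embed_text_layer(image, hocr_path, pdf):
    """Feed the hocr file to hocr2pdf on stdin and write the pdf."""
    with open(hocr_path, "rb") as hocr:
        subprocess.run(hocr2pdf_command(image, pdf), stdin=hocr, check=True)
```

The shell equivalent of `embed_text_layer("page.tif", "page.html", "page.pdf")` would be `hocr2pdf -i page.tif -o page.pdf < page.html`.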

Nikos78 · March 26, 2012, 4:30pm

> is garbage

Correctly positioned garbage, though.

robin_listas · March 27, 2012, 12:53am

On 2012-03-26 16:36, Nikos78 wrote:
>
>> is garbage
> correctly positioned garbage…though.

Too bad

–
Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

Nikos78 · March 27, 2012, 10:04am

My quest seems to have ended successfully so far. Thank you for joining me on this, Carlos.

The solution is gscan2pdf 1.0.1 (a lot of gtk dependencies for my taste as a kde user, but nobody is perfect) with tesseract 3.0.1. It would be nice if these were available in an official openSUSE repository, but the build service users have filled the gap. Thanks to them!