Thread: OCR supporting Greek and able to embed text as layer in pdf?

  1. #1
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    OCR supporting Greek and able to embed text as layer in pdf?

    Hello community,

    I am looking for optical character recognition (OCR) software that supports the Greek language and can embed the result (preferably after some manual corrections) in the original PDF as a text layer.

    I already found tesseract (which I use with this service menu: http://kde-apps.org/), and I was positively surprised by its performance, but unfortunately it cannot import the result as a text layer into the original PDF.

    CuneiForm, on the other hand, seems to support embedding the text in the PDF file (with some script, the name of which I cannot recall), but unfortunately it does not support Greek.

    Any ideas?
    Main box: OpenSuse 12.3/KDE 4.10 64bit
    Older Box: OpenSuse 12.2/KDE 4.8.5 64bit (my mediabox)
    Laptop: Debian Wheezy/KDE 4.8.4 64bit

  2. #2
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    OK, after a bit more searching at freecode, I found this: PDF OCR X. Of course it does not meet my needs, because:
    1. it does not seem to run on Linux
    2. it is closed source.
    However, it is written in Java and uses tesseract as its backend (tesseract is, after all, licensed under the Apache 2.0 license), which should mean that what I am looking for probably exists.

  3. #3
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    Well, here is one more indication that tesseract can be used to embed the recognized text back into the PDF as a layer:
    "Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis."
    (from Wikipedia). Maybe someone knows how to use hocr2pdf to do this (or at least where to find and install hocr2pdf)?
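
    To illustrate what "hOCR positional information" means: hOCR is HTML in which each recognized element carries a bounding box in its title attribute, and that box is what lets a tool place invisible text over the scanned page. Below is a minimal sketch that pulls words and their boxes out of an hOCR fragment using only the Python standard library. The sample fragment is hand-written for illustration, not real tesseract output, and the names WordBoxParser and extract_words are hypothetical.

```python
from html.parser import HTMLParser
import re

# Hand-written sample for illustration only; real hOCR output nests
# page, block, paragraph, line and word elements more deeply.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 36 92 618 184'>
    <span class='ocr_word' title='bbox 36 92 210 184'>Καλημέρα</span>
    <span class='ocr_word' title='bbox 240 92 618 184'>κόσμε</span>
  </span>
</div>
"""

# hOCR stores coordinates as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs for word-level elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current_box = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Word elements are marked ocr_word or ocrx_word, depending on the producer.
        if "ocr_word" in classes or "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._current_box = tuple(int(v) for v in match.groups())

    def handle_data(self, data):
        # The first non-blank text after a word tag is taken as that word's text.
        if self._current_box is not None and data.strip():
            self.words.append((data.strip(), self._current_box))
            self._current_box = None


def extract_words(hocr_text):
    """Return the list of (word, bounding box) pairs found in hocr_text."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, box in extract_words(SAMPLE_HOCR):
        print(word, box)
```

    A tool such as hocr2pdf or pdfbeads does roughly the same parsing and then writes each word at its box position as hidden text over the page image.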

  4. #4
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    Another potentially useful application for what I am trying to do seems to be pdfbeads. I have not tried it yet, but I will report back as soon as I do.

  5. #5
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    OK, I have installed pdfbeads (it is a ruby gem). However, I have not figured out yet how to make tesseract export an hocr file. Any clues?

  6. #6
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    So, the tesseract version I have installed (3.00, from here: openSUSE:/Factory:/Contrib/openSUSE_12.1/) does not support exporting hOCR. I will have to try compiling it from source.

  7. #7
    Join Date: Feb 2009 | Location: Spain | Posts: 25,547

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    On 2012-03-20 11:46, Nikos78 wrote:
    >
    > OK, I have installed pdfbeads (it is a ruby gem). However, I have not
    > figured out yet how to make tesseract export an hocr file. Any clues?


    I see your several posts. I have nothing to offer, sorry, I just post to
    say that some are reading them :-)

    I have tried OCR in the past, went away very discouraged. I think you are
    having better luck than me.

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 11.4 x86_64 "Celadon" at Telcontar)

  8. #8
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    > just post to
    > say that some are reading them :-)

    Thank you, I was about to start my next post with "Dear Diary"...

    Anyway, I tried compiling from source. I wanted to create an rpm so that it would be easier to uninstall or update if needed. The source archive contains a .spec file, but it seems to be for the 3.0.0 version. Hence, I seem to have reached a dead end until somebody packages tesseract 3.0.1 (or 3.0.0 with hocr functionality) for OpenSuse. (I obviously lack the skills to do this myself.)

  9. #9
    Join Date: Feb 2009 | Location: Spain | Posts: 25,547

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    On 2012-03-20 14:16, Nikos78 wrote:
    >
    >> just post to
    >> say that some are reading them :-)


    > thank you, I was about to start my next post with "Dear Diary"..
    >
    > Anyway, I tried compiling from source. I wanted to create an rpm so it
    > is easier to uninstall or update if needed. The source archive contains
    > a .spec file, but it seems to be for the 3.0.0 version. Hence, I seem to
    > have reached a dead-end until somebody packages tesseract 3.0.1 (or
    > 3.0.0 with hocr functionality) for OpenSuse. (I obviously lack the
    > skills for doing this myself).


    There is a possibility to create the rpm with "checkinstall". The idea is
    to replace the "make install" phase with "checkinstall", which captures
    data from the make install process and creates an rpm on the fly.

    It has two bugs, though: one is that it fails to create the destination
    directories. The trick is to do a "make install" first followed by a
    "checkinstall".

    The other is that it may create a faulty .spec file, with a "requires"
    field that contains only a comma (,). The trick then is to edit
    /etc/checkinstallrc and change "REVIEW_SPEC" to 1. In mid run it pops up
    with /the/ editor and you can remove the faulty line.

    The editor it uses is "vi". You can change that by changing the EDITOR
    environment variable (for root).
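
    The procedure above could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, the rc file is assumed to use plain NAME=value lines, and "nano" is just an example editor. Nothing is executed here; the script only prepares the rc text and the list of commands, which would have to be run as root from the build directory.

```python
import re

# Assumption: /etc/checkinstallrc uses shell-style NAME=value lines,
# possibly commented out with a leading '#'.
REVIEW_SPEC_RE = re.compile(r"^[ \t]*#?[ \t]*REVIEW_SPEC[ \t]*=.*$", re.MULTILINE)


def enable_spec_review(rc_text):
    """Return rc_text with REVIEW_SPEC set to 1.

    With this setting checkinstall opens the generated .spec file in an
    editor mid-run, so a faulty "requires" line can be deleted by hand.
    """
    if REVIEW_SPEC_RE.search(rc_text):
        return REVIEW_SPEC_RE.sub("REVIEW_SPEC=1", rc_text, count=1)
    # Setting not present at all: append it.
    return rc_text.rstrip("\n") + "\nREVIEW_SPEC=1\n"


def build_steps(editor="nano"):
    """Return (command, extra_environment) pairs in the order described.

    "make install" runs first so that the destination directories exist;
    "checkinstall" then repeats the install and packages the result.
    EDITOR selects the editor used for the spec review instead of vi.
    """
    return [
        (["make", "install"], {}),
        (["checkinstall"], {"EDITOR": editor}),
    ]


if __name__ == "__main__":
    print(enable_spec_review("REVIEW_SPEC=0\n"), end="")
    for command, env in build_steps():
        print(" ".join(command), env)
```

    Each pair from build_steps() could be passed to subprocess.run(command, env={**os.environ, **extra}, check=True) once the rc file has been updated.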


    But I would not do all this unless you know that you want to keep that
    program ;-)


    The purists will say that you should instead learn to create spec files and
    use the buildservice. To each his own, I use checkinstall ;-)

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 11.4 x86_64 "Celadon" at Telcontar)

  10. #10
    Join Date: Dec 2008 | Location: Athens | Posts: 296

    Re: OCR supporting Greek and able to embed text as layer in pdf?

    I tried the above procedure, but to no avail (thanks anyway, since I ended up using checkinstall to install another package).
    The good news is that I managed to install tesseract 3.0.1 from here, and it does output hOCR.
    So I am back on track for putting the hOCR back into the PDF.

    Of course, any idea in this direction is highly appreciated.
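
    One possible direction, as a rough sketch only: run tesseract with its "hocr" config on a page image, correct the resulting hOCR file by hand if needed, then feed it to hocr2pdf (from the ExactImage package) together with the same image. The file names are placeholders, the helper names are hypothetical, and the pipeline assumes the PDF has already been split into one image per page. "ell" is tesseract's language code for Greek.

```python
import subprocess
from pathlib import Path


def tesseract_hocr_command(image, output_base, lang="ell"):
    """Command that asks tesseract for hOCR output instead of plain text.

    The trailing "hocr" names a tesseract config file that switches on
    hOCR output; tesseract appends the file extension to output_base itself.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "hocr"]


def hocr2pdf_command(image, pdf):
    """Command for hocr2pdf, which reads the hOCR data from standard input."""
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def find_hocr_file(output_base):
    """Locate the hOCR file for output_base.

    Assumption: older tesseract releases write <base>.html and newer ones
    write <base>.hocr, so both are checked.
    """
    for suffix in (".html", ".hocr"):
        candidate = Path(str(output_base) + suffix)
        if candidate.exists():
            return candidate
    raise FileNotFoundError("no hOCR output found for %s" % output_base)


def image_to_searchable_pdf(image, pdf, lang="ell"):
    """OCR one page image and write a PDF with a hidden text layer."""
    image = Path(image)
    output_base = image.with_suffix("")
    subprocess.run(tesseract_hocr_command(image, output_base, lang), check=True)
    hocr_file = find_hocr_file(output_base)
    # Manual corrections to hocr_file would happen here, before embedding.
    with open(hocr_file, "rb") as hocr:
        subprocess.run(hocr2pdf_command(image, pdf), stdin=hocr, check=True)


if __name__ == "__main__":
    print(" ".join(tesseract_hocr_command("page1.tif", "page1")))
    print(" ".join(hocr2pdf_command("page1.tif", "page1.pdf")))
```

    Calling image_to_searchable_pdf("page1.tif", "page1.pdf") requires both tesseract (with the Greek language data) and hocr2pdf to be installed; the per-page PDFs would then still have to be merged back into one document.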
