OCR scan to text

I have a multipage phone directory that I want to put into a database or spreadsheet so that it is searchable. I am trying to scan and convert it. I can convert it to PDF but not to text. I am using hplip and xsane. Trying to save as text gives an error message that gocr is not available. I began to install that via YaST, but it is over 2,200 files, so I aborted the install. Next, I tried tesseract, which installs but does not seem to run. I uninstalled both.

Is there an easy way to copy these pages of phone numbers and addresses to make them searchable? The newest forum posting on OCR is at least two years old and didn’t give me anything I hadn’t already tried. Some postings go back a decade!

I tried opening the PDF as a Word document, and it just occurred to me to try opening it with a spreadsheet. If anyone has had luck in doing this, please share!
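Once the pages exist as plain text, however the OCR ends up being done, a spreadsheet can import them as CSV. The sketch below assumes each directory entry sits on one line with its columns separated by two or more spaces; that layout is a guess about the directory, and the function name columns_to_csv is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn space-aligned text columns into quoted CSV.
# Assumption: columns are separated by runs of two or more spaces.
columns_to_csv() {
    awk -F '  +' '
        # trim leading and trailing blanks so they do not create empty fields
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }
        # skip blank lines
        $0 == "" { next }
        {
            line = ""
            for (i = 1; i <= NF; i++) {
                f = $i
                gsub(/"/, "\"\"", f)    # double any embedded quotes
                line = line (i > 1 ? "," : "") "\"" f "\""
            }
            print line
        }' "$@"
}
```

Something like columns_to_csv directory.txt > directory.csv would then give a file that LibreOffice Calc can open. Real OCR output is rarely this tidy, so expect some manual cleanup.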

I installed tesseract and tesseract -v delivered the version, so it runs.

 tesseract -v
tesseract 3.05.01
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.5.0 : libopenjp2 2.3.0


Maybe try again?

I just installed pdfsandwich from the publishing repo. The simple command

pdfsandwich -lang deu pdfname.pdf

produced a sandwich PDF, and the OCR quality really was acceptable. It is a command-line tool that runs on multiple threads, and it was fast!

It seems like great, easy-to-use software.
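Since the goal is searchable text, the text layer of the sandwich PDF can be dumped to a plain file afterwards. This is only a sketch: it assumes pdftotext (from the poppler tools) is installed next to pdfsandwich, and the function name pdf_to_text and the PDFSANDWICH/PDFTOTEXT override variables are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: OCR a scanned PDF, then extract the text layer.
# PDFSANDWICH and PDFTOTEXT are illustrative override variables; neither
# tool defines them. By default the real commands are used.
pdf_to_text() {
    src=$1                      # scanned input, e.g. pdfname.pdf
    lang=${2:-eng}              # OCR language code, e.g. deu
    base=${src%.pdf}
    # -lang selects the OCR language, -o names the sandwich PDF explicitly
    "${PDFSANDWICH:-pdfsandwich}" -lang "$lang" -o "${base}_ocr.pdf" "$src" || return 1
    # -layout tries to keep columns aligned, which suits directory listings
    "${PDFTOTEXT:-pdftotext}" -layout "${base}_ocr.pdf" "${base}.txt" || return 1
    echo "${base}.txt"
}
```

Calling pdf_to_text pdfname.pdf deu should leave pdfname.txt beside the PDF, which can then be searched with grep -i.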

I did get tesseract to install. I might have had a problem confusing it with the first-person shooter of the same name!

The question remains: how do I get it to work? I see no option in LibreOffice to use it. It is not listed as an extension or on any menu that I saw. The --help provided nothing to me that appears to answer the question of running it, only setting up options. Tried running it from the CLI, but nothing happened. Adding the file name to the CLI only brought up the help list. I was hoping for a GUI or at least a GUI interface to LibreOffice.

I could not find pdfsandwich.

Ok, problem solved.

The question remains: how do I get it to work? I see no option in LibreOffice to use it. It is not listed as an extension or on any menu that I saw. The --help provided nothing to me that appears to answer the question of running it, only setting up options.

No, it is a command-line tool. Read an introduction to OCR.

Tried running it from the CLI, but nothing happened. Adding the file name to the CLI only brought up the help list.

man tesseract gives the manpage.
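As far as I know, tesseract 3.x expects an image (TIFF, PNG and so on, not a PDF) followed by an output base name, and it only prints the usage text when the base name is missing. A minimal sketch under that assumption, with an invented function name ocr_images and an invented TESSERACT override variable:

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on each page image.
# Assumed call pattern: tesseract <image> <outputbase> -l <lang>,
# which writes <outputbase>.txt.
# TESSERACT is an illustrative override variable, not a tesseract setting.
ocr_images() {
    lang=$1; shift
    for img in "$@"; do
        out=${img%.*}           # page1.png -> page1 (tesseract appends .txt)
        "${TESSERACT:-tesseract}" "$img" "$out" -l "$lang" || return 1
        echo "$out.txt"
    done
}
```

For example, ocr_images eng page1.png page2.png should produce page1.txt and page2.txt; the page file names here are placeholders for whatever xsane saved.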

I was hoping for a GUI or at least a GUI interface to LibreOffice.

Yes, as far as I remember, there once was gImageReader.

I could not find pdfsandwich.

https://software.opensuse.org/package/pdfsandwich?search_term=pdfsandwich

Hi
Yes, gImageReader is still there :wink: It is just in my home repository (I have been waiting for 3.3.0 to appear and may then push it to the Publishing repo); as you have seen, pdfsandwich is already there…

https://software.opensuse.org/package/gimagereader

Your rpm doesn’t contain a binary. Only doc and icons. Or did I miss something?

Hi
The gtk and/or qt5 package :wink:
https://build.opensuse.org/package/binaries/home:malcolmlewis:openSUSE_General/gimagereader/openSUSE_Tumbleweed

Trying to install the Leap 15 versions fails with:

libQt5Core.so.5(Qt_5.11)(64bit) benötigt von gimagereader-qt5-3.2.3-1.26.x86_64 wird nirgends zur Verfügung gestellt

Translation: nothing provides libQt5Core.so.5(Qt_5.11)(64bit), which gimagereader-qt5-3.2.3-1.26.x86_64 requires.

Correct, because Leap 15 comes with:

/usr/lib64/libQt5Core.so.5
/usr/lib64/libQt5Core.so.5.9
/usr/lib64/libQt5Core.so.5.9.4
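A mismatch like this can be checked before installing by asking rpm whether any installed package provides the capability named in the error. The sketch below is only an illustration; the function name check_capability and the RPM override variable are invented.

```shell
#!/bin/sh
# Hypothetical helper: report whether an installed package provides a
# capability such as 'libQt5Core.so.5(Qt_5.11)(64bit)'.
# RPM is an illustrative override variable; by default the real rpm runs.
check_capability() {
    cap=$1
    # rpm exits non-zero when no installed package provides the capability
    if provider=$("${RPM:-rpm}" -q --whatprovides "$cap" 2>/dev/null); then
        echo "provided by: $provider"
    else
        echo "missing: $cap"
        return 1
    fi
}
```

The requirements of a downloaded package can be listed with rpm -qp --requires on the .rpm file and fed to this check one at a time.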

Would you be so kind as to compile it again? Thanks!

Hi
This thread is about Tumbleweed, hence my link… :wink:
You need the versions from https://download.opensuse.org/repositories/home:/malcolmlewis:/openSUSE_General/openSUSE_Leap_15.0/

This is one of the reasons it’s better to start a new thread, even if it is about the same thing, if you’re on a different release, since threads have prefixes :wink:

hmmm… downloaded and installed both tesseract the game and tesseract-ocr the utility. Also installed gimagereader. Neither will start from the menu or CLI. Going to reboot to see if that means anything, but wanted to post before I lost the thread.

I have a blog on this: https://forums.opensuse.org/entry.php/77-openSUSE-12-1-to-openSUSE-Leap-42-3-with-gImageReader-and-Tesseract

Actually - I have installed this on Leap-15.0 as well, but I have yet to update the blog.

I have not tried on Tumbleweed, but I assume same technique would work.

You are right, of course.

However, I installed the Leap 15 version and it requests the newer version of libQt5Core, as I wrote. Have a look at it. And that said, let’s drop this issue.

Hi
Hmmm, from the build log from Leap 15.0 (@193 seconds Requires:): libQt5Core.so.5(Qt_5.9)
https://build.opensuse.org/build/home:malcolmlewis:openSUSE_General/openSUSE_Leap_15.0/x86_64/gimagereader/_log

I began by looking for the python files and they were not available for Tumbleweed… at least in the repos I have set up. But thank you for the response.

Hi
There are no python requirements anymore, I tested the GTK version and it’s working fine on Tumbleweed…

Hi Malcolm,
I am on TW Plasma and wanted to give tesseract a go and saw your packaging of gimagereader-qt5
It failed to install because of libpodofo.so 0.9.6 which is required.
I found out that TW has version 0.9.7 so gimagereader won’t start.
Can you help me out getting it to work?
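When an installed binary refuses to start because of a library version mismatch like this, ldd usually shows which shared library is unresolved. A small sketch, with an invented function name missing_libs and an invented LDD override variable:

```shell
#!/bin/sh
# Hypothetical helper: list shared libraries a binary needs but the
# dynamic loader cannot find.
# LDD is an illustrative override variable; by default the real ldd runs.
missing_libs() {
    # ldd marks unresolved libraries with "not found"; print their names
    "${LDD:-ldd}" "$1" | awk '/not found/ { print $1 }'
}
```

Assuming the binary is installed as gimagereader-qt5, running missing_libs "$(command -v gimagereader-qt5)" would be expected to print the libpodofo entry if that is the library the loader cannot resolve.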

Thanks in advance
Karel

Hi Karel
Modified the spec file, should be rebuilt soon…

Thank you very much!!!

Karel

Let me echo the thank-you to Malcolm for the gimagereader packaging.

Last night I successfully installed gimagereader on my Lenovo X1 Carbon gen-9 laptop. Granted, I have a LEAP-15.3 install and not Tumbleweed, but nevertheless I very much appreciate the work done to first package this app for openSUSE and then keep packaging it over many years. I’ve maintained my blog on this here: https://forums.opensuse.org/entry.php/77-openSUSE-12-1-to-openSUSE-Leap-15-3-with-gImageReader-and-Tesseract?bt=1232#comment1232 and I think the approach used for LEAP-15.3 should work for Tumbleweed; only the repositories need to be changed.

The packaging is most appreciated.