Searchable pdf text

What would be the best way to convert a few pages of a scanned (image-only) PDF into a searchable PDF with text, using the tools available on Linux? Any suggestions?

An image is an image. You must use OCR if you want it as text. Tesseract is probably the most common OCR tool; at least it is what I use.
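
For a single page image, Tesseract can write a searchable PDF directly. A minimal sketch, assuming the page already exists as an image file and the matching language data is installed; image_to_pdf is only a helper name made up for illustration:

```shell
# Sketch: OCR one page image into a searchable PDF with Tesseract.
# image_to_pdf is a hypothetical helper, not part of Tesseract.
image_to_pdf() {
    local image="$1" lang="${2:-eng}"   # language defaults to English
    local base="${image%.*}"            # strip the extension: page.png -> page
    # The trailing "pdf" selects Tesseract's PDF output; it writes "$base.pdf".
    tesseract "$image" "$base" -l "$lang" pdf && echo "$base.pdf"
}

# Usage (placeholder file name): image_to_pdf page.png eng
```

The result keeps the page image and adds a text layer you can search and copy from.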

@hnimmo I use a program called normcap (which uses tesseract) to capture image text.
https://dynobo.github.io/normcap/

https://build.opensuse.org/package/show/home:malcolmlewis:TESTING/normcap

Is there a normcap rpm package? I am wary of using an unfamiliar process.

@hnimmo it doesn’t build for Leap, hence the suggestion to use the flatpak version. Alternatively, you could use distrobox with Tumbleweed and install my rpm; you would need to download it manually from the openSUSE Build Service, as I don’t publish to any repositories.

This is what I get when I try to download the package:

No data for home:malcolmlewis:TESTING / normcap

@hnimmo just to be clear, the rpm I have built will not work on Leap 15.5.

To download the rpm you need to be logged into the openSUSE Build Service.

Thank you for this tip, I installed the flatpak on Kalpa, it’s an amazing tool.

I’m not quite sure I understand what you are telling me.
What would I need to do to install normcap on Leap 15.5?

@hnimmo the easiest way is the flatpak version (installed as your user, not system-wide):

# Add flathub repository as user
flatpak --user remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo
# Install flatpak applications
flatpak --user install --noninteractive com.github.tchx84.Flatseal \
                  com.github.dynobo.normcap

Flatseal can be used to configure permissions, file access, etc.

There is also an AppImage at
https://github.com/dynobo/normcap

What is the procedure to install an Appimage?

https://docs.appimage.org/introduction/quickstart.html#ref-quickstart
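
In short, the quickstart boils down to marking the downloaded file as executable and then running it. A minimal sketch; run_appimage is a hypothetical helper name, and the file name in the usage line is a placeholder for whatever you downloaded:

```shell
# Sketch: make a downloaded AppImage executable, then start it.
# run_appimage is a hypothetical helper; pass the path of your download.
run_appimage() {
    local file="$1"
    [ -f "$file" ] || { echo "not found: $file" >&2; return 1; }
    # A bare file name would be looked up in PATH, so anchor it to this directory.
    case "$file" in
        */*) ;;
        *) file="./$file" ;;
    esac
    chmod u+x "$file"   # an AppImage is an ordinary executable once marked as such
    "$file"
}

# Usage (placeholder file name): run_appimage ~/Downloads/NormCap.AppImage
```

No installation step is needed; deleting the file removes the application.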

Did that.

The NormCap start screen comes up fine. Pink border lines appear at the screen edges, but when I try to drag them, they disappear as soon as I release the mouse button, and NormCap has apparently terminated.

There is a nice 5-second video on GitHub that shows how to use it. A tutorial is also shown when you open the NormCap AppImage.

You don’t move the edges of the screen. You simply mark the area of the screen in which you want text detected, using the left mouse button.
After that, you get a popup with the detected text…

When I mark the area, the selection box disappears immediately and nothing more happens, even if the text selection is ‘simple’ and I wait for a minute. No symbol appears on the top bar.

It is rapid, take a look in your clipboard.

Indeed, yes. The clipboard has the text. But otherwise, the rest of the process is not according to the tutorial.

I fear that I cannot make normcap work properly on my system…

Use ocrmypdf. Unfortunately it’s not packaged for openSUSE (maybe someone maintains an OBS package, I don’t know), but you can install it with pip (it is a Python package).

https://ocrmypdf.readthedocs.io/en/latest/installation.html#other-linux-packages

Then simply run:

ocrmypdf -l "$lang" "$pdfFileName" "$pdfFileName"

The last two parameters are the input PDF and the desired output PDF file name. Use the same value for both if you want to replace the non-OCR file with the OCR one. Maybe run a few tests first: I use this in a script to OCR the documents I scan, so if something goes wrong and I lose the original PDF, I can simply re-scan the document.
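
If you would rather not overwrite the original, a small wrapper can write the result to a separate file instead. This is only a sketch: the helper name ocr_pdf and the "-ocr" suffix are my own choices for illustration, not something ocrmypdf provides.

```shell
# Sketch: OCR a scanned PDF into a new file, leaving the original untouched.
# ocr_pdf is a hypothetical helper; "-l" selects the Tesseract language.
ocr_pdf() {
    local lang="$1" input="$2"
    local output="${input%.pdf}-ocr.pdf"   # scan.pdf -> scan-ocr.pdf
    if [ ! -f "$input" ]; then
        echo "no such file: $input" >&2
        return 1
    fi
    # Print the output path only when ocrmypdf reports success.
    ocrmypdf -l "$lang" "$input" "$output" && echo "$output"
}

# Usage (placeholder file name): ocr_pdf eng scan.pdf
```

Once you trust the output, you can go back to the in-place form with identical input and output names.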

Ocrmypdf depends on tesseract, so be sure to install it with zypper; the package name is tesseract-ocr.

You may need additional language packs for improved recognition.

tesseract-ocr-traineddata-eng should be installed alongside the main package, IIRC, but if you need additional languages be sure to add, for example, tesseract-ocr-traineddata-ita
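
Before running OCR, you can check whether Tesseract actually sees the language you plan to pass with -l. A minimal sketch; has_tesseract_lang is a hypothetical helper name, and the exact language pack names offered by your repositories are best confirmed with a zypper search:

```shell
# Sketch: check whether a Tesseract language (e.g. "eng", "ita") is installed.
# has_tesseract_lang is a hypothetical helper built on "tesseract --list-langs".
has_tesseract_lang() {
    # Some Tesseract versions print the list on stderr, hence the 2>&1.
    # -x matches the whole line, -F treats the code as a literal string.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# Usage:
#   has_tesseract_lang ita || echo "Italian language data is missing"
#   zypper search tesseract-ocr-traineddata   # shows the packs your repos offer
```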

Paperwork, fortunately available as an AppImage, does a great job of recognizing PDF text.
