Efficient way of converting images to searchable text

I’m trying to find an efficient way of converting old family history docs to searchable text files, starting from jpg, tiff, png, and pdf images. I have hundreds of them, so it will be a long project. Most of the files are images of handwritten documents (English and German), often in tabular form, and frequently of questionable quality.
I have experimented a bit with Transkribus, but preparing the files needs good graphical support, like variable and precise rotation, to maximise OCR accuracy.

Does anyone have any good suggestions?

@hnimmo Looked at normcap? https://dynobo.github.io/normcap/

(As this is not forums feedback, I’ve moved it to the applications section)

…unfamiliar with the installation procedure for NormCap on my system. I cannot find an RPM on that site.

You simply use the Flatpak or AppImage.
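For reference, a minimal sketch of the Flatpak route. The app ID below is the one NormCap uses on Flathub as far as I can tell; verify it on the Flathub page before running.

```shell
# Install and run NormCap via Flatpak (assumes the Flathub remote is configured).
APP_ID="com.github.dynobo.normcap"
if command -v flatpak >/dev/null 2>&1; then
    flatpak install -y flathub "$APP_ID"
    flatpak run "$APP_ID"
else
    echo "flatpak not installed" >&2
fi
```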

I simply use my mobile phone camera (Samsung S22+). There is a little [T] mark in the lower right corner. If you have an iPhone, no idea, but I would be surprised if it didn’t have the same functionality.

I will be using my desktop since all my image files are there.

The Flatpak installation goes to completion, but running NormCap fails with a `missing SCREENSHOT permission` error, whatever that means and whatever it is needed for.
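If I remember the NormCap FAQ correctly, the Flatpak sandbox needs the screenshot portal permission granted once by hand. The exact command below is reconstructed from memory, so treat it as an assumption and check the NormCap FAQ:

```shell
# One-time grant of the screenshot permission to the NormCap Flatpak
# (command reconstructed from the NormCap FAQ -- verify against current docs).
APP_ID="com.github.dynobo.normcap"
if command -v flatpak >/dev/null 2>&1; then
    flatpak permission-set screenshot screenshot "$APP_ID" yes
else
    echo "flatpak not installed" >&2
fi
```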

Hi!

I use OCRmyPDF (e.g. Install OCRmyPDF on Linux | Snap Store), to my satisfaction.

I created a custom action for Thunar, so I can convert PDFs that way with a right click. It uses a few options to also optimise the PDF.
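A minimal sketch of such a wrapper script; the file name `ocr-pdf.sh` and the exact option set are my own choices, not necessarily what was used above. A Thunar custom action would call it with `%f` as the argument.

```shell
#!/bin/sh
# ocr-pdf.sh -- add a searchable text layer to a scanned PDF with OCRmyPDF.
# Writes <name>_ocr.pdf next to the original rather than overwriting it.
in="${1:-}"
out="${in%.pdf}_ocr.pdf"
if [ -f "$in" ] && command -v ocrmypdf >/dev/null 2>&1; then
    # -l eng+deu: English plus German; --rotate-pages/--deskew fix skewed scans;
    # --optimize 3 shrinks the output as far as OCRmyPDF allows.
    ocrmypdf -l eng+deu --rotate-pages --deskew --optimize 3 "$in" "$out"
else
    echo "usage: ocr-pdf.sh file.pdf (requires ocrmypdf)" >&2
fi
```

Note that OCRmyPDF uses Tesseract underneath, so the `deu` language pack must be installed for the German recognition to work.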

I like NAPS2; they provide an RPM. It uses Tesseract to convert PDF images to text.

The web page makes a good impression. As I understand it, NAPS2 needs a PDF image to convert, so I would have to convert my JPEG, TIFF and PNG files to PDF as a first step. Is that right?

PDF only? And what is the reference to Thunar about?

I normally use Tesseract.

In NAPS2 you can import an image file directly, no need to convert it to a PDF.

Tesseract is the underlying OCR engine used by these other apps as well, as I understand it.
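For the command-line route, a sketch of batch-converting a folder of images with Tesseract directly, assuming the `eng` and `deu` language packs are installed:

```shell
# OCR every image in the current directory; output goes to <name>.txt.
if command -v tesseract >/dev/null 2>&1; then
    for f in *.jpg *.jpeg *.png *.tif *.tiff; do
        [ -e "$f" ] || continue      # skip globs that matched nothing
        base="${f%.*}"               # output base name without the extension
        tesseract "$f" "$base" -l eng+deu
    done
else
    echo "tesseract not installed" >&2
fi
```

Appending the `pdf` config (`tesseract "$f" "$base" -l eng+deu pdf`) produces a searchable PDF instead of a plain-text file.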

Thanks… good to know.

If you are looking for convenience, I would second @rogerf’s comment. The iPhone has a Live Text feature that runs locally on the device and can be used to get the raw text.

I can’t say how well it will work on the handwritten documents; that will depend on how neat the handwriting is. There are also vision-enabled LLMs that can be used for extraction and will probably work in cases where a traditional machine-learning-based OCR engine would struggle. Even with the text extracted, I believe you will want to organise and keep the cleaned-up images of the documents as you prepare them for text extraction.

What app does the OCR on your iPhone? Are you suggesting I use ChatGPT or the like?
Anyone who has seen old parish registers and the like will know what to expect for ‘neatness’ :grimacing:

The image-to-text feature is built into the native Photos app on the iPhone. It is called Live Text and requires iOS 15 (2021) or later.

On the generative AI topic, the commercial models in the OpenAI ChatGPT and Google Gemini families support images, but I don’t have a specific recommendation for you. I think you will have to evaluate multiple models and decide what level of quality and effort is acceptable to you. Additionally, there are open-weight LLMs you can run locally. Look for models with vision or multimodal capabilities, ideally ones specialised in the image-to-text domain, since their training data may translate to better performance on OCR tasks; one example is Qwen 2.5 VL (https://getomni.ai/blog/benchmarking-open-source-models-for-ocr). In the end it comes down to the nature of the documents in the family library, your technical capability, and your budget.
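To illustrate the LLM route, a minimal Python sketch that prepares an OpenAI-style vision request for one scan. The model name `gpt-4o-mini`, the function name, and the prompt are placeholders of mine; the actual call (commented out) needs the `openai` package and an API key.

```python
import base64
from pathlib import Path

def build_ocr_request(image_path: str, model: str = "gpt-4o-mini") -> dict:
    """Build an OpenAI-style chat payload asking a vision model to transcribe an image.

    The model name is a placeholder -- check your provider's current model list.
    """
    # Vision endpoints accept images inline as base64 data URLs.
    data = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all handwritten text in this image verbatim."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{data}"}},
            ],
        }],
    }

# Sending it would look roughly like this (requires `pip install openai` and a key):
#   from openai import OpenAI
#   resp = OpenAI().chat.completions.create(**build_ocr_request("scan.jpg"))
#   print(resp.choices[0].message.content)
```

Batch use is then just a loop over the image files, with the usual caveats about cost per image and hallucinated transcriptions on hard-to-read pages.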

A traditional OCR engine would probably be cheaper and faster, provided you find one whose output is acceptable. If you prefer the open-source route, consider evaluating frameworks such as Docling and Party. Docling can coordinate multiple traditional OCR engines and has also released its own image-to-text model called SmolDocling.