Tesseract-ocr: wrong data directory

My tesseract 5.3.0-1.1 installation from the main repository keeps its data files in “/usr/share/tessdata”, as I can see in the file system. However, the “tesseract” command expects them in /usr/share/tesseract-ocr/tessdata/:

        tesseract --list-langs    
        List of available languages in "/usr/share/tesseract-ocr/tessdata/

The default directory seems to be compiled into tesseract, according to the spec file by -DTESSDATA_PREFIX=%{_datadir}/%{name}

Something is wrong here. I cannot call tesseract --list-langs without setting the env var TESSDATA_PREFIX to /usr/share/tessdata. Is this just an issue with my installation, or is this a packaging issue with tesseract?
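
One way to confirm the mismatch is to check which of the two directories actually contains .traineddata files. This is only a minimal sketch: the helper name has_traineddata is made up for illustration, and the two paths are simply the ones mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given directory contains at least
# one *.traineddata file.
has_traineddata() {
    for f in "$1"/*.traineddata; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Report the state of the two directories discussed in this thread.
for dir in /usr/share/tessdata /usr/share/tesseract-ocr/tessdata; do
    if has_traineddata "$dir"; then
        echo "$dir: traineddata files found"
    else
        echo "$dir: no traineddata files"
    fi
done
```

If only /usr/share/tessdata reports files while tesseract --list-langs names the other directory, the data package and the binary disagree about the location.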

It’s all described:

https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md

Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into a ‘tessdata’ directory. The exact directory will depend both on the type of training data, and your Linux distribution. Possibilities are /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata or /usr/share/tesseract-ocr/4.00/tessdata.

https://github.com/tesseract-ocr/tessdoc/blob/main/Compiling-%E2%80%93-GitInstallation.md

If you want to put the traineddata files in a different directory than the directory that was defined during installation i.e. /usr/local/share/tessdata then you need to set a local variable called TESSDATA_PREFIX to point to the tesseract tessdata directory.
Ex: on Linux Ubuntu, modify your ~/.bashrc file by adding the following to the bottom of it. Modify the path according to your situation:
export TESSDATA_PREFIX="/home/$USER/Downloads/tesseract/tesseract-4.1.0/tessdata"
Then, close and re-open your terminal for it to take effect, or just call . ~/.bashrc or source ~/.bashrc (same thing) for it to take effect immediately in your current terminal.
Place any language training data you need into this tessdata folder as well. For example, the English one is called eng.traineddata. Download it from the tessdata repository here, and move it to your tessdata directory you just specified in your TESSDATA_PREFIX variable above.

Also see usage examples:
https://github.com/tesseract-ocr/tessdoc/blob/main/Command-Line-Usage.md

osd.traineddata, for Orientation and Segmentation and eng.traineddata and other language data files for English should be in the “tessdata” directory. TESSDATA_PREFIX environment variable should be set to the parent directory of “tessdata” directory.
The following command would give the same result as above, if eng.traineddata and osd.traineddata files are in /usr/share/tessdata directory.
tesseract --tessdata-dir /usr/share imagename outputbase -l eng --psm 3

@susekusi Hi, I just export the prefix via my shell config… export TESSDATA_PREFIX='/usr/share/tessdata/'
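
If exporting the variable for the whole session is not wanted, it can also be set for a single command only. A minimal sketch, assuming the data files live in /usr/share/tessdata as in this thread; with_tessdata is a hypothetical wrapper name, not something tesseract provides.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with TESSDATA_PREFIX set only for
# that command, leaving the calling shell's environment unchanged.
with_tessdata() {
    dir=$1
    shift
    env TESSDATA_PREFIX="$dir" "$@"
}

# Example call (requires tesseract to be installed):
#   with_tessdata /usr/share/tessdata tesseract --list-langs
```

Using env keeps the assignment scoped to the child process, so nothing leaks into the shell that called the wrapper.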

I figured out how to work around the issue by setting TESSDATA_PREFIX. However, I’d expect this to work automatically. The spec file sets the path to the data dir in the build process, but the path %{_datadir}/%{name} seems to be wrong.

It’s not wrong. See the official documentation I linked…

Hi @hui. Looking at it again, I am quite sure the openSUSE package is buggy. The openSUSE section in your linked document is outdated. It still points to some private repo, and the language packages have the wrong names (tesseract-ocr-traineddata-german instead of tesseract-ocr-traineddata-deu).

I checked the spec file of tesseract-ocr-traineddata and found the following relevant install section:

mkdir -p %{buildroot}/%{_datadir}/tessdata/
cp -a *.traineddata %{buildroot}/%{_datadir}/tessdata/

Now, the relevant part in the tesseract-ocr spec file:

%build
%cmake -DCMAKE_INSTALL_LIBDIR=%{_lib} -DTESSDATA_PREFIX=%{_datadir}/%{name}
%cmake_build

In my opinion, the tessdata folders of the two don’t match. I assume the latter should be something like:
%cmake -DCMAKE_INSTALL_LIBDIR=%{_lib} -DTESSDATA_PREFIX=%{_datadir}/tessdata

I assume %{name} resolves not to “tessdata” but to “tesseract-ocr”.
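
To make the mismatch concrete, the two spec-file paths can be expanded by hand. This sketch assumes %{_datadir} expands to /usr/share (consistent with the paths seen above) and that %{name} is tesseract-ocr. The shell variable names are illustrative only, and the appended tessdata/ component is inferred from the --list-langs output.

```shell
#!/bin/sh
# Assumed macro values; on a real system `rpm --eval '%{_datadir}'` shows
# the actual expansion.
_datadir=/usr/share
name=tesseract-ocr

# Prefix compiled into the binary: %{_datadir}/%{name}. Judging by the
# --list-langs output, tesseract then looks in a tessdata/ subdirectory.
compiled_dir="$_datadir/$name/tessdata"

# Path the traineddata package installs to: %{_datadir}/tessdata
installed_dir="$_datadir/tessdata"

echo "tesseract looks in:    $compiled_dir"
echo "data files install to: $installed_dir"
```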

Due to the mismatch, we have to set TESSDATA_PREFIX as proposed by @malcolmlewis

@susekusi It actually just needs %{_datadir}. Anyway, I submitted Request 1067857 (Submit tesseract-ocr) on the openSUSE Build Service; it still needs some more fixes for Leap and the missing -lpthread…

Nothing really to add, but I thought it might be of interest that, although setting TESSDATA_PREFIX worked in previous installations of Leap, it stubbornly refused to have any effect on my current Tumbleweed install. Tesseract was adamant it wanted the traineddata files in /usr/share/tesseract-ocr/tessdata/, so I added the extra directory manually, and now it’s happy. Malcolm’s proposal appreciated.
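
As an alternative to creating the extra directory and copying files into it, a symlink should have the same effect while keeping a single copy of the data. A minimal sketch, assuming the paths from this thread; link_tessdata is a hypothetical helper, and on a real system it would need root privileges.

```shell
#!/bin/sh
# Hypothetical helper: make <expected> a symlink to <actual> unless
# something already exists at <expected>.
link_tessdata() {
    actual=$1
    expected=$2
    if [ -e "$expected" ] || [ -L "$expected" ]; then
        echo "$expected already exists, leaving it alone" >&2
        return 1
    fi
    # Create the parent directory, then the link itself.
    mkdir -p "$(dirname "$expected")" && ln -s "$actual" "$expected"
}

# Example call (as root):
#   link_tessdata /usr/share/tessdata /usr/share/tesseract-ocr/tessdata
```

The helper refuses to touch an existing path, so it will not overwrite a directory that was already created by hand.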

Perfect, makes sense. Thanks a lot @malcolmlewis :grinning: