My tesseract 5.3.0-1.1 installation from the main repository puts its data in “/usr/share/tessdata”, as I can see in the file system. However, the “tesseract” command expects the data directory to be /usr/share/tesseract-ocr/tessdata/:
tesseract --list-langs
List of available languages in "/usr/share/tesseract-ocr/tessdata/
According to the spec file, the default directory seems to be compiled into tesseract via -DTESSDATA_PREFIX=%{_datadir}/%{name}
Something is wrong here. I cannot call tesseract --list-langs without setting the env var TESSDATA_PREFIX to /usr/share/tessdata. Is this just an issue with my installation, or is this a packaging issue with tesseract?
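A quick way to see which of the candidate directories really holds the data is sketched below. The function name find_tessdata is made up for illustration; the two paths are the ones mentioned in this post.

```shell
#!/bin/sh
# Print every candidate directory that actually contains *.traineddata files.
# find_tessdata is a hypothetical helper; pass the directories to probe as arguments.
find_tessdata() {
    for dir in "$@"; do
        # ls fails when the glob matches nothing, so missing or empty
        # directories are skipped silently.
        if ls "$dir"/*.traineddata >/dev/null 2>&1; then
            printf '%s\n' "$dir"
        fi
    done
}

# The two paths discussed in this thread.
find_tessdata /usr/share/tessdata /usr/share/tesseract-ocr/tessdata
```

If only /usr/share/tessdata is printed while tesseract --list-langs reports the other path, the package and the binary disagree about the location.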
Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into a ‘tessdata’ directory. The exact directory will depend both on the type of training data, and your Linux distribution. Possibilities are /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata or /usr/share/tesseract-ocr/4.00/tessdata.
If you want to put the traineddata files in a different directory than the one that was defined during installation (e.g. /usr/local/share/tessdata), then you need to set an environment variable called TESSDATA_PREFIX to point to the tesseract tessdata directory.
Ex: on Linux Ubuntu, modify your ~/.bashrc file by adding the following to the bottom of it. Modify the path according to your situation: export TESSDATA_PREFIX="/home/$USER/Downloads/tesseract/tesseract-4.1.0/tessdata"
Then, close and re-open your terminal for it to take effect, or just call . ~/.bashrc or source ~/.bashrc (same thing) for it to take effect immediately in your current terminal.
Place any language training data you need into this tessdata folder as well. For example, the English one is called eng.traineddata. Download it from the tessdata repository and move it to the tessdata directory you just specified in your TESSDATA_PREFIX variable above.
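As a rough sketch of the two ways to apply the variable: set it for a single command, or export it for the whole shell session. The path is the one from this thread, not a general default; adjust it to wherever your .traineddata files live.

```shell
#!/bin/sh
# Path taken from this thread; adjust to your own tessdata directory.
TESSDATA_DIR=/usr/share/tessdata

# One-off: the variable applies to this single command only.
if command -v tesseract >/dev/null 2>&1; then
    TESSDATA_PREFIX="$TESSDATA_DIR" tesseract --list-langs \
        || echo "tesseract could not list languages in $TESSDATA_DIR" >&2
fi

# Persistent for the current shell and every program started from it.
export TESSDATA_PREFIX="$TESSDATA_DIR"
```

The one-off form is handy for testing a path before writing it into ~/.bashrc.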
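osd.traineddata (for orientation and script detection), eng.traineddata (for English) and any other language data files should be in the “tessdata” directory. TESSDATA_PREFIX environment variable should be set to the parent directory of “tessdata” directory. Note that this describes older releases; with the 5.3.0 package discussed in this thread, TESSDATA_PREFIX is set to the “tessdata” directory itself (/usr/share/tessdata).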
The following command would give the same result as above, if eng.traineddata and osd.traineddata files are in /usr/share/tessdata directory. tesseract --tessdata-dir /usr/share imagename outputbase -l eng --psm 3
I figured out how to fix the issue by setting TESSDATA_PREFIX. However, I’d expect this to work automatically. The spec file sets the path to the data dir in the build process, but the path %{_datadir}/%{name} seems to be wrong.
Hi @hui. Looking at it again, I am quite sure the opensuse package is buggy. The opensuse section in your linked document is outdated. It still points to some private repo, plus the language packages have wrong names (tesseract-ocr-traineddata-german instead of tesseract-ocr-traineddata-deu).
In my opinion, the tessdata folders of the two don’t match. I assume the latter should be something like
%cmake -DCMAKE_INSTALL_LIBDIR=%{_lib} -DTESSDATA_PREFIX=%{_datadir}/tessdata
I assume %{name} doesn’t resolve to “tessdata”, but “tesseract-ocr”
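To check that assumption, the macro expression can be expanded outside of a build with rpm --eval. This is only a sketch: %{name} exists only inside a spec file, so it is defined by hand here, and “tesseract-ocr” is my guess from above, not a value read from the actual spec. The helper name expand_prefix is made up.

```shell
#!/bin/sh
# Expand the spec-file expression %{_datadir}/%{name} for a given package name.
# expand_prefix is a hypothetical helper; $1 is the assumed value of %{name}.
expand_prefix() {
    rpm --define "name $1" --eval '%{_datadir}/%{name}'
}

# Only run the expansion where rpm is installed.
if command -v rpm >/dev/null 2>&1; then
    expand_prefix tesseract-ocr
fi
```

If the printed path ends in /tesseract-ocr rather than /tessdata, that would match the directory tesseract complains about.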
Due to the mismatch, we have to set TESSDATA_PREFIX as proposed by @malcolmlewis
Nothing really to add but I thought it might be of interest that, having successfully run the TESSDATA_PREFIX command in previous installations of Leap, it stubbornly refused to have any effect on my current Tumbleweed install. Tesseract was adamant it wanted the traindata packages in /usr/share/tesseract-ocr/tessdata/ so I added the extra directory manually, and now it’s happy. Malcolm’s proposal appreciated.
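For anyone who prefers a link over copying files, here is a minimal sketch of that kind of workaround. link_tessdata is a made-up helper, and whether a symlink or a real directory suits your system better is your call; for the system paths it has to be run as root.

```shell
#!/bin/sh
# Make the directory tesseract expects point at the directory the package filled.
# link_tessdata is a hypothetical helper, not part of any package.
link_tessdata() {
    actual=$1      # where the .traineddata files really are
    expected=$2    # where the tesseract binary looks for them
    # Do nothing if something (even a dangling link) is already there.
    if [ -e "$expected" ] || [ -L "$expected" ]; then
        echo "$expected already exists, leaving it alone" >&2
        return 0
    fi
    mkdir -p "$(dirname "$expected")" && ln -s "$actual" "$expected"
}

# Example for the paths in this thread (run as root):
# link_tessdata /usr/share/tessdata /usr/share/tesseract-ocr/tessdata
```

A later package update that fixes the path may conflict with the link, so remove it again once the package is corrected.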