baloo_file_extractor saturates cpu and disk

Yet another Benchmark:

  • Western Digital Blue 1 TB and 4 TB – both with XFS user filesystems.
  • Setup for content indexing – stopped Baloo – removed everything in ‘~/.local/share/baloo/’ – resumed Baloo.
  • New indexing began at 17:30 August 26th 2022.
  • New indexing completed at 19:12 August 26th 2022.

 > LANG=C balooctl status
Baloo File Indexer is running
Indexer state: Idle
Total files indexed: 42,964
Files waiting for content indexing: 0
Files failed to index: 0
Current size of index is 4.47 GiB
 > 

But, “journalctl --user -b 0” is showing multiple errors while indexing the content –

  • “Error: Invalid Font Weight”
  • “Invalid encoding. Ignoring «File Name»”
  • “Error: Unknown font tag ‘Helvetica’”
  • “Error: Incorrect password”
  • “Error: Illegal annotation destination”
  • “Error: Invalid least number of objects reading page offset hints table”
  • “Error: not an ICC profile, invalid signature”
  • “Error: read ICCBased color space profile error”
  • “Error: Expected the default config, but wasn’t able to find it, or it isn’t a Dictionary”
  • “Error: Mismatch between font type and embedded font file”
  • “Error: Can’t get Fields array<0a>”
  • “Error: Embedded font file may be invalid”
  • “Error: Couldn’t find trailer dictionary”

The files with “Invalid encoding” errors included –

  • Lilypond source files (Markup for musical notation)
  • .vcf Address-book export files
  • .html files
  • .txt files
  • .csv files
  • .mbox files
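
A rough way to find candidates for the “Invalid encoding” message ahead of time is to check which files fail to decode as UTF-8. This is only a sketch: it assumes the message means the extractor could not read the text as UTF-8, and the function name `list_non_utf8` is made up for illustration.

```shell
# List regular files under a directory that are not valid UTF-8.
# Assumption: "Invalid encoding" corresponds to text that is not valid UTF-8.
# Binary files are reported too, so point this at directories of text files.
# File names containing newlines are not handled.
list_non_utf8() {
    dir="${1:-.}"
    find "$dir" -type f -print | while IFS= read -r f; do
        # iconv exits non-zero when the input contains an invalid UTF-8 sequence
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

Usage would be something like `list_non_utf8 ~/Documents`; any file it prints is worth a look with `file -i` to see what encoding it really has.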

Conclusion:

  • Baloo’s content indexing is unreliable – if you’re searching for something, there’s no guarantee that the content you need has been included in the index database. :’(

Yes - That’s one of many reasons I kicked it into the long grass… :wink:

If you want file content indexing, then try Recoll; you may be quietly impressed, but do take the time to look through the documentation. So far it has never failed in any way, and I’ve been using it for 5, maybe 6 years now. KDE’s file content indexing has never, IMHO, been particularly good, be it Nepomuk (Baloo’s predecessor, if I have the name correct) or Baloo.

Glad to learn that Tumbleweed works like a charm, but I use 15.4.

BTW, since yesterday, Baloo seems to have made a huge step: 85% this afternoon and ~98% now.
It looks like Baloo was blocked by some huge file, like a boa trying to swallow an elephant (hat trick for those who recognise the Antoine de Saint-Exupéry metaphor).



**12:50:21**
LANG=C balooctl status 
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 364,453
Files waiting for content indexing: 50,569
Files failed to index: 100
Current size of index is 53.64 GiB



**18:09:27**

LANG=C balooctl status
Baloo File Indexer is running
Indexer state: Idle
Total files indexed: 368,829
Files waiting for content indexing: 4,728
Files failed to index: 100
Current size of index is 53.66 GiB

Well, I thought it would be useful, like when you look for some lost info on your phone, but I still haven’t used it yet.

How do you use the Btrfs file system to find some useful info?

What are “vault files”?

Thank you for your input. But why did openSUSE choose a bad solution when a good one is known?
Pros and cons?

Two directories are indexed: one on Btrfs, about 20 GB, and the other on ext4, about 200 GB.

Thanks. Can I use Recoll in Dolphin, like Baloo?

Some more testing of Baloo content indexing:

baloo_file_extractor[22646]: Invalid encoding. Ignoring "/home/??/Heruntergeladen/Cherry/KeyMan/cymolin-0.6.0-2/Makefile"

 > cat /home/???/Heruntergeladen/Cherry/KeyMan/cymolin-0.6.0-2/server/plugins/misc/Makefile
##################################################################
#   Copyright � 2004, Sebastian Heutling. All rights reserved.
#   $Id: Makefile,v 1.3 2004/09/11 15:03:25 basti Exp $
#
#   Descr. : Makefile for the misc plugin
#   Authors: Sebastian Heutling, sheutlin@gmx.de
##################################################################
TOP      := ../../..
include $(TOP)/config/make.tmpl
LD       :=$(CXX)
FILES    := init miscplugin
LIBNAME  := misc.so
OBJECTS  := $(foreach f, $(FILES), .$(f).o)
DEPENDS  := $(foreach f, $(FILES), .$(f).d)

INCLUDES := -I$(TOP)/server/include
CXXFLAGS += $(INCLUDES)
LDLIBS   := -L/usr/X11R6/lib -lX11 -lrt
LDFLAGS  := $(LDLIBS) -Wl,--no-undefined

.PHONY: all depend clean

all: $(LIBNAME)

$(eval $(call link_so,$(LIBNAME),$(OBJECTS),$(LDFLAGS)))
$(eval $(call compile_source,cpp,Makefile,$(CXXFLAGS)))
$(eval $(call build_depends,cpp,Makefile,$(CXXFLAGS)))
cleanobjects=$(OBJECTS) $(LIBNAME) $(DEPENDS)
$(eval $(call clean,$(cleanobjects)))
$(eval $(call distclean,$(cleanobjects)))

install:
        @$(INSTALL) -d $(DESTDIR)$(PLUGINDIR)
        @$(INSTALL_DATA) $(LIBNAME) $(DESTDIR)$(PLUGINDIR)

ifeq (,$(filter clean distclean depend install,$(MAKECMDGOALS)))
-include $(DEPENDS)
endif

 > 

Making it easy and fast: “grep” recursively for “miscplugin” in ‘~/Heruntergeladen/Cherry/’ –

  • “grep” finds the Makefile and the C++ source file, header file, init file and plugin file.
  • “Dolphin” only finds the C++ source file and header file.
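
For anyone wanting to repeat the comparison, the recursive grep can be wrapped roughly like this. It is a minimal sketch; the function name `files_containing` is made up for illustration, and the search term is treated as a fixed string, not a regular expression.

```shell
# Print, sorted, every file below a directory that contains a fixed string.
# -r: recurse, -l: print file names only, -F: fixed string (no regex).
files_containing() {
    pattern="$1"
    dir="${2:-.}"
    grep -rlF -- "$pattern" "$dir" | sort
}
```

Here that would be `files_containing miscplugin ~/Heruntergeladen/Cherry/`; the resulting list can then be compared by eye against what Dolphin returns for the same term.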

Your mileage may vary

Don’t mind the difference. I did a ‘balooctl purge’ and restarted with the following configuration:

karl@erlangen:~> cat .config/baloofilerc 
[General]
dbVersion=2
exclude filters=*~,*.part,*.o,*.la,*.lo,*.loT,*.moc,moc_*.cpp,qrc_*.cpp,ui_*.h,cmake_install.cmake,CMakeCache.txt,CTestTestfile.cmake,libtool,config.status,confdefs.h,autom4te,conftest,confstat,Makefile.am,*.gcode,.ninja_deps,.ninja_log,build.ninja,*.csproj,*.m4,*.rej,*.gmo,*.pc,*.omf,*.aux,*.tmp,*.po,*.vm*,*.nvram,*.rcore,*.swp,*.swap,lzo,litmain.sh,*.orig,.histfile.*,.xsession-errors*,*.map,*.so,*.a,*.db,*.qrc,*.ini,*.init,*.img,*.vdi,*.vbox*,vbox.log,*.qcow2,*.vmdk,*.vhd,*.vhdx,*.sql,*.sql.gz,*.ytdl,*.class,*.pyc,*.pyo,*.elc,*.qmlc,*.jsc,*.fastq,*.fq,*.gb,*.fasta,*.fna,*.gbff,*.faa,po,CVS,.svn,.git,_darcs,.bzr,.hg,CMakeFiles,CMakeTmp,CMakeTmpQmake,.moc,.obj,.pch,.uic,.npm,.yarn,.yarn-cache,__pycache__,node_modules,node_packages,nbproject,core-dumps,lost+found
exclude filters version=8
first run=false
only basic indexing=false
karl@erlangen:~> 

Got better performance (lower CPU time):

karl@erlangen:~> LANG=C balooctl status 
Baloo File Indexer is running 
Indexer state: Idle 
Total files indexed: 179,769 
Files waiting for content indexing: 0 
Files failed to index: 0 
Current size of index is 8.12 GiB 
karl@erlangen:~> 
karl@erlangen:~> LANG=C balooctl indexSize 
File Size: 8.12 GiB 
Used:      3.81 GiB 

           PostingDB:       1.36 GiB    35.591 % 
          PositionDB:       1.57 GiB    41.345 % 
            DocTerms:     850.81 MiB    21.812 % 
    DocFilenameTerms:      13.54 MiB     0.347 % 
       DocXattrTerms:            0 B     0.000 % 
              IdTree:       3.10 MiB     0.079 % 
          IdFileName:      14.36 MiB     0.368 % 
             DocTime:       7.59 MiB     0.195 % 
             DocData:       5.79 MiB     0.149 % 
   ContentIndexingDB:            0 B     0.000 % 
         FailedIdsDB:            0 B     0.000 % 
             MTimeDB:       4.48 MiB     0.115 % 
karl@erlangen:~> 

Beware of the fine print. The Baloo page on the KDE Community Wiki carries this note:

Due to a glib bug, the MIME type of HTML files can change from text/html to application/x-extension-html. The KDE file metadata extractors don’t recognize the latter. That bug has a workaround to reset the MIME types to the usual values.

I fixed the issue by running:

rm /home/karl/.local/share/mime/packages/user-extension-html.xml
update-mime-database /home/karl/.local/share/mime/

Thanks to everyone who is patient enough to follow this thread.
After another 4 days, Baloo seems to be stuck again, a few millimetres from the finish line.
The KDE system settings GUI says that indexing is finished, but it also shows 99% complete, which is quite strange.
And I have this:



**--christophe@mamachine 12:15:21 ~]** LANG=C balooctl status  
Baloo File Indexer is running 
Indexer state: Idle 
Total files indexed: 365,479 
Files waiting for content indexing: 150 
Files failed to index: 100 
Current size of index is 53.97 GiB 
**--christophe@mamachine 14:00:14 ~]** LANG=C balooctl indexSize  
File Size: 53.97 GiB 
Used:      1.95 GiB 

           PostingDB:       3.73 GiB   190.614 % 
          PositionDB:       1.10 GiB    56.479 % 
            DocTerms:       1.03 GiB    52.893 % 
    DocFilenameTerms:      27.37 MiB     1.367 % 
       DocXattrTerms:            0 B     0.000 % 
              IdTree:       3.52 MiB     0.176 % 
          IdFileName:      32.56 MiB     1.627 % 
             DocTime:      15.69 MiB     0.784 % 
             DocData:       3.25 MiB     0.163 % 
   ContentIndexingDB:       4.00 KiB     0.000 % 
         FailedIdsDB:       4.05 MiB     0.202 % 
             MTimeDB:       6.45 MiB     0.322 %

In the first command we see that, just as before I purged and re-initialized the index, there are 100 files that Baloo cannot swallow.

In the second (“indexSize”) command, the figure of 190% is something I find hard to understand…
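
One observation on the 190%, offered as arithmetic and not as an explanation: in both indexSize reports the percentage column matches each database’s size divided by the “Used” figure, not by “File Size”. In the 53.97 GiB report, the three largest entries alone (3.73 + 1.10 + 1.03 GiB) add up to more than the 1.95 GiB shown as “Used”, which is how a share above 100% comes out. Why “Used” is that low is not something the output tells us. A small sketch to redo the division (the helper name `pct_of_used` is made up):

```shell
# Percentage of one database relative to the "Used" figure, one decimal place.
# Inputs are the rounded GiB values printed by balooctl, so the result differs
# slightly from the percentage balooctl prints.
pct_of_used() {
    awk -v part="$1" -v used="$2" 'BEGIN { printf "%.1f\n", 100 * part / used }'
}

pct_of_used 3.73 1.95   # PostingDB in the 53.97 GiB report, prints 191.3
pct_of_used 1.36 3.81   # PostingDB in the 8.12 GiB report, prints 35.7
```

Both results are close to the reported 190.614 % and 35.591 %; the small gap is consistent with the GiB values being rounded to two decimals.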

And I still have baloo_file at 100% CPU, pushing my laptop’s temperature above 70°C with the fans at full speed. :frowning:

I think I will give Baloo a few more days to quieten down.
Otherwise, I will turn it off.

It’s sad, but I had to turn it off.
Nevertheless, I like the idea of finding a lost file with a tool like Baloo.
Could anyone teach me how to use Btrfs to quickly find a lost file?

You’re using KDE Plasma –

  • The Dolphin search function for content works just fine with Baloo turned off – the (wildcard) search for file names works a little better if Baloo indexes the file names only.

Apart from that, there are the filesystem-independent command-line search tools, which have tended to work better than almost anything else for many, many years –


 > find . -iname '*«part of file name»*'
 > find «Directory Path» -iname '*«part of file name»*' -print
 > grep -iER '«a string»|«another string»|«yet another string»' .
 > grep -iER '«a string»|«another string»|«yet another string»' «Directory Path»