Random CPU spike that causes system crash

bonedriven · December 16, 2021, 3:03pm

It’s a rather new installation of Leap 15.3. It happens randomly, sometimes many times a day, sometimes it’s only once a day.

Sometimes the CPU spike subsides after about 15 seconds. Most of the time it slows everything down until the whole system freezes. I can’t see any pattern in /var/log/messages around the time of a spike.
Many times it coincides with the snapper service starting, but some freezes don’t involve snapper. BTW, I don’t know why the snapper timeline service runs every hour although it is switched off in the config file (in /etc/snapper/configs/root, TIMELINE_CREATE=“no”). Many freezes also seem to happen while watching a YouTube video, but I can’t be sure.

Baloo/search is disabled.

Any suggestion to diagnose?

Svyatko · December 16, 2021, 5:23pm

Need more info.

dcurtisfra · December 16, 2021, 5:53pm

Have you disabled the systemd Journal?

bonedriven · December 17, 2021, 2:00pm

I know. I am just trying to understand where to look.

sudo journalctl -b -p err

has nothing.

Yast2->systemd journal can only show the current boot. The “from date ** to date **” doesn’t work and I don’t know why.

I personally find it harder to diagnose errors on Linux than on Windows. There are so many kinds of logs everywhere: /var/log/messages, the systemd log, the X log, the DE log.
I remember that in Windows I could just look at the system events.

dcurtisfra · December 17, 2021, 6:49pm

Ahemmm …

Your view was almost true before systemd arrived on the scene – and yes, the logs are collected in the ‘/var/log/’ directory tree – only in one place – one directory tree …

However, since systemd arrived on the scene, that has changed – dramatically –

For 99.x % of all system administration cases, logging is located in the systemd Journal – and, it’s accessed by “journalctl”. (Full Stop) …

Please read the “journalctl” man page.

The options you’ll mostly need are –
“--output=short-monotonic”

Shows monotonic timestamps instead of wallclock timestamps.

“--no-hostname”

Don’t show the hostname field of log messages originating from the local host. This switch has an effect only on the short family of output modes.

“--list-boots”

Show a tabular list of boot numbers (relative to the current boot), their IDs, and the timestamps of the first and last message pertaining to the boot.

“--boot=[[ID][±offset]|all]”

Show messages from a specific boot.

“--disk-usage”

Shows the current disk usage of all journal files. This shows the sum of the disk usage of all archived and active journal files.

“--verify”

Check the journal file for internal consistency.

“--vacuum-size=, --vacuum-time=, --vacuum-files=”

Removes the oldest archived journal files until the disk space they use falls below the specified size (specified with the usual “K”, “M”, “G” and “T” suffixes), or all archived journal files contain no data older than the specified timespan (specified with the usual “s”, “m”, “h”, “days”, “months”, “weeks” and “years” suffixes), or no more than the specified number of separate journal files remain.
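
To show how a few of these options combine, here is a minimal Python sketch that only assembles a journalctl command line for one boot. The function name and its defaults are made up for illustration; “--priority” and “--no-pager” are journalctl options that are not in the list above. Note that messages from earlier boots are only available if the journal is stored persistently (for example in /var/log/journal).

```python
import shlex


def journalctl_command(boot_offset=-1, priority="err"):
    """Build a journalctl argument list for one boot, filtered by priority.

    Hypothetical helper: boot_offset 0 is the current boot, -1 the previous one.
    """
    return [
        "journalctl",
        f"--boot={boot_offset}",
        f"--priority={priority}",     # e.g. "err" or "warning"
        "--output=short-monotonic",   # monotonic instead of wallclock timestamps
        "--no-hostname",
        "--no-pager",
    ]


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(journalctl_command()))
```

Running the printed command as “root”, or as a member of one of the groups mentioned below, should list the error-level messages of the boot that ended in the freeze, if any were written.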

Please note that, there are several classes of Journals – the main classes are, the system Journal and, the per-User Journals – normally only the user “root” can access the system Journal but, Users who are members of the “systemd-journal” Group or, the “adm” Group or, the “wheel” Group also have read access to the system Journal. Only Users who are members of the “wheel” Group can also perform systemd Journal maintenance tasks.
My personal view is that, the Linux systemd Journal provides more information in a more understandable fashion, than the Windows system events log.

That may well be due to, the fact that the systemd Journal entries are written by the developers, without any management or marketing filters with commercial interest being applied …

dcurtisfra · December 17, 2021, 6:55pm

Yes, I know, even in this Forum, there is advice, for the case of KDE Plasma, to disable Baloo.

I, personally, disagree.


 > LANG=C balooctl status
Baloo File Indexer is running
Indexer state: Idle
Total files indexed: 159,813
Files waiting for content indexing: 0
Files failed to index: 0
Current size of index is 108.79 MiB
 > 
 > LANG=C balooctl indexSize
File Size: 108.79 MiB
Used:      61.69 MiB

           PostingDB:      10.16 MiB    16.476 %
          PositionDB:      11.11 MiB    18.008 %
            DocTerms:       9.76 MiB    15.823 %
    DocFilenameTerms:       9.09 MiB    14.741 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       2.61 MiB     4.236 %
          IdFileName:      10.23 MiB    16.590 %
             DocTime:       6.64 MiB    10.764 %
             DocData:            0 B     0.000 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       2.07 MiB     3.362 %
 >

Svyatko · December 17, 2021, 7:01pm

I meant

inxi -aFz

or other stuff.

dcurtisfra · December 17, 2021, 7:12pm

It seems that, at least your system partition is using the Btrfs file system.

Possibly some Btrfs housekeeping hasn’t been executed – with the user “root” –

Please check the status of the following systemd services – “btrfs-balance.timer”, “btrfs-scrub.timer”, “btrfs-trim.timer” and “btrfs-defrag.timer”.
If they’re disabled, enable them.
Manually start the “btrfs-balance.service” – and, wait for it to complete execution – check with “systemctl status btrfs-balance.service”.
Manually start the “btrfs-scrub.service” – and, wait for it to complete execution – check with “systemctl status btrfs-scrub.service”.
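
The timer check above can also be scripted. The following is only a sketch, assuming the four timer unit names quoted in this post exist on the system; it uses “systemctl is-enabled”, and both function names are invented for the example.

```python
import subprocess

# Timer units named in the post above.
BTRFS_TIMERS = [
    "btrfs-balance.timer",
    "btrfs-scrub.timer",
    "btrfs-trim.timer",
    "btrfs-defrag.timer",
]


def timer_states(timers=BTRFS_TIMERS):
    """Ask systemd for the enablement state of each timer (hypothetical helper)."""
    states = {}
    for timer in timers:
        result = subprocess.run(
            ["systemctl", "is-enabled", timer],
            capture_output=True, text=True, check=False,
        )
        # Empty output usually means systemd does not know the unit.
        states[timer] = result.stdout.strip() or "unknown"
    return states


def not_enabled(states):
    """Return the timers whose state is anything other than 'enabled'."""
    return [name for name, state in states.items() if state != "enabled"]
```

Calling `not_enabled(timer_states())` returns the timers that still need attention; enabling them is then done with “systemctl enable --now” as “root”.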

Snapper usually waits for any Btrfs housekeeping to complete – a conflict is unlikely.

But, it may well be that, Snapper is having trouble claiming enough storage space to do its thing – with the user “root” – “snapper --iso list” and “btrfs qgroup show -p /” …

Documentation is here – “System recovery and snapshot management with Snapper” (Reference, openSUSE Leap 15.5).
And here – “System recovery and snapshot management with Snapper” (Reference, openSUSE Leap 15.5).
And, in the FAQs – “System recovery and snapshot management with Snapper” (Reference, openSUSE Leap 15.5):
“Why does Snapper never show changes in /var/log, /tmp and other directories?”

larryr · December 17, 2021, 8:36pm

If you want a more Windows-like openSUSE 15.3, try the GeckoLinux MATE or XFCE versions. They can be tried from USB or CD and installed from there.
I find Btrfs and Plasma too hard for Windows users; they add all sorts of issues not found with ext4 and MATE, which means no snapshots but fewer problems.

Link to gecko. https://geckolinux.github.io/

The gecko site will show you what each desktop looks like - updates are standard openSUSE 15.3.

My 2 cents.

dcurtisfra · December 18, 2021, 4:46pm

I’m also not a great Snapshot fan on small single drive (HDD or SSD) systems –

On my one-drive QNAP NAS I’ve finally chosen a single “thick” volume with no snapshots …
I could have chosen a “static” volume with no Storage Pool but, the QNAP procedures more or less “force” one to choose a Storage Pool with either “thick” or “thin” volumes.
QNAP doesn’t use Btrfs but, they’ve implemented their own snapshots designed to be used with RAID and Storage Pools – in other words, for the case of super redundant physical storage – which I don’t have (with only one HDD in the box) …

mchnz · December 19, 2021, 1:34am

Ahemmm …

For a desktop system things are still a bit messy. There is still /home/username/.local/share/sddm/xorg-session.log and /var/log/Xorg.0.log. In this particular case, if the crash has been caused by a desktop issue, such as a graphics driver hiccup, some clues might be found there rather than in the journal, so it pays to check all of them.

In the case of /var/log/Xorg.0.log, after a restart, there should be a /var/log/Xorg.0.log.old. I think /home/username/.local/share/sddm/xorg-session.log will be overwritten on login, so login as someone else, or login to a text console.
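
As a rough aid for reading those files, here is a small Python sketch that pulls only the error and warning lines out of an Xorg log. It relies on the Xorg convention of tagging errors with (EE) and warnings with (WW); the function names are invented for this example, and the legend line near the top of the log is skipped because it does not start with a marker.

```python
import re
from pathlib import Path

# A problem line starts with an optional "[  12.345]" timestamp, then (EE) or (WW).
PROBLEM = re.compile(r"^(?:\[\s*[\d.]+\]\s+)?\((EE|WW)\)")


def xorg_problem_lines(text):
    """Return the lines of an Xorg log that are tagged as errors or warnings."""
    return [line for line in text.splitlines() if PROBLEM.match(line)]


def scan_log(path):
    """Read one log file and return its problem lines, or [] if it is missing."""
    log = Path(path)
    if not log.is_file():
        return []
    return xorg_problem_lines(log.read_text(errors="replace"))
```

After a freeze and restart, `scan_log("/var/log/Xorg.0.log.old")` would be the call to try first, since that file belongs to the session that hung.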

Note also that if the system has crashed, the journal and other log files may be incomplete and lack anything helpful. If only the desktop is hung, an ssh in from another machine/phone/tablet would be worth a go.

From the description of the problem, I have in the past seen such behaviour from GPU/browser interactions, in those cases disabling GPU-acceleration in the browser was effective.

I think Linux is a bit of a mess in these situations. The official KDE/gnome projects neglect to include an official comprehensive log viewer that brings together all the logs. Neither includes a GUI for the systemd-journal. The Linux desktop is a fair-weather friend.

bonedriven · December 19, 2021, 12:17pm

I seem to have figured out the freeze issue.

My diagnosis got messed up because there were probably two separate causes behind it.

I had much more frequent freezes earlier, until I disabled the “disk quota” widget that I had enabled just a few days before. The widget had asked me to install a package in order to use it, but I had ignored that.

After I disabled the widget I still got freezes, but much more rarely.

During the last crash I started to notice that 16GB of memory was also full, along with 100% cpu usage. So my current conclusion is that a program might have a memory leak problem (or it cranks up memory usage fast), and I already have a suspect.
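
To check the memory suspicion with numbers rather than by eye, something like the following Python sketch could be used. It parses the text of /proc/meminfo and reports how much RAM is no longer available; the function names are hypothetical, and it assumes a kernel that reports the MemAvailable field.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_fraction(info):
    """Fraction of RAM that is not available for new allocations (0.0 to 1.0)."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def current_memory_used_fraction(path="/proc/meminfo"):
    """Read the live values; the path is a parameter so a saved copy can be used."""
    with open(path) as handle:
        return memory_used_fraction(parse_meminfo(handle.read()))
```

Logging `current_memory_used_fraction()` once a minute to a file would show whether memory climbs steadily before a freeze, which is what a leak in the suspected program would look like.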

So now I’m careful not to run too many programs at the same time, and there hasn’t been a freeze since.
Thank you for helping! I’ve learned quite a lot from the thread… To begin with, I’ll see if baloo/search is good to use today…

dcurtisfra · December 20, 2021, 7:34pm

Linux has been developed from the UNIX® view that, either the hardware is OK or, it isn’t …

This has led to the situation that you have pointed out that, there are several places to search for an indication as to what is going on.
On the other hand, even with complex Real-Time systems such as mobile telephony base stations – the boxes attached to the antennas – and other components in telephony networks and, things such as radar systems – nobody really makes much effort to condense the system behavioural reports into a single consistent journal.

DEC came close to this “ideal” with VAX/VMS but, even that wasn’t 100 % perfect and, they had the advantage that, it was/is a proprietary OS running on their own hardware and, nobody else’s – their 16-bit OSs had only rudimentary system journaling, if at all – ditto their 12-bit and 18-bit OSs – their 36-bit OSs were somewhat better and paved the way for the 32-bit VAX OS …

Another point is, that when a system crash or power failure occurs, efforts are made to write as much “useful” information to the system’s journal but, such efforts are often doomed to failure due to the system’s “last dying gasp” … In fact, the only reliable crash information is, that of a process crashing in an otherwise healthy system – 99.1678 % of the time, the resulting crash dump is useful enough to be used as a starting point for the analysis as to why the thing “turned belly up” – even if only a garbled entry got written to the system journal …

dcurtisfra · December 20, 2021, 7:53pm

Normally, provided the applications have been reasonably well written – rubbish code makes itself visible by the means of unexpected system behaviour – UNIX® and Linux tend to manage system resources in a sensible fashion – at the expense of the system slowing down when placed under load but, never simply “expiring” …

Bottom line – just how many processes running on any given Linux box is “too many” processes?

You’ll need to read the openSUSE “System Analysis and Tuning Guide” – <https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/book-tuning.html> – to begin to understand what’s happening.

Then, you can move on to a tool such as “Nagios” to remotely monitor the system in question, to come to a reasonable decision as to why the system’s resources are being exhausted …

mchnz · December 21, 2021, 2:05am

This approach to the problem is probably most appropriate for a server. It would be nice if there were something simpler on the desktop that just raised a notification if any process consumes an unusual amount of CPU or memory. That way the desktop user has the option of manually investigating such issues before they get out of hand. This won’t deal with every situation, but the common desktop situation is that one program is seriously offending.

I confess I am biased here, because I’ve been experimenting with such an approach (which I’ve previously described in other posts https://forums.opensuse.org/showthread.php/561962-Jouno-a-different-way-of-tracking-system-activity?p=3087111#post3087111).

malcolmlewis · December 21, 2021, 4:17am

Hi
I just use conky… for servers have always used nagios, sec and snmp…

dcurtisfra · December 21, 2021, 11:36am

There’s a KDE Plasmoid named “System Load Viewer” or something similar –

It displays the CPU usage (optionally each core individually), memory usage, swap usage and, cache usage – either as bars or rings.

Svyatko · December 21, 2021, 7:17pm

KDE ksysguard.

mchnz · December 21, 2021, 8:54pm

Yes, I use the new KDE System Monitor, and KSysGuard before it. But they’re passive instruments, you have to keep an eye on them.

I was suggesting something less passive that raises desktop notifications before things get out of hand. I’m experimenting with a Python script that raises a notification if any process burns high amounts of CPU or grows its RSS continuously (one notification per incident). It’s hard to overlook a notification. Plus I don’t have to watch anything, the script watches for me. I’ve mentioned it before in other threads, most recently in https://forums.opensuse.org/showthread.php/564181-Seasons-greetings-from-under-the-process-tree?p=3093037#post3093037, where in the linked video the script raises a notification about 30 seconds in. Normally I leave it minimised in the system tray, and only consult its GUI when I see a notification.
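
For readers who want to try the idea, the “one notification per incident” logic might look roughly like the sketch below. This is not the script described above, just an illustration under assumptions: the class name, the thresholds and the reason strings are all invented, and the caller is expected to supply the samples (from /proc or a library such as psutil) and to display the result, for example with notify-send.

```python
from collections import defaultdict, deque


class ProcessWatcher:
    """Flag high CPU or continuously growing RSS, once per incident."""

    def __init__(self, cpu_limit=90.0, growth_samples=5):
        self.cpu_limit = cpu_limit
        self.growth_samples = growth_samples
        # Last few RSS readings per PID.
        self.rss_history = defaultdict(lambda: deque(maxlen=growth_samples))
        # (pid, reason) pairs that have already been reported.
        self.active = set()

    def update(self, pid, cpu_percent, rss_bytes):
        """Record one sample and return the reasons that newly became true."""
        history = self.rss_history[pid]
        history.append(rss_bytes)
        values = list(history)
        growing = (
            len(values) == self.growth_samples
            and all(a < b for a, b in zip(values, values[1:]))
        )
        conditions = {
            "high-cpu": cpu_percent >= self.cpu_limit,
            "rss-growth": growing,
        }
        new_reasons = []
        for reason, triggered in conditions.items():
            key = (pid, reason)
            if triggered and key not in self.active:
                self.active.add(key)
                new_reasons.append(reason)
            elif not triggered:
                # Incident over; a later recurrence is reported again.
                self.active.discard(key)
        return new_reasons
```

Because an incident is remembered until its condition clears, a process that stays at full CPU for ten minutes produces one notification rather than one per sample.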

The plasmoids could be enhanced to raise notifications when thresholds are crossed (CPU, power, ram, I/O, …). I might suggest that in the KDE forums, presumably the plasmoids are written in C++ and would be more efficient than python.

dcurtisfra · December 22, 2021, 6:04pm

No – they’re mostly written in QML, with some written in JavaScript –

Python and C++ used to be used but, that no longer seems to be the case – please search ‘/usr/share/plasma/plasmoids/’ …

The QML of interest for the System Monitor Plasmoids is located in –


/usr/share/plasma/plasmoids/org.kde.plasma.systemloadviewer/
/usr/share/plasma/plasmoids/org.kde.plasma.systemmonitor.cpu/
/usr/share/plasma/plasmoids/org.kde.plasma.systemmonitor.diskactivity/
/usr/share/plasma/plasmoids/org.kde.plasma.systemmonitor.diskusage/
/usr/share/plasma/plasmoids/org.kde.plasma.systemmonitor.memory/
/usr/share/plasma/plasmoids/org.kde.plasma.systemmonitor.net/

Looking at the QML, yes, it may well be that, adding some alarms wouldn’t involve too much effort but, the amount of system resources needed to support the alarms may well be an additional issue.