System suddenly hung

Hi,

I was typing some text and there was an empty browser window open on the side. Suddenly KDE Plasma froze. The system stopped responding to any key combination. The only way out was a cold reboot.

Looking at journalctl I see the last messages are:


Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: fifo: channel 5: killed
Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: chromium[5299]: channel 5 killed!
-- Reboot --
...

What is this and why is it happening?
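
In case it helps anyone compare notes, here is a small Python sketch that pulls the nouveau kernel lines out of a saved journal. It assumes the text has the same layout as the excerpt above; the function name nouveau_events and the regular expression are only my illustration, not part of any nouveau or systemd tooling.

```python
import re

# Matches kernel lines such as:
#   Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: fifo: channel 5: killed
# The pattern is an assumption based on the excerpt above, not an official format.
NOUVEAU_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"nouveau\s+(?P<pci>[0-9a-f:.]+):\s+(?P<msg>.*)$"
)


def nouveau_events(journal_text):
    """Return (timestamp, pci_address, message) for each nouveau kernel line."""
    events = []
    for line in journal_text.splitlines():
        m = NOUVEAU_LINE.match(line.strip())
        if m:
            events.append((m["stamp"], m["pci"], m["msg"]))
    return events


sample = """\
Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Sep 17 11:37:32 i7 kernel: nouveau 0000:01:00.0: chromium[5299]: channel 5 killed!
-- Reboot --
"""

for stamp, pci, msg in nouveau_events(sample):
    print(stamp, pci, msg)
```

Note that all the lines carry the same one-second timestamp, so their order in the journal is the only hint about which came first.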

Nasty! This may be of interest to you…
https://bbs.archlinux.org/viewtopic.php?id=239485

In particular…

Start chromium with the --disable-gpu flag. There are some well known problems with the way chromium uses the GPU and nouveau: https://bbs.archlinux.org/viewtopic.php?id=235968

Nouveau is in a difficult position here: its developers don’t receive official support from Nvidia and have to reverse-engineer most of the functionality.

Thanks for the info. Hm. I have been running nouveau and chromium since late July but this has never happened before. In fact I even have the flag ignore-gpu-blacklist set to ‘enabled’ and it has not caused any problems so far. Are you sure the two things are related, or is the killing of chromium merely a side effect?

I searched a bit and found this:

https://bugs.freedesktop.org/show_bug.cgi?id=100567

but unfortunately it doesn’t give me the answer as I don’t understand everything written.

Hard to know. So far, for you it would seem, it’s only occurred with the Chromium browser running?

I searched a bit and found this:

https://bugs.freedesktop.org/show_bug.cgi?id=100567

but unfortunately it doesn’t give me the answer as I don’t understand everything written.

It looks related to what you encountered and not yet resolved apparently.

There were also other open programs: I was typing a message in Claws Mail when the system hung, there was a konsole session open, Dolphin too. I don’t know if it would be correct to say that chromium was the culprit just because it was running with a single empty tab in incognito mode. Well - yes, it is the only program which shows in the journal but still it shows after the nouveau error, not before it. Quite confusing.

Is this also happening with a “new; fresh; untainted” user?
The reason I’m asking is, I also noticed this on an “upgraded from Leap 42.3” system which is non-Nvidia and non-Chromium …

For now, my solution is as follows: log out of the Plasma 5 GUI; completely clean out and remove “~/.cache/”; check and clean out everything else in the user’s home directory which seems to be “ancient”; check the files in “/etc/skel/” against those in the user’s home directory and update them as needed to the Leap 15.0 versions (file date … ); log back in again …

Never tried. It happened this one single time.

For now, my solution is as follows: logout from the Plasma 5 GUI; totally clean out and remove " ~/.cache/ ";

Seems irrelevant in my case as I have mounted my ~/.cache as tmpfs.

check and clean out everything else in the user’s home directory which seemed to be “ancient”;

Which is what exactly?

checked the files in " /etc/skel/ " against those in the user’s home directory and updated them as needed to the Leap 15.0 versions (file date … );

I am not sure I understand. Are you saying that you updated file dates? And why is all this necessary? It seems to be a driver bug, no?

Possibly not a bad idea – did you also take a look at simply linking “~/.cache/” to a user’s directory either in ‘/run/user/«User’s UID»/’ or ‘/dev/shm/’ ?

I’ve noticed that Linux upgrade procedures do very little to check the user’s home directories for the consequences of the updated default files located in “/etc/skel/” …

  • Exception #1: KDE – KDE seems to be doing a reasonable job of migrating the user’s configuration from one major update to another …
  • Exception #2: Mozilla Firefox – currently (IMHO) world master in migrating a user’s data from one version (also the minor version changes) to another …

No. I meant check the file date and (assuming that the file hasn’t been edited … ) upgrade to the current “master” version located in “/etc/skel/” …
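
A rough way to do that check without eyeballing every file is sketched below in Python. It only looks at the top-level files in the skeleton directory and reports whether each one is missing, identical, or different in the home directory. The function name skel_report is made up for this example, and it deliberately changes nothing: deciding whether a differing file was edited by the user or is just outdated is still up to you.

```python
import filecmp
import os


def skel_report(skel_dir, home_dir):
    """Compare top-level files in skel_dir with their counterparts in home_dir.

    Returns a sorted list of (name, status) where status is one of
    'missing', 'identical' or 'differs'. Subdirectories are skipped.
    """
    report = []
    for name in sorted(os.listdir(skel_dir)):
        skel_path = os.path.join(skel_dir, name)
        if not os.path.isfile(skel_path):
            continue  # only plain files; .config, .local etc. are ignored here
        home_path = os.path.join(home_dir, name)
        if not os.path.isfile(home_path):
            status = "missing"
        elif filecmp.cmp(skel_path, home_path, shallow=False):
            status = "identical"  # byte-for-byte the same content
        else:
            status = "differs"
        report.append((name, status))
    return report
```

Calling skel_report("/etc/skel", os.path.expanduser("~")) would list the candidates; anything marked “differs” is worth a manual look before overwriting.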

As I mentioned ~8 lines above – IMHO Linux seems to assume that (possibly for historical reasons … ), the system administrators will ensure that the user’s configuration data will be OK for the current Linux system version …

If the (KDE) user’s environment is OK and, “new, fresh” users are also experiencing this issue then, it’s quite possible that, it’s a driver bug or, something else, such as either Qt and/or KDE Frameworks having a timing issue when a given hardware combination is being used …

No. But now that I look I see that /run/user/<id> is only 3.1G while my manual tmpfs mount is 16G. I have no idea how to use /dev/shm.

I’ve noticed that Linux upgrade procedures do very little to check the user’s home directories for the consequences of the updated default files located in “/etc/skel/” …

Here:


# tree -aD /etc/skel/
/etc/skel/
├── [May 18  1996]  .bash_history
├── [May 13  0:19]  .bashrc
├── [Jul 14 13:35]  .claws-mail
│   └── [Mar 19 17:26]  accountrc.tmpl
├── [Jun  7 21:45]  .config
├── [Apr  9 10:49]  .emacs
├── [Jun  7 21:45]  .fonts
├── [Feb 19  2018]  .i18n
├── [Apr  9 10:49]  .inputrc
├── [Jun  7 21:45]  .local
├── [Jul 27 15:11]  .muttrc
├── [May 13  0:19]  .profile
├── [Mar 19  5:26]  .urlview
├── [Feb 19  2018]  .xim.template
├── [Apr 22 21:21]  .xinitrc.template
└── [Jun  7 21:45]  bin

5 directories, 11 files

From those, the ones which “tell me” something meaningful are:

.claws-mail - which I suppose is irrelevant and a left over (as I currently compile CM from source and install it in /opt)
.emacs - I don’t even have emacs installed
.muttrc - same as emacs

Considering that - could you please advise what I need to upgrade in the context you speak about?

As I mentioned ~8 lines above – IMHO Linux seems to assume that (possibly for historical reasons … ), the system administrators will ensure that the user’s configuration data will be OK for the current Linux system version …

Are you saying that this is a wrong assumption? Why would it be? If it really is - then something is quite wrong in the software as a whole.

If the (KDE) user’s environment is OK and, “new, fresh” users are also experiencing this issue then, it’s quite possible that, it’s a driver bug or, something else, such as either Qt and/or KDE Frameworks having a timing issue when a given hardware combination is being used …

The problem is that I don’t know how to reproduce this one-time hang. So even if I take the time to create a new user, migrate all my configurations etc. to it and start working as that user, it may simply be a waste of time. It would make more sense to know how to debug it, but I don’t have that knowledge.

Then, you have a considerable amount of main memory: AFAICS ‘/run/user/«User’s UID»/’ usually allocates about 10 % of the physical installed memory …
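
To check that rule of thumb on a given machine, a minimal Python sketch follows. It compares the size of a mounted filesystem with the installed RAM using os.statvfs and os.sysconf; the helper names fs_size_gib and ram_gib are invented for this example, and the 10 % figure is only the estimate mentioned above, not something the code verifies.

```python
import os

GIB = 2 ** 30


def fs_size_gib(path):
    """Total size of the filesystem that contains `path`, in GiB."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks / GIB


def ram_gib():
    """Physical memory as reported by the OS, in GiB."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB


if __name__ == "__main__":
    runtime_dir = f"/run/user/{os.getuid()}"
    for path in (runtime_dir, "/dev/shm"):
        # If the path is not a separate mount, the numbers describe
        # whatever filesystem happens to contain it.
        if os.path.isdir(path):
            size = fs_size_gib(path)
            share = 100 * size / ram_gib()
            print(f"{path}: {size:.1f} GiB ({share:.0f} % of RAM)")
```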

Provided that, no application you’re aware of is using Emacs components or Mutt components and, you’ve removed the Emacs and Mutt packages then, you can delete the configuration files from the user’s home directory …

Not really; UNIX® and Linux fundamentally have this approach – if one is unsure as to what the user has configured then, don’t touch the user’s configuration when upgrading …

Configuring a new user takes about 5 seconds.

Testing if the new user also experiences the system hang will take some time.
Because, you’ll have to test “default new user” and then, step-by-step, test the effect of adding portions of your configuration, to determine if a particular part of your configuration is provoking the system hang …

32G

Provided that, no application you’re aware of is using Emacs components or Mutt components and, you’ve removed the Emacs and Mutt packages then, you can delete the configuration files from the user’s home directory …

Done.

Not really; UNIX® and Linux fundamentally have this approach – if one is unsure as to what the user has configured then, don’t touch the user’s configuration when upgrading …

This makes sense. Still it doesn’t look like a valid reason for total system hang (which is not user but kernel level, right?)

Because, you’ll have to test “default new user” and then, step-by-step, test the effect of adding portions of your configuration, to determine if a particular part of your configuration is provoking the system hang …

Exactly. I would have to dedicate all my time to it, and it may well turn out to be an experiment without a result. I suppose there isn’t much to be done in this direction, but of course I appreciate the replies.

It ain’t an experiment, it’s (methodical) testing …

AFAICS, there’s no definite evidence that, the hardware and possibly also the driver is causing the issue …
As unpleasant as it may be, the issue may be caused by the user’s environment …

  • Unfortunately, proving that, may involve another identical system running a newer Linux Kernel version with a newer version of the Graphic driver …

By the way, from my personal experience, testing intermittent system outages is extremely resource (tester team + hardware) intensive: real-time 24x7 “totally reliable” (99.999 % uptime – downtime less than 5½ minutes per year) systems usually need several months of continual intensive testing to a) discover the cause of the problem and then b) the same again to prove that the issue has been repaired …

Which are pretty much the same thing. I agree with your observations about intermittent system outage testing.

AFAICS, there’s no definite evidence that, the hardware and possibly also the driver is causing the issue …
As unpleasant as it may be, the issue may be caused by the user’s environment …

Doesn’t that contradict the essential separation between kernel and user space? If something in the home directory can break things for everyone (the whole system)… I don’t know.

No. Even if Kernel and User space are separated, if a user process begins using 99.99 % CPU at a higher priority than other user processes then, the other user processes will have to wait before their system calls are processed …

  • Take a look at the sched(7) man page:

With the advent of the CFS scheduler in kernel 2.6.23, Linux adopted an algorithm that causes relative differences in nice values to have a much stronger effect. In the current implementation, each unit of difference in the nice values of two processes results in a factor of 1.25 in the degree to which the scheduler favors the higher priority process. This causes very low nice values (+19) to truly provide little CPU to a process whenever there is any other higher priority load on the system, and makes high nice values (-20) deliver most of the CPU to applications that require it (e.g., some audio applications).
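
To put numbers on that factor of 1.25, here is a small Python sketch. It is an idealised model that assumes exactly two CPU-bound processes competing for a single CPU, with each nice step scaling the weight by 1.25 as the man page describes. The function name cpu_share is hypothetical, and the real kernel uses a precomputed weight table, so actual shares will only be approximately these.

```python
def cpu_share(nice_a, nice_b, factor=1.25):
    """Approximate CPU fraction for process A when A and B compete for one CPU.

    Lower nice value means higher priority, so the weight grows as nice falls.
    """
    weight_a = factor ** (-nice_a)
    weight_b = factor ** (-nice_b)
    return weight_a / (weight_a + weight_b)


if __name__ == "__main__":
    for a, b in [(0, 0), (0, 1), (0, 5), (-20, 19)]:
        print(f"nice {a:>3} vs nice {b:>3}: A gets {100 * cpu_share(a, b):.2f} % of the CPU")
```

Under this model a single nice step already shifts the split from an even half to roughly 56 % versus 44 %, and the extreme pairing of -20 against +19 leaves the low-priority process with almost nothing.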

But that’s something different. What we are talking about is: some setting made in the user space (through a home dir config) causing the whole system to hang. If this is possible - then something is broken on a much deeper (conceptual) level.

My current working view is that, there’s nothing broken with respect to User Space and System Space …

  • Yes, the User Space can make system calls which, via the Kernel, access System Space things such as Drivers and, if a Driver is broken then, the user’s System Call can cause the system to hang – due to a coding error in the Driver …

And, AFAICS, this doesn’t mean that, the implementation of User Space and System Space is broken …

Sounds like such a coding error opens another side-channel vulnerability.