Been Running Tumbleweed For 6+ Months - Today openSUSE Froze For The First Time. How To Troubleshoot?

Hi,

Been running the openSUSE Tumbleweed KDE 64-bit OS for over six months with no issues.
Today the OS froze for the first time and a forced hard reset was necessary.

Desktop is old but dependable.
Hardware is in excellent working order.

How would I troubleshoot a freeze like the above?
I don’t know much about where to look for the freeze cause.

Let me know, best OS on the planet!

Jesse

My 2 cents:

  • Please describe the freeze in detail: is the Caps-Lock LED working? Is there still disk activity? (For example, if you plug a thumbdrive into the frozen system, does the hard-drive LED blink?) Does unplugging and reconnecting the keyboard or other peripherals do anything?
  • When the system freezes next time, try switching to a text console (Ctrl+Alt+F1). Maybe you can see some kernel messages there and, with any luck, log in as root; check dmesg, the X server logs and journalctl. Do a clean shutdown or reboot from the text console to prevent data loss (»sync && shutdown -h now« or »sync && reboot«, respectively).
  • Inspect the logs around the time when the system first froze. Compare them with older logs from when it was still working. Have any log entries changed, or have any new types of errors appeared? (Comparing boots requires a persistent journal; see the note below.)
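
Comparing boots only works if journald keeps its logs across reboots. openSUSE normally stores the journal persistently, but if »journalctl --list-boots« shows only the current boot, persistent storage can be enabled (a minimal sketch; the paths are the standard systemd ones):

# create the persistent journal directory and let systemd set its permissions:
$ sudo mkdir -p /var/log/journal
$ sudo systemd-tmpfiles --create --prefix /var/log/journal
# alternatively, set Storage=persistent in /etc/systemd/journald.conf, then:
$ sudo systemctl restart systemd-journald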

I use journalctl and vim’s diff mode for my log inspections. For example:

# list date, time and ID of the last few boots:
$ journalctl --list-boots | tail -n $LINES

# If the crash happened before the most recent reboot, comparing those logs with the ones before could give insights:
$ journalctl -b -1 --output=short-precise --no-hostname > bad-boot.text
$ journalctl -b -2 --output=short-precise --no-hostname > good-boot.text
$ vim -d good-boot.text bad-boot.text

The example commands above take all log entries collected by journald during the two most recent boots, write them to two files and have vim compare them side by side. (I also did that to optimize my initrd/systemd activity during boot.) Especially important are the last few log lines before the system froze — maybe journald managed to log some clues to the cause of the freeze.

If you are in vim, just jump to the end of the files (shift-G or »:$« in normal mode/vim-diff mode; also handy in vim-diff mode: ctrl-W ctrl-W to switch between the left- and right-hand halves).

I also improvised the following grep through the current system log to highlight any obvious warnings, failures, errors or anomalies:

journalctl -b --no-hostname --output=short-precise | grep --color=always -Ein '\S*( no |warn|\Werr|fail|conflict|igno|repeat|inval|n[^a-np-z]t )\S*\s*\S*'
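
A simpler, if coarser, alternative is to let journalctl filter by syslog priority itself; »-p warning« shows every entry of priority warning or worse:

$ journalctl -b -p warning --no-hostname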

Finally, IF the freeze is related to the X server having trouble with graphics drivers or video hardware, reading through ~/.xsession-errors (the current one and the previous one) might give clues; similarly, read through /var/log/Xorg.0.log:

less -N .xsession-errors
less -N .xsession-errors.old
less -N /var/log/Xorg.0.log
less -N /var/log/Xorg.0.log.old
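
On setups where the X server runs without root privileges, Xorg writes its log into the home directory instead; it may be worth checking both locations:

less -N ~/.local/share/xorg/Xorg.0.log
less -N ~/.local/share/xorg/Xorg.0.log.old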

Good luck, let the forum know what you’ve discovered!

This is typical behavior: https://forums.opensuse.org/showthread.php/516753-Corrupted-BTRFS

How would I troubleshoot a freeze like the above? I don’t know much about where to look for the freeze cause.

Make sure btrfs gets the resources it needs for smooth operation. Running the maintenance routines should not result in excessive load:

3400G:~ # journalctl -b -u btrfs*
-- Logs begin at Sat 2020-12-12 12:09:37 CET, end at Sun 2021-01-03 12:06:44 CET. -- 
Jan 03 10:08:45 3400G systemd[1]: Started Watch /etc/sysconfig/btrfsmaintenance. 
Jan 03 10:08:45 3400G systemd[1]: Started Balance block groups on a btrfs filesystem. 
Jan 03 10:08:45 3400G systemd[1]: Started Defragment file data and/or directory metadata. 
Jan 03 10:08:45 3400G systemd[1]: Started Scrub btrfs filesystem, verify block checksums. 
Jan 03 10:08:45 3400G systemd[1]: Started Discard unused blocks on a mounted filesystem. 
Jan 03 10:08:45 3400G systemd[1]: Started Discard unused blocks on a mounted filesystem. 
Jan 03 10:08:45 3400G systemd[1]: Started Scrub btrfs filesystem, verify block checksums. 
Jan 03 10:08:45 3400G systemd[1]: Started Balance block groups on a btrfs filesystem. 
Jan 03 10:08:45 3400G systemd[1]: Started Defragment file data on a mounted filesystem. 
Jan 03 10:08:45 3400G btrfs-balance.sh[764]: Before balance of / 
Jan 03 10:08:45 3400G btrfs-scrub.sh[781]: Running scrub on / 
Jan 03 10:08:45 3400G btrfs-trim.sh[776]: Running fstrim on / 
Jan 03 10:08:45 3400G btrfs-balance.sh[764]: Data, single: total=20.01GiB, used=15.76GiB 
Jan 03 10:08:45 3400G btrfs-balance.sh[764]: System, single: total=32.00MiB, used=16.00KiB 
Jan 03 10:08:45 3400G btrfs-balance.sh[764]: Metadata, single: total=1.25GiB, used=721.73MiB 
Jan 03 10:08:45 3400G btrfs-balance.sh[764]: GlobalReserve, single: total=46.55MiB, used=0.00B 
Jan 03 10:08:45 3400G btrfs-balance.sh[764]: Filesystem     Size  Used Avail Use% Mounted on 
Jan 03 10:08:45 3400G btrfs-balance.sh[764]: /dev/sda3        43G     18G   25G   42% / 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: Scrub device /dev/sda3 (id 1) done 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: Scrub started:    Sun Jan  3 10:08:45 2021 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: Status:           finished 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: Duration:         0:00:35 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: Total to scrub:   21.29GiB 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: Rate:             481.65MiB/s 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: Error summary:    no errors found 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: flock: getting lock took 0.000002 seconds 
Jan 03 10:09:20 3400G btrfs-scrub.sh[781]: flock: executing btrfs 
Jan 03 10:09:20 3400G systemd[1]: btrfs-scrub.service: Succeeded. 
Jan 03 10:09:27 3400G systemd[1]: btrfs-defrag.service: Succeeded. 
Jan 03 10:09:30 3400G btrfs-trim.sh[776]: /: 23.5 GiB (25242193920 bytes) trimmed 
Jan 03 10:09:30 3400G btrfs-trim.sh[776]: flock: getting lock took 34.545609 seconds 
Jan 03 10:09:30 3400G btrfs-trim.sh[776]: flock: executing fstrim 
Jan 03 10:09:30 3400G systemd[1]: btrfs-trim.service: Succeeded. 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Done, had to relocate 0 out of 27 chunks 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: getting lock took 43.748183 seconds 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: executing btrfs 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Dumping filters: flags 0x1, state 0x0, force is off 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]:   DATA (flags 0x2): balancing, usage=5 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Done, had to relocate 0 out of 27 chunks 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: getting lock took 0.000003 seconds 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: executing btrfs 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Dumping filters: flags 0x1, state 0x0, force is off 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]:   DATA (flags 0x2): balancing, usage=10 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Done, had to relocate 0 out of 27 chunks 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: getting lock took 0.000003 seconds 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: executing btrfs 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Done, had to relocate 0 out of 27 chunks 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: getting lock took 0.000002 seconds 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: executing btrfs 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Dumping filters: flags 0x6, state 0x0, force is off 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]:   METADATA (flags 0x2): balancing, usage=3 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]:   SYSTEM (flags 0x2): balancing, usage=3 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Done, had to relocate 1 out of 27 chunks 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: getting lock took 0.000002 seconds 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: flock: executing btrfs 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: After balance of / 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Data, single: total=20.01GiB, used=15.72GiB 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: System, single: total=32.00MiB, used=16.00KiB 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Metadata, single: total=1.25GiB, used=792.05MiB 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: GlobalReserve, single: total=46.86MiB, used=0.00B 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: Filesystem     Size  Used Avail Use% Mounted on 
Jan 03 10:09:30 3400G btrfs-balance.sh[764]: /dev/sda3        43G     18G   25G   42% / 
Jan 03 10:09:30 3400G systemd[1]: btrfs-balance.service: Succeeded. 
3400G:~ #

Run the routines according to the default schedule and make sure they succeed.
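
Depending on the btrfsmaintenance version, the routines are driven either by cron or by systemd timers. On a timer-based setup (unit names assumed to match the services in the log above), you can check that they are scheduled with:

3400G:~ # systemctl list-timers 'btrfs*'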

Make sure unallocated space is available:

3400G:~ # btrfs filesystem usage -T /
Overall: 
    Device size:                  40.00GiB 
    Device allocated:             21.29GiB 
    Device unallocated:           18.71GiB 
    Device missing:                  0.00B 
    Used:                         16.67GiB 
    Free (estimated):             22.81GiB      (min: 22.81GiB) 
    Data ratio:                       1.00 
    Metadata ratio:                   1.00 
    Global reserve:               46.86MiB      (used: 0.00B) 
    Multiple profiles:                  no 

             Data     Metadata  System               
Id Path      single   single    single   Unallocated 
-- --------- -------- --------- -------- ----------- 
 1 /dev/sda3 20.01GiB   1.25GiB 32.00MiB    18.71GiB 
-- --------- -------- --------- -------- ----------- 
   Total     20.01GiB   1.25GiB 32.00MiB    18.71GiB 
   Used      15.90GiB 788.44MiB 16.00KiB             
3400G:~ #

At some point btrfs will try to allocate one or more chunks of 1 GiB in size: https://ohthehugemanatee.org/blog/2019/02/11/btrfs-out-of-space-emergency-response/
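
If unallocated space does run low, a manual filtered balance can usually reclaim partially used chunks. The usage thresholds are a judgment call; the maintenance scripts in the log above used 5 and 10:

3400G:~ # btrfs balance start -dusage=10 -musage=10 /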

Check cleanup of snapshots for proper operation:

3400G:~ # snapper list
    # | Type   | Pre # | Date                     | User | Cleanup | Description           | Userdata     
------+--------+-------+--------------------------+------+---------+-----------------------+--------------
   0  | single |       |                          | root |         | current               |              
 649* | single |       | Sat Oct 10 06:45:34 2020 | root |         | writable copy of #646 |              
1336  | pre    |       | Fri Dec 25 09:21:34 2020 | root | number  | zypp(zypper)          | important=yes
1337  | post   |  1336 | Fri Dec 25 09:30:03 2020 | root | number  |                       | important=yes
1338  | pre    |       | Fri Dec 25 09:30:44 2020 | root | number  | zypp(zypper)          | important=yes
1339  | post   |  1338 | Fri Dec 25 09:31:04 2020 | root | number  |                       | important=yes
1352  | pre    |       | Tue Dec 29 20:42:05 2020 | root | number  | zypp(zypper)          | important=yes
1353  | post   |  1352 | Tue Dec 29 20:44:00 2020 | root | number  |                       | important=yes
1354  | pre    |       | Tue Dec 29 20:45:00 2020 | root | number  | zypp(zypper)          | important=yes
1355  | post   |  1354 | Tue Dec 29 20:48:40 2020 | root | number  |                       | important=yes
1356  | pre    |       | Tue Dec 29 20:49:53 2020 | root | number  | zypp(zypper)          | important=yes
1357  | post   |  1356 | Tue Dec 29 20:50:13 2020 | root | number  |                       | important=yes
1378  | pre    |       | Fri Jan  1 06:06:59 2021 | root | number  | yast partitioner      |              
1379  | post   |  1378 | Fri Jan  1 06:07:51 2021 | root | number  |                       |              
1380  | pre    |       | Fri Jan  1 06:29:00 2021 | root | number  | zypp(zypper)          | important=no 
1381  | post   |  1380 | Fri Jan  1 06:29:15 2021 | root | number  |                       | important=no 
1382  | pre    |       | Sat Jan  2 07:27:21 2021 | root | number  | zypp(zypper)          | important=no 
1383  | post   |  1382 | Sat Jan  2 07:28:47 2021 | root | number  |                       | important=no 
1384  | pre    |       | Sat Jan  2 09:53:56 2021 | root | number  | yast bootloader       |              
1385  | post   |  1384 | Sat Jan  2 09:54:37 2021 | root | number  |                       |              
1386  | pre    |       | Sat Jan  2 09:57:52 2021 | root | number  | yast bootloader       |              
1387  | post   |  1386 | Sat Jan  2 09:58:36 2021 | root | number  |                       |              
1388  | pre    |       | Sun Jan  3 10:23:57 2021 | root | number  | zypp(zypper)          | important=no 
1389  | post   |  1388 | Sun Jan  3 10:24:02 2021 | root | number  |                       | important=no 
3400G:~ # 

To release occupied space you may need to move the default subvolume from snapshot #649 forward to the current one (#1389).
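
The asterisk after #649 in the snapper list above marks the snapshot the system currently boots from; it can be confirmed with:

3400G:~ # btrfs subvolume get-default /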

Enabling quotas may result in increased system load. Disable them when unneeded.
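
To check whether quota support (qgroups) is active, and to turn it off if it is (»btrfs qgroup show« simply reports an error when quotas are disabled):

3400G:~ # btrfs qgroup show /
3400G:~ # btrfs quota disable /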

Hi
How valid is this after almost two years and many kernel updates?

I don’t run with snapshots; Tumbleweed has been so solid for me that I don’t see a need for them with the GNOME DE. Just data backups, and roll on as Tumbleweed goes…

The wiki says:

**All filesystems for best performance should have some minimum percentage of free space, usually 5-10%. Btrfs, using a two-level space manager (chunks and blocks within chunks) plus being based on (nearly) copy-on-write, benefits from more space being kept free, around 10-15%, and in particular from having at least one or a few chunks fully unallocated.**

**It is quite useful to balance periodically any Btrfs volume subject to updates, to prevent the allocation of every chunk in the volume.**
https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Data_usage_and_allocation

When I was working for a living, the biggest annoyance was always waiting for data on disks and NFS mounts. Waiting can be minimized by limiting disk usage to some 70% and having plenty of RAM. My machines are fast when booted, and they even get faster after being up for some days. I never needed to boot into a snapshot, and I used rollback only a few times, for convenience. But I would never disable snapshots just to gain some 20 GB of disk space.