Desktop boots to KDE, but freezes minutes later

myswtest · September 14, 2023, 12:40pm

This all happened after a zypper dup 2-3 days ago.

I thought I had a graphics issue, but possibly, after watching my desktop box last night, it’s more than that.

Proc: AMD Ryzen 7
Graphics: AMD Radeon RX 560
MB: Asus Crosshair VII x470.

2-3 days ago, running my daily zypper dup, suddenly, the screen went blank, then showed as a very faint, dull, red and black speckled screen all over. I’d read other posts about a graphics issue after a zypper dup.

However … I could not switch to Console 1. I waited 5 minutes to see if the system would recover. Did not.

I did glance over at the desktop case - the network card’s flashing yellow LED was off (it’s always flickering). I also looked at the hard drive LED - zero activity. So this is more than a graphics issue, I’m sure. I had to press the Reboot button on the case.

After booting up again, I checked for signs of the crash with journalctl. Nothing indicates the crash.

Strangely, I have a second separate nvme drive, which can independently boot, also using TW, and basically configured the same. So, I have two independent bootable drives. Why? In case one fails, I can instantly boot the other and do work.

I haven’t seen the issue on this second instance (same desktop machine, just different drive).

I also have a laptop, but with Intel proc and non-AMD graphics, with TW, and it works flawlessly.

Any thoughts about the issue? As I mentioned, nothing obvious in journalctl (nothing obvious to me, anyway).

gogalthorp · September 14, 2023, 12:57pm

Try with a different user; it may be a DE config problem.

myswtest · September 14, 2023, 1:30pm

Thanks for feedback…

Okay, booted up to KDE Plasma, logged in as a second user acct rarely used.

This time, the screen blanked out after 3-4 minutes, then the machine auto-rebooted. While I was waiting for the Grub menu, about four lines of what I think were errors showed up, probably coming from the BIOS (??), but they didn’t last long enough for me to read.

Anyway, Grub showed, booted in and logged in a second time to this rarely used user acct. This time, I jumped to Console 1, gathered journalctl logs.

I then switched back to Console 2 (GUI + networking, better known as Console 7), and in less than a minute, the screen went hazy, I could not switch to Console 1, the network card LED stopped flashing, and the hard drive (nvme) LED stopped.

That’s where the desktop is now - the box is still powered and lit up with lights, but there is no drive/network activity, and the Dell monitor reports, “No activity from DP input port” and puts the monitor to sleep.

This is possibly a hardware issue (?).

What I think I will do now is power down, wait an hour, then boot to the other nvme/TW drive instance and see what happens.

AdmFubar · September 14, 2023, 7:15pm

I’m thinking it might be a drive issue… What are the specs of your drives? Make and models, and how much space is used/free?

myswtest · September 14, 2023, 10:50pm

Quick update, then I’ll answer the drive question.

After the two freezes earlier, I booted back a 3rd time. All was fine - no freeze. Stayed logged in for an hour, doing a bit of work. I then shut down the machine (off) and decided to let it rest for a couple of hours.

After two hours, I booted it to the second drive with the TW install (never saw the issue on that instance). No problems - I worked for an hour, then shut the machine down for an hour. Then I booted to the first TW drive / instance, where I’ve seen the issue a few times. Nothing yet. It’s still running right now, and idle.

This is a wild hunch - I always shut it down for the night, so it sits for 8-10 hours until I boot it up again. I always boot to that 1st instance, where I see the issue … then I boot to the second instance and run a zypper dup, etc, and do a bit of work. But I never see the issue with that second instance.

I have gathered the logs that cover the last couple of boots, so I’ll need to scan through those and post when I find something. I figure, if I’m gonna see this happen again, it’ll be tomorrow, after the machine has sat for hours, off.

myswtest · September 14, 2023, 11:01pm

Here we go
smartctl

# smartctl -a /dev/nvme1n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.2-1-default] (SUSE RPM)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 500GB
Serial Number:                      xxxxxxxxxxxxxxx
Firmware Version:                   1B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            69,765,758,976 [69.7 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5781b20298
Local Time is:                      Thu Sep 14 16:52:36 2023 CDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    11,799,972 [6.04 TB]
Data Units Written:                 20,283,604 [10.3 TB]
Host Read Commands:                 100,314,033
Host Write Commands:                296,616,545
Controller Busy Time:               1,231
Power Cycles:                       725
Power On Hours:                     2,517
Unsafe Shutdowns:                   138
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,509
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               44 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       248     0  0x0014  0x4004      -            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
#
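
For reference, the bracketed TB figures in the SMART section above can be reproduced by hand. This is only a rough sketch, assuming the NVMe convention that one data unit is 1000 blocks of 512 bytes; the function name `data_units_to_tb` is made up for illustration and the input numbers are copied from the report.

```python
# Assumption: one NVMe "data unit" is 1000 x 512-byte blocks.
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_tb(units: int) -> float:
    """Convert an NVMe data-unit count to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 10**12


read_tb = data_units_to_tb(11_799_972)     # "Data Units Read" from the report
written_tb = data_units_to_tb(20_283_604)  # "Data Units Written" from the report
unsafe_ratio = 138 / 725                   # "Unsafe Shutdowns" / "Power Cycles"

print(f"read: {read_tb:.2f} TB, written: {written_tb:.2f} TB")
print(f"unsafe shutdowns: {unsafe_ratio:.0%} of power cycles")
```

The unsafe-shutdown counter covers the whole life of the drive, so it cannot say when those events happened; it only shows that hard resets and power losses are not new for this drive.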

size info … first df, then btrfs version

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1p2   40G   16G   24G  41% /
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs            32G  4.0K   32G   1% /dev/shm
efivarfs        128K   42K   82K  34% /sys/firmware/efi/efivars
tmpfs            13G  2.1M   13G   1% /run
/dev/nvme1n1p2   40G   16G   24G  41% /.snapshots
/dev/nvme1n1p2   40G   16G   24G  41% /boot/grub2/i386-pc
/dev/nvme1n1p2   40G   16G   24G  41% /opt
/dev/nvme1n1p2   40G   16G   24G  41% /var
/dev/nvme1n1p2   40G   16G   24G  41% /root
/dev/nvme1n1p2   40G   16G   24G  41% /usr/local
/dev/nvme1n1p2   40G   16G   24G  41% /srv
/dev/nvme1n1p2   40G   16G   24G  41% /boot/grub2/x86_64-efi
/dev/nvme1n1p2   40G   16G   24G  41% /tmp
/dev/nvme1n1p1  500M  5.9M  494M   2% /boot/efi
tmpfs           6.3G   56K  6.3G   1% /run/user/1000

/dev/nvme1n1p3  394G   47G  348G  12% /home
#

… btrfs

# btrfs filesystem usage -T /
Overall:
    Device size:                  40.00GiB
    Device allocated:             20.99GiB
    Device unallocated:           19.01GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         15.90GiB
    Free (estimated):             23.71GiB      (min: 23.71GiB)
    Free (statfs, df):            23.71GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:               46.89MiB      (used: 0.00B)
    Multiple profiles:                  no

                  Data     Metadata  System                             
Id Path           single   single    single   Unallocated Total    Slack
-- -------------- -------- --------- -------- ----------- -------- -----
 1 /dev/nvme1n1p2 19.96GiB   1.00GiB 32.00MiB    19.01GiB 40.00GiB     -
-- -------------- -------- --------- -------- ----------- -------- -----
   Total          19.96GiB   1.00GiB 32.00MiB    19.01GiB 40.00GiB 0.00B
   Used           15.26GiB 649.83MiB 16.00KiB                           
#

AdmFubar · September 15, 2023, 9:16pm

Everything looks ok from the report. I’d try booting from an external drive, to make sure that it isn’t an issue with the motherboard. I was thinking it was an issue with your drive (it still could be), but you aren’t dealing with a SanDisk or WD; there are some issues with those.
Try booting from an external drive to test your system and see if the issue persists. Are there any cards plugged into the system? PCIe or the like? You may need to pull them and start the system just to make sure it isn’t an easy-to-replace card that might be acting up.

myswtest · September 15, 2023, 10:11pm

Thanks for the research and follow-up!

That desktop machine has two NVME drives - and each drive can boot TW independently of the other.

The GRUB on each drive has an entry to boot its native TW, and can also boot the TW on the other drive.

The idea is if one of the TWs experiences a catastrophic failure, I can instantly boot to the other and do work. And take my time to fix the other.

So #1 works fine and shows no symptoms of freezing, or other odd behavior. It’s #2 that has the freezes.

I’m still running thru the logs to see if anything shows up.

mrmazda · September 16, 2023, 12:21am

Is the problem stick NVME V4 in a PCIe V4 slot, while the other NVME is in a PCIe V3 slot? If yes, this could be a V4 hardware issue. I think you’d need to spend more time looking in dmesg rather than journal if it’s a hardware issue.

myswtest · September 16, 2023, 12:57am

Thanks for the feedback!

One drive is a 960 EVO Series 500GB NVMe M.2 Internal SSD, bought in April 2017.

The other drive is a SAMSUNG 970 EVO M.2 2280 500GB PCIe Gen3. X4, NVMe 1.3 64L V-NAND 3-bit MLC Internal Solid State Drive (SSD) MZ-V7E500BW, bought in August 2018.

I bought each for two different machines I built. Eventually, I retired the older machine, so I took that NVME and added to the newer machine, which is what I’m using today.

I will double-check which is which and which slots each is plugged into. I will take a guess now that the older 960 NVME is the ‘secondary’ drive/TW, the one with the freezing problem.

They have both been in that machine since 2018 /2020 (960).

I’ll be booting that machine in about 20 minutes

myswtest · September 16, 2023, 2:58am

Okay - I booted the machine first, and installed a BIOS update (was behind one version), then shut it down for an hour.

Booted 2nd time and ran zypper dup (do it daily), let it run 30 minutes - no freeze, shut it down. Booted 3rd time, still no freeze, been running 30 minutes now. Usually the random freeze happens on the 1st bootup of the day.

So, both M.2 (nvme) slots are the same. The random freeze happens with the TW install on the Samsung 970 EVO drive.

Motherboard M2 slot specs

M.2_1 Socket 3 with M Key, type 2242/2260/2280/22110 
(PCIE 3.0 x4 and SATA modes) storage devices support

M.2_2 Socket 3 with M Key, type 2242/2260/2280 
(PCIE 3.0 x4) storage devices support

myswtest · September 16, 2023, 2:18pm

Booted at 08.30, then I noticed the screen turn red/black at 8.38.
So I hit the Reset button on the machine, and booted back in, and captured journal logs.

So, here’s my debugging problem - notice the log below.
The last log entry was at 8.33, but it froze at 8.38.
Then we see the boot re-start at 8.39 (when I hit Reset button).

So there is nothing in the logs that indicates any sort of error / failure.
The freeze happened twice yesterday and the logs are the same - logging stops, and it’s minutes later that the display goes blank.

As I mentioned, suddenly, the screen turns a dull red/black, then 2-3 seconds, it goes blank.
The network activity LED goes out, and I don’t see the disk drive LED flicker anymore.
Any keyboard activity is ignored, so my only option is to hit Reset on machine.

Very perplexing. Not sure how to proceed now, with zero failure evidence.
I guess I need to run a RAM-testing app, and then a disk-drive testing app.

(keep in mind, the other TW install on the other drive doesn’t show this behavior, so maybe a disk drive issue?)

# journalctl -S today

---- other log entries since boot up snipped ----
Sep 16 08:32:10 dbus-daemon[2043]: [session uid=1000 pid=2043] Successfully activated service 'org.kde.kwalletd5'
Sep 16 08:32:10 latte-dock[3206]: [3206:3214:0916/083210.309745:ERROR:object_proxy.cc(642)] Failed to call method: org.kde.KWallet.close: 
Sep 16 08:32:10 latte-dock[3206]: [3206:3214:0916/083210.309770:ERROR:kwallet_dbus.cc(418)] Error contacting kwalletd5 (close)
Sep 16 08:33:07 systemd[2016]: app-opera\x2dbeta-968233e038794922b60714da983a5ea4.scope: Consumed 48.271s CPU time.

-- Boot 2cdc964dec8344cc8dc680a6396b9342 --
Sep 16 08:39:55 kernel: Linux version 6.5.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 13.2.1 20230803 
Sep 16 08:39:55 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.2-1-default root=UUID=7875dffe-6f22-
Sep 16 08:39:55 kernel: BIOS-provided physical RAM map:
---- log continues ----
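
To put a number on the silent window in the excerpt above, the two timestamps can simply be subtracted. A small sketch, assuming journalctl's default short timestamp format and an English locale for the month name, with the year taken from the post date; `journal_time` is a hypothetical helper, not part of any tool.

```python
from datetime import datetime


def journal_time(line: str, year: int = 2023) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a short-format journal line."""
    stamp = " ".join(line.split()[:3])
    # The short format carries no year, so one is supplied explicitly.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Last entry of the frozen boot and an early entry of the next boot,
# both copied from the excerpt above.
last_entry = r"Sep 16 08:33:07 systemd[2016]: app-opera\x2dbeta-968233e038794922b60714da983a5ea4.scope: Consumed 48.271s CPU time."
next_boot = "Sep 16 08:39:55 kernel: BIOS-provided physical RAM map:"

silent = journal_time(next_boot) - journal_time(last_entry)
print(f"no journal entries for {silent.total_seconds():.0f} seconds")
```

A gap like this does not prove that nothing was logged: entries written shortly before a hard freeze may never have reached the disk, so the on-disk journal can end earlier than the real failure.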

myswtest · September 16, 2023, 6:45pm

Well, I’ve had two freezes earlier (I’m currently booted into the other TW/drive instance). This last freeze, I rebooted again and captured journal logs - I’m thinking I found something wrong with the NVME drive for that other TW instance.

However, I downloaded a tool called HDSentinel to do a check on that other NVME.
It’s odd, but it reports the drive is “perfect”.

The first code block is the output of HDSentinel - the second code block is the last few lines of the journal log, just before it froze - you will see the BTRFS errors at the end before the next boot-up sequence begins.

Any thoughts (?) - maybe the BTRFS filesystem for / is going bad? (my /home is separate )

# ./HDSentinel
Hard Disk Sentinel for LINUX console 0.20.10851 (c) 2023 info@hdsentinel.com
Start with -r [reportfile] to save data to report, -h for help

Examining hard disk configuration ...

HDD Device  0: /dev/nvme0             
HDD Model ID : Samsung SSD 970 EVO 500GB
HDD Serial No: xxxxxxxxxxxxxxx
HDD Revision : 1B2QEXE7
HDD Size     : 476940 MB
Interface    : NVMe
Temperature  : 41 °C
Highest Temp.: 41 °C
Health       : 100 %
Performance  : 100 %
Power on time: 105 days, 0 hours
Est. lifetime: more than 1000 days
Total written: 9.46 TB
  The status of the solid state disk is PERFECT. Problematic or weak sectors were not found. 
  The health is determined by SSD specific S.M.A.R.T. attribute(s):  Available Spare (Percent), Percentage Used
    No actions needed.

journal logs

Sep 16 09:04:40 systemd[1]: Finished Backup /etc/sysconfig directory.
Sep 16 09:10:02 smartd[1068]: Device: /dev/sda [SAT], old test of type S not run at 
Sep 16 09:10:02 smartd[1068]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Sep 16 09:26:08 systemd[1]: Starting Backup RPM database...
Sep 16 09:26:08 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:08 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1722, gen 0
Sep 16 09:26:08 backup-rpmdb[6015]: cat: /usr/lib/sysimage/rpm/Packages.db: Input/output error
Sep 16 09:26:08 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:08 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1723, gen 0
Sep 16 09:26:08 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:08 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1724, gen 0
Sep 16 09:26:09 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:09 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1725, gen 0
Sep 16 09:26:09 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:09 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1726, gen 0
Sep 16 09:26:18 backup-rpmdb[6021]: gzip: stdin: Input/output error
Sep 16 09:26:18 backup-rpmdb[6011]: ERROR!! can not backup RPM Database to /var/adm/backup/rpmdb.
Sep 16 09:26:18 backup-rpmdb[6011]: Maybe there is not enough disk space.
Sep 16 09:26:18 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:18 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1727, gen 0
Sep 16 09:26:18 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:18 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1728, gen 0
Sep 16 09:26:18 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1
Sep 16 09:26:18 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1729, gen 0
Sep 16 09:26:18 systemd[1]: backup-rpmdb.service: Deactivated successfully.
Sep 16 09:26:18 systemd[1]: Finished Backup RPM database.
Sep 16 09:26:18 systemd[1]: backup-rpmdb.service: Consumed 10.065s CPU time.
Sep 16 09:30:14 wickedd-dhcp6[1180]: enp6s0: Committing DHCPv6 lease with:
Sep 16 09:30:14 wickedd-dhcp6[1180]: enp6s0    +ia-na.address xxxx:xxxx:9e0:xxxx::23/0, pref-lft 3600, valid-lft 3600
---------- logging stopped, then froze some minutes later ---------------

-- Boot xxxxxxxxxxxd04e6fa9725ba44f1416ac --
Sep 16 09:59:29 kernel: Linux version 6.5.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 
Sep 16 09:59:29 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.2-1-default root=UUID=78
Sep 16 09:59:29 kernel: BIOS-provided physical RAM map:
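
One way to read the BTRFS lines above is to group them by what they point at. The sketch below is only an illustration: the regular expressions are written against the lines as pasted in this thread, and `summarize` is a hypothetical helper name. It counts checksum failures per (device, root, inode, offset) and collects the running `corrupt` counter.

```python
import re
from collections import Counter

# Patterns match the kernel lines as they appear in the journal excerpt above.
CSUM_RE = re.compile(
    r"BTRFS warning \(device ([^)]+)\): csum failed root (\d+) ino (\d+) off (\d+)"
)
CORRUPT_RE = re.compile(r"BTRFS error \(device [^)]+\): .*corrupt (\d+),")


def summarize(lines):
    """Return (failures per location, list of 'corrupt' counter values)."""
    failures = Counter()
    corrupt = []
    for line in lines:
        if m := CSUM_RE.search(line):
            dev, root, ino, off = m.groups()
            failures[(dev, int(root), int(ino), int(off))] += 1
        elif m := CORRUPT_RE.search(line):
            corrupt.append(int(m.group(1)))
    return failures, corrupt


# First two warning/error pairs, copied from the excerpt above.
sample = [
    "Sep 16 09:26:08 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1",
    "Sep 16 09:26:08 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1722, gen 0",
    "Sep 16 09:26:08 kernel: BTRFS warning (device nvme0n1p2): csum failed root 533 ino 9541247 off 202940416 csum 0xc4672adb expected csum 0x1284964b mirror 1",
    "Sep 16 09:26:08 kernel: BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 1723, gen 0",
]

failures, corrupt = summarize(sample)
for location, count in failures.items():
    print(location, count)
print("corrupt counter values:", corrupt)
```

In the excerpt, every warning names the same root, inode and offset, and the failures coincide with backup-rpmdb getting an I/O error on Packages.db, so this may be one damaged block being read repeatedly rather than damage spread across the filesystem. The counter already standing at 1722 suggests earlier checksum failures had been recorded before this boot.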