Experiencing random crashes ever since 20240927 - can't figure out what's going on

Hello openSUSE forums! I’m writing because ever since I upgraded to snapshot 20240927 with the 6.11.0 kernel, I’ve been experiencing occasional, random total freezes of the system that I can’t for the life of me figure out how to diagnose. I also can’t figure out how to reproduce them, which is why I haven’t filed any bug reports. I made two different posts about it in the subreddit (1, 2) and they recommended asking about it here. I also uploaded three different logfiles to those posts, collected at various times this has happened. Some folks on the subreddit suggested it might have to do with a panic involving my wifi card, which is treated as a USB device for some reason (I’m on an Acer Aspire 7 laptop). Any idea what’s going on here?

Links to logfiles: (1, 2, 3)

Have you tried testing memory? Random crashes are a common result of RAM trouble.

So first there’s a problem with usb 1-4 which can’t be reset, perhaps a sleep / resume issue. Usb 1-4 is a Mediatek wireless device. Could you do a lsusb?

Then there’s a kernel call trace related to the usb hub, then the kernel detects a stalled queue, which is good, I mean good that it detected the stalled condition, not that it stalled in the first place. Then the kernel tries to tell us all about it with backtraces and that sinks the ship.

Does this only happen after resume from sleep or also if you do a full shutdown and restart?
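To pull just the lines about that USB port out of a kernel log, something like the following might help. It is only a minimal sketch: the function name usb_port_lines is made up for illustration, and it assumes the kernel prefixes those messages with "usb 1-4:" as in the logs discussed above.

```shell
# Hypothetical helper: keep only kernel log lines about one USB port.
# Reads log text on stdin; the port defaults to 1-4 (the one named in this thread).
usb_port_lines() {
    port="${1:-1-4}"
    # -F treats the pattern as a fixed string, so "-" and ":" are matched literally.
    grep -F "usb ${port}:"
}

# Example (assumes the freeze happened during the previous boot):
#   sudo journalctl -k -b -1 | usb_port_lines 1-4
```

Here journalctl -k limits the output to kernel messages and -b -1 selects the previous boot; a different boot offset may be needed depending on how many reboots happened since the freeze.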

Not sure how to do that exactly. A reddit user recommended running memtest86+ on a live distro like system rescue; is that it?

Happens during normal system operation as well as upon waking from sleep. The logs are from crashes that occurred during normal operation.

That’s one way. Another may be in the advanced options in your Grub menu. If there isn’t one already there, you can use YaST to put one there.

I don’t know if it’s changed, but at least until recently, memtest86+ wouldn’t work with efi boot, only with mbr IIRC.

I want to say that I’ve seen it with efi more recently in my boot menu with TW, but I don’t reboot frequently enough to remember offhand.

All my UEFI installations have memtest86 free version installed on my ESPs and in my custom.cfgs. I’ve heard there is a memtest86+ 7 that works on UEFI, but I’ve never confirmed it. memtest86+-7.00-2.1 is in TW repos now.

Do you recommend 9 passes with memtest? How many passes do you recommend?

4 is my minimum, but the more the better, such as overnight, 7 hours+, maximizing the confidence that the result is correct. Depending on memory speed and quantity, it could take a whole night to do 9.

I did 2 passes and found no errors… I’ll try more, but are there any other tests I could run?

I’m wondering if you may have misunderstood since all of your logs show that the system resumed from sleep. Does it also happen if you use the actual boot process that takes 15 seconds or whatever instead of 2 and the actual full shut down, and no sleep at all (closing the lid?). Perhaps you could capture a log if going without sleep for a few days isn’t too hard (see what I did there).

For example, in the October 3 log, the system went to deep sleep at 14:11, resumed from sleep at 21:44, then at 21:45 the usb errors started happening (the rfkill service also gets involved), then many more messages about that same locked up kernel workqueue followed by reset at 22:02. Out of curiosity, do you remember if the system was usable from 21:45 to 22:00 or was it already frozen?
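A sleep/wake timeline like the one above can be rebuilt from the kernel log. This is a rough sketch, assuming the kernel marks each transition with "PM: suspend entry (...)" and "PM: suspend exit"; the function name sleep_events is hypothetical.

```shell
# Hypothetical helper: keep only suspend/resume markers from kernel log text on stdin.
# Assumes the kernel logs "PM: suspend entry (...)" and "PM: suspend exit".
sleep_events() {
    grep -E 'PM: suspend (entry|exit)'
}

# Example: sleep/wake timeline for the previous boot, with ISO timestamps
#   sudo journalctl -k -b -1 -o short-iso | sleep_events
```

Comparing those timestamps with the first usb errors should show whether a freeze followed a resume or happened in a session that never slept.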

In addition to your other adventures, please open a terminal emulator (for example konsole) and post the output from lsusb (type it, hit Enter).

Does it also happen if you use the actual boot process that takes 15 seconds or whatever instead of 2 and the actual full shut down, and no sleep at all (closing the lid?)

Not sure tbh. I could try but I’m already keeping the system intentionally out of date because I can’t yet reproduce the issue so I can’t know whether or not updating to say 20241007 fixes it. The October 5 log was running 20241002, which is what I’m currently running now

For example, in the October 3 log, the system went to deep sleep at 14:11, resumed from sleep at 21:44, then at 21:45 the usb errors started happening (the rfkill service also gets involved), then many more messages about that same locked up kernel workqueue followed by reset at 22:02. Out of curiosity, do you remember if the system was usable from 21:45 to 22:00 or was it already frozen?

If I recall correctly it froze at about 21:52 and I had to force shutdown at 22:00

In addition to your other adventures, please open a terminal emulator (for example konsole) and post the output from lsusb (type it, hit Enter).

This is the output of lsusb:

elyg@Elys-Aspire-A715-42G:~> lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 04f3:0c4f Elan Microelectronics Corp. ELAN:Fingerprint
Bus 001 Device 003: ID 04ca:3802 Lite-On Technology Corp. Wireless_Device
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 002: ID 04f2:b72b Chicony Electronics Co., Ltd HD User Facing
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
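To dig further into one entry from that list, a sketch along these lines may be useful. The helper name usb_ids_matching is made up for illustration; it assumes the usual lsusb line layout, where the sixth field is the vendor:product ID.

```shell
# Hypothetical helper: print the vendor:product ID (6th field) of each lsusb line
# whose text contains $1, ignoring case. Reads lsusb output on stdin.
usb_ids_matching() {
    awk -v pat="$1" 'index(tolower($0), tolower(pat)) { print $6 }'
}

# Example:
#   lsusb | usb_ids_matching wireless   # prints 04ca:3802 for the output above
#   lsusb -t                            # tree view: port and driver for each device
#   sudo lsusb -v -d 04ca:3802          # full descriptors for that one device
```

The tree view is probably the quickest way to check which physical port (such as the 1-4 from the logs) a device sits on and which driver has claimed it.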

I also had this problem a while back. Check sudo journalctl -r for anything indicating that btrfs-cleaner is taking CPU time while your computer is frozen. If that is the case, follow this article to fix the problem: https://suse.com/support/kb/doc/?id=000020696

Running sudo journalctl -r | grep btrfs-cleaner produced only one line of output:

Oct 08 13:08:31 Elys-Aspire-A715-42G kernel: CPU: 14 UID: 0 PID: 672 Comm: btrfs-cleaner Tainted: P           OE      6.11.0-1-default #1 openSUSE Tumbleweed 461f7965cd54a3c599f269012cdb3d6ce81b3260

Which was after the freezing.

Do the same with snapper as well. I forget exactly what the message looks like, but it should say something about taking x seconds of CPU time, and that should correspond to the time your computer hangs. Just look through the logs and you’ll see it.

If possible, connect to the system with a second system (even, say, an android device with an ssh client on it) and run sudo journalctl -f to capture the output. Depending on the nature of the freeze, its cause might never get committed to disk, so looking at the logs after the fact may miss an important clue as to what’s going on.
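One way to set that up so the output is also saved on the second machine is sketched below. The function name capture_journal, the user@laptop target, and the freeze.log file name are all placeholders, and it assumes sshd is running on the laptop.

```shell
# Hypothetical helper: follow the journal on a remote machine over ssh and keep
# a local copy, so the last lines survive even if the remote disk never sees them.
#   $1 = ssh target (user@host), $2 = local file to write
capture_journal() {
    # -t gives sudo a terminal for its password prompt; tee shows and saves the stream.
    ssh -t "$1" 'sudo journalctl -f -o short-precise' | tee "$2"
}

# Example (placeholder user, host, and file name):
#   capture_journal user@laptop freeze.log
```

Because the stream passes through a terminal, the sudo prompt and some carriage returns will likely end up in the saved file too; that should not matter for reading the last few lines before a freeze.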

Nope; lots of entries but all the ones about taking x seconds of CPU time were under like 15 seconds 🫤

I can try that, but I actually haven’t had it come up since my last log. I also rolled back the system to 20241002 and kept it there in case it’s been fixed already. If it’s a firmware problem tho, I’m not sure whether firmware updates roll back.