System freeze/crash, no idea what's causing it

hi everyone,
for a few days now i’ve been experiencing freezes/crashes on my system and i have no idea what causes them. after a few hours (around 10 hours of uptime) the whole system freezes and stays unresponsive; if any sound was playing, the last 1-2s keep looping.
i’ve checked the journal from the previous boot with journalctl but couldn’t find any kernel error.
i had no problems before a system update, but i can’t remember which update triggered this behavior.

however, i’ve noticed that every time the system crashed i had a youtube video playing. no idea if it’s related since i keep YT open most of the time, but as far as i can remember it has never crashed while no video was playing. i’ve also been getting audio desyncs while watching youtube; pausing/resuming fixes them, but it’s really annoying.

i would appreciate it if anyone could help, cause i can’t find anything online about this problem and it feels like i’m the only one having this issue.

One thing you could try, since it’s easy enough (if you haven’t already).

Create a new user account, log out of your normal user account, log in as the new user, and duplicate your situation… any difference?

Also, you didn’t mention your system info and last update - TW recently had a huge update … the latest is 20240208.

creating a new account and trying to trigger the problem would probably take a while, since it doesn’t even happen every day: the computer can stay stable for over 20 hours with no problem, then crash the next day.

and for the system info :

What browser? What desktop?

Edit: Never mind desktop, I see that in your screenshot. But, browser.

firefox 122.0

Can you reproduce booted to an older kernel?

Can you reproduce in an X11 session instead of Wayland?

Can you reproduce using an IceWM session instead of Plasma?

To clarify, which of the following is happening:

  1. Is the app freezing for a few seconds and coming back to normal?
  2. Is the app freezing and crashing?
  3. Is the OS freezing and crashing / kernel panicking?

For [1] you can leave a trace running in another terminal or screen session.

For [2] you can look at the coredump and for [3] you can configure kdump.

Can you leave the KDE System Monitor running somewhere with the History tab open?

Keep it visible so that when the freeze/crash happens you can see what the memory/CPU usage looks like.
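
Since a hard freeze may leave nothing on screen to read afterwards, a rough alternative is to log the same numbers to a file. The sketch below appends memory, swap and load samples and syncs each line to disk, so the last samples before a freeze should survive the reset. The 5-second interval, the function names and the log file argument are made up for illustration; the script only reads /proc/meminfo and /proc/loadavg.

```python
#!/usr/bin/env python3
"""Append memory/swap/load samples to a file, syncing each one to disk."""
import os
import sys
import time
from datetime import datetime

INTERVAL_S = 5  # seconds between samples (arbitrary choice)


def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integers (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def format_sample(meminfo, loadavg_text, now):
    """Build one log line from parsed meminfo and the raw /proc/loadavg text."""
    mem_used = meminfo["MemTotal"] - meminfo["MemAvailable"]
    swap_used = meminfo["SwapTotal"] - meminfo["SwapFree"]
    load1 = loadavg_text.split()[0]  # 1-minute load average
    return (f"{now.isoformat(timespec='seconds')} "
            f"mem_used_kB={mem_used} swap_used_kB={swap_used} load1={load1}")


def main(log_path):
    with open(log_path, "a") as log:
        while True:
            with open("/proc/meminfo") as f:
                meminfo = parse_meminfo(f.read())
            with open("/proc/loadavg") as f:
                loadavg = f.read()
            log.write(format_sample(meminfo, loadavg, datetime.now()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before sleeping
            time.sleep(INTERVAL_S)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: freeze_monitor.py LOGFILE")
```

Start it in a terminal or screen session with a log file path as the only argument, and after the next freeze check the tail of that file: if memory and swap were nowhere near full in the last lines, running out of RAM is unlikely to be the cause.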

the entire system freezes and stops responding to commands, hitting ctrl-alt-f1 doesn’t do anything
i don’t remember this happening without firefox open & playing a video, but like i said earlier it could be totally unrelated cause firefox is open all the time

i can try, but what should i do once it crashes?
i have 32GB of RAM and 38GB of swap, i guess it’ll take a while to fill the entire ram & crash

@kyral Hi, what does the output from sensors or the likes of nvtop show? It could also be thermal related, with your system or GPUs.

the thermals are fairly normal, they rarely go beyond 70°C, and the system crashes only happen on low/medium loads

@kyral which GPU is in use at the time, are you using Prime Render Offload with the AMD GPU? No drm or pcie errors in the journal?

the NVIDIA GPU uses the vfio-pci driver so i don’t have any nvidia package installed, i only use the AMD GPU most of the time since the nvidia is reserved for virtual machines
no drm/pci errors afaik
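
One way to double-check this instead of relying on memory is to save the kernel messages of the crashed boot to a file (for example with `journalctl -b -1 -k --no-pager > prev-boot.log`) and filter that file for GPU/PCIe related lines. The snippet below is only a sketch: the keyword list is a guess at what such messages tend to contain, and the function name and the file name are hypothetical.

```python
#!/usr/bin/env python3
"""Print lines from a saved kernel log that mention GPU or PCIe keywords."""
import re
import sys

# Assumed keywords; extend the list if your log uses other wording.
PATTERN = re.compile(r"amdgpu|\bdrm\b|pcie|\baer\b", re.IGNORECASE)


def suspicious_lines(lines):
    """Return the lines (without trailing newline) that match PATTERN."""
    return [ln.rstrip("\n") for ln in lines if PATTERN.search(ln)]


if __name__ == "__main__":
    if len(sys.argv) == 2:
        with open(sys.argv[1], errors="replace") as f:
            for hit in suspicious_lines(f):
                print(hit)
    else:
        print("usage: gpu_log_filter.py SAVED_LOG")
```

An empty result does not prove the GPU is innocent, since a hard freeze can happen before anything reaches the disk, but any hit just before the last timestamp is worth posting.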

Even if you didn’t see anything in the journal… why not upload one with a crash?
Others may be able to spot something you overlooked…

@kyral ahh ok :smile: so it could be the amdgpu driver. Are you using any amdgpu options on boot, e.g. a feature mask to unlock things? You might try disabling the display core with the kernel option amdgpu.dc=0 in grub; if you’re using HDMI sound, that will disappear.

good idea, here’s a log from the latest crash openSUSE Paste

i use one option to unlock the voltage control so i could undervolt the gpu

GRUB_CMDLINE_LINUX_DEFAULT="mitigations=auto amdgpu.ppfeaturemask=0xffffffff amd_iommu=on iommu=pt security=apparmor resume=UUID=eafd5d7c-08e8-46db-8e3d-ed192fb30d8f resume_offset=23747122"

i’ve been undervolting the GPU for months and it has been very stable so far at -85mV (originally -95mV, but i went back to -85mV to ensure stability)

i don’t use hdmi sound

@kyral ok, so if you remove that option temporarily, does the system still crash?

The full unlocked featureset is known to be unstable.
See AMDGPU - ArchWiki

So I would go with Malcolm’s advice and remove the parameter to see if the system still crashes…
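
For reference, removing the parameter just means dropping one token from the GRUB_CMDLINE_LINUX_DEFAULT string and leaving everything else alone. A minimal sketch of that edit, using a shortened version of the command line posted above (the helper name `drop_param` is made up for illustration):

```python
def drop_param(cmdline, name):
    """Return cmdline without any token that is `name` or `name=value`."""
    kept = [tok for tok in cmdline.split()
            if tok != name and not tok.startswith(name + "=")]
    return " ".join(kept)


# Shortened example; the other options stay untouched.
before = "mitigations=auto amdgpu.ppfeaturemask=0xffffffff amd_iommu=on iommu=pt"
print(drop_param(before, "amdgpu.ppfeaturemask"))
# prints: mitigations=auto amd_iommu=on iommu=pt
```

In practice the same edit can be made by hand in /etc/default/grub, after which the GRUB configuration has to be regenerated (on openSUSE typically with `grub2-mkconfig -o /boot/grub2/grub.cfg`) and the machine rebooted. Keep in mind that without the feature mask the undervolt will most likely no longer apply, so the test covers both at once.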