System freezes irregularly during a game: amdgpu_job_timedout

I have the following problem: while playing a game (“Diablo IV” via Lutris), the system freezes completely at irregular intervals. Each freeze lasts about 10 to 20 seconds, after which the game continues to run without any problems.

At the time of the error I find the following in the log:

greenzack:~ # journalctl --system --since '2024-09-14 15:03' --until '2024-09-14 15:06'
Sep 14 15:04:01 greenzack kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
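
Since the freezes are irregular, it can help to pull every such line out of a longer journal and see when it fired and on which ring. The following is only a minimal sketch: it assumes the default short journalctl output format shown above, and the names `parse_timeout` and `find_timeouts` are made up for this illustration. Any wording after "timeout" other than "soft recovered" is simply reported as not soft recovered.

```python
import re
import sys
from typing import NamedTuple, Optional

# Matches kernel lines in the format quoted above, e.g.
# "Sep 14 15:04:01 greenzack kernel: [drm:amdgpu_job_timedout [amdgpu]]
#  *ERROR* ring gfx_0.0.0 timeout, but soft recovered" (one line in the journal).
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[drm:amdgpu_job_timedout \[amdgpu\]\]\s+\*ERROR\*\s+"
    r"ring (?P<ring>\S+) timeout(?P<rest>.*)$"
)


class RingTimeout(NamedTuple):
    stamp: str            # timestamp exactly as printed by journalctl
    ring: str             # ring name, e.g. "gfx_0.0.0"
    soft_recovered: bool  # True if the line says "soft recovered"


def parse_timeout(line: str) -> Optional[RingTimeout]:
    """Return the parsed timeout, or None if the line is something else."""
    m = TIMEOUT_RE.match(line.strip())
    if m is None:
        return None
    return RingTimeout(m["stamp"], m["ring"], "soft recovered" in m["rest"])


def find_timeouts(lines):
    """Collect all amdgpu ring timeouts from an iterable of journal lines."""
    return [hit for hit in map(parse_timeout, lines) if hit is not None]


if __name__ == "__main__":
    # Reads journal text from standard input and prints one line per timeout.
    for hit in find_timeouts(sys.stdin):
        state = "soft recovered" if hit.soft_recovered else "NOT soft recovered"
        print(hit.stamp, hit.ring, state)
```

Saved under a hypothetical name such as `amdgpu_timeouts.py`, it could be fed with `journalctl -k --since '2024-09-14' | python3 amdgpu_timeouts.py`. The timestamps can then be compared with the moments the game froze.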

Details about my system:

bremse@greenzack:~> inxi -bz
System:
  Kernel: 6.10.9-1-default arch: x86_64 bits: 64
  Desktop: KDE Plasma v: 6.1.5 Distro: openSUSE Tumbleweed 20240912
Machine:
  Type: Desktop System: Micro-Star product: MS-7E16 v: 1.0
    serial: <superuser required>
  Mobo: Micro-Star model: X670E GAMING PLUS WIFI (MS-7E16) v: 1.0
    serial: <superuser required> UEFI: American Megatrends LLC. v: 1.60
    date: 06/19/2024
CPU:
  Info: 12-core AMD Ryzen 9 7900X3D [MT MCP] speed (MHz): avg: 545
    min/max: 545/5660
Graphics:
  Device-1: Advanced Micro Devices [AMD/ATI] Navi 32 [Radeon RX 7700 XT /
    7800 XT] driver: amdgpu v: kernel
  Device-2: Advanced Micro Devices [AMD/ATI] Raphael driver: amdgpu
    v: kernel
  Display: x11 server: X.Org v: 21.1.12 with: Xwayland v: 24.1.2 driver: X:
    loaded: modesetting unloaded: fbdev,vesa dri: radeonsi gpu: amdgpu
    resolution: 3840x2160~60Hz
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.1.3 renderer: AMD
    Radeon RX 7800 XT (radeonsi navi32 LLVM 18.1.8 DRM 3.57 6.10.9-1-default)
Network:
  Device-1: Realtek RTL8125 2.5GbE driver: r8169
  Device-2: MEDIATEK MT7922 802.11ax PCI Express Wireless Network Adapter
    driver: mt7921e
Drives:
  Local Storage: total: 1.82 TiB used: 145.48 GiB (7.8%)
Info:
  Memory: total: 32 GiB note: est. available: 30.45 GiB used: 3.87 GiB (12.7%)
  Processes: 472 Uptime: 0h 15m Shell: Bash inxi: 3.3.36

I have found several posts about this error here in the forum, but none of them seems to describe “my error” exactly.

Disable the built-in GPU, reboot, then post the output of:

inxi -aGMSz

I’d try updating Mesa and the kernel (and with it AMDGPU) to bleeding-edge versions, and DXVK to master, to see whether whatever is triggering the crash/error has been improved since.

Alternatively, disable features via the RADV/VKD3D/etc. environment variables, in case a specific instruction is causing the issue.
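
One way to work through such toggles is to start the game with exactly one of them set and see whether the freeze still shows up. The sketch below is an illustration only. `RADV_DEBUG` is the debug variable read by Mesa's RADV driver, and `nodcc`, `nohiz` and `zerovram` are values documented by Mesa, but picking these three is an assumption and not a known fix for this freeze. The helper names `env_for` and `run_with` are hypothetical, and toggles for vkd3d-proton could be added to the table in the same way.

```python
import os
import subprocess
import sys

# Candidate toggles to try one at a time. The selection is an assumption
# made for illustration, not a list of known fixes for this error.
CANDIDATES = {
    "nodcc": {"RADV_DEBUG": "nodcc"},        # disable delta color compression
    "nohiz": {"RADV_DEBUG": "nohiz"},        # disable HiZ for depth/stencil images
    "zerovram": {"RADV_DEBUG": "zerovram"},  # zero-initialize VRAM allocations
}


def env_for(name, base=None):
    """Return a copy of the environment with one candidate toggle applied."""
    env = dict(os.environ if base is None else base)
    for key, value in CANDIDATES[name].items():
        # Append to an existing comma-separated list instead of overwriting it.
        env[key] = f"{env[key]},{value}" if env.get(key) else value
    return env


def run_with(name, command):
    """Run the command with the toggle applied and return its exit code."""
    return subprocess.run(command, env=env_for(name)).returncode


if __name__ == "__main__":
    # Usage: python3 try_toggle.py <toggle-name> <command> [args...]
    sys.exit(run_with(sys.argv[1], sys.argv[2:]))
```

Testing one toggle per session keeps the result interpretable: if the freeze disappears with a single setting, that narrows down where to look.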

Beyond that, you’ll have to figure out what specifically is causing the crash: possibly bisect, try things to eliminate that specific crash, and have fun with that rabbit hole. Generally speaking, amdgpu_job_timedout could mean anything in AMDGPU’s stack, so good luck.

Built-in GPU disabled and system rebooted. Output of inxi -aGMSz:

System:
  Kernel: 6.10.9-1-default arch: x86_64 bits: 64 compiler: gcc v: 14.2.0
    clocksource: tsc avail: hpet,acpi_pm
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.10.9-1-default
    root=UUID=f3477739-a3b9-41f6-9305-1812d2b68c14 splash=silent
    resume=/dev/disk/by-uuid/525a40ac-1a90-4f1b-8ba3-684767c2a182
    mitigations=auto quiet security=apparmor
  Desktop: KDE Plasma v: 6.1.5 tk: Qt v: N/A info: frameworks v: 6.6.0
    wm: kwin_x11 with: krunner tools: avail: xscreensaver vt: 2 dm: SDDM
    Distro: openSUSE Tumbleweed 20240916
Machine:
  Type: Desktop System: Micro-Star product: MS-7E16 v: 1.0
    serial: <superuser required>
  Mobo: Micro-Star model: X670E GAMING PLUS WIFI (MS-7E16) v: 1.0
    serial: <superuser required> uuid: <superuser required> UEFI: American
    Megatrends LLC. v: 1.60 date: 06/19/2024
Graphics:
  Device-1: Advanced Micro Devices [AMD/ATI] Navi 32 [Radeon RX 7700 XT /
    7800 XT] vendor: Tul / PowerColor driver: amdgpu v: kernel arch: RDNA-3
    code: Navi-3x process: TSMC n5 (5nm) built: 2022+ pcie: gen: 4
    speed: 16 GT/s lanes: 16 ports: active: HDMI-A-1 empty: DP-1, DP-2, DP-3,
    Writeback-1 bus-ID: 03:00.0 chip-ID: 1002:747e class-ID: 0300
  Display: x11 server: X.Org v: 21.1.12 with: Xwayland v: 24.1.2
    compositor: kwin_x11 driver: X: loaded: modesetting unloaded: fbdev,vesa
    dri: radeonsi gpu: amdgpu display-ID: :0 screens: 1
  Screen-1: 0 s-res: 3840x2160 s-dpi: 96 s-size: 1016x571mm (40.00x22.48")
    s-diag: 1165mm (45.88")
  Monitor-1: HDMI-A-1 mapped: HDMI-1 model: ASUS VP28U serial: <filter>
    built: 2019 res: 3840x2160 hz: 60 dpi: 157 gamma: 1.2
    size: 621x341mm (24.45x13.43") diag: 708mm (27.9") ratio: 16:9 modes:
    max: 3840x2160 min: 640x350
  API: EGL v: 1.5 hw: drv: amd radeonsi platforms: device: 0 drv: radeonsi
    device: 1 drv: swrast gbm: drv: kms_swrast surfaceless: drv: radeonsi x11:
    drv: radeonsi inactive: wayland
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.1.7 glx-v: 1.4
    direct-render: yes renderer: AMD Radeon RX 7800 XT (radeonsi navi32 LLVM
    18.1.8 DRM 3.57 6.10.9-1-default) device-ID: 1002:747e memory: 15.62 GiB
    unified: no
  API: Vulkan v: 1.3.290 layers: 8 device: 0 type: discrete-gpu name: AMD
    Radeon RX 7800 XT (RADV NAVI32) driver: N/A device-ID: 1002:747e
    surfaces: xcb,xlib

Do you think this has already fixed the error?

As I wrote, the freeze occurs irregularly. I probably won’t be able to test it again until the weekend.

Why modesetting? Why not amdgpu?
Something in config files?

To be honest, I have no idea. I haven’t explicitly configured it that way and I wouldn’t know how to change modesetting to amdgpu.

But: I have now played for several hours under Wayland instead of X11, and the error has not occurred so far. Maybe it is an X11 peculiarity? Anyway, I can live with using Wayland instead of X11.