Since last week my GPU has been resetting intermittently, and each time the whole desktop session is reinitialised.
I looked in dmesg and can see that the GPU itself is resetting:
[ 4252.690429] [ T6406] BTRFS info (device dm-1): qgroup scan completed (inconsistency flag cleared)
[ 4819.233149] [ T5224] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
[ 4819.234412] [ T5224] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
[ 4819.234455] [ T5224] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ 4819.234456] [ T5224] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[ 4819.234457] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=611934, emitted seq=611936
[ 4819.234461] [ T5224] amdgpu 0000:03:00.0: amdgpu: Process FreeCAD pid 3849 thread FreeCAD:cs0 pid 3853
[ 4819.234463] [ T5224] amdgpu 0000:03:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[ 4821.234763] [ T5224] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=RESET
[ 4821.234767] [ T5224] amdgpu 0000:03:00.0: amdgpu: failed to reset legacy queue
[ 4821.234769] [ T5224] amdgpu 0000:03:00.0: amdgpu: reset via MES failed and try pipe reset -110
[ 4821.234770] [ T5224] amdgpu 0000:03:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
[ 4821.234771] [ T5224] amdgpu 0000:03:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[ 4821.234773] [ T5224] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ 4823.388967] [ T5224] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 4823.388973] [ T5224] amdgpu 0000:03:00.0: amdgpu: failed to unmap legacy queue
[ 4823.580615] [ T5224] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 4823.684169] [ T5224] amdgpu 0000:03:00.0: amdgpu: MODE1 reset
[ 4823.684172] [ T5224] amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
[ 4823.684235] [ T5224] amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
[ 4824.185763] [ T5224] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 4824.185855] [ T5224] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 4824.185894] [ T5224] amdgpu 0000:03:00.0: amdgpu: VRAM is lost due to GPU reset!
[ 4824.185895] [ T5224] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[ 4824.263140] [ T5224] amdgpu 0000:03:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
[ 4824.507136] [ T5224] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 4824.507138] [ T5224] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 4824.507140] [ T5224] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[ 4824.507143] [ T5224] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x00505500 (80.85.0)
[ 4824.507146] [ T5224] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 4824.606685] [ T5224] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[ 4824.616183] [ T5224] amdgpu 0000:03:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x07002F00
[ 4824.686341] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 4824.686343] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 4824.686345] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 4824.686346] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 4824.686347] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 4824.686348] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 4824.686349] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 4824.686350] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 4824.686351] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 4824.686352] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 4824.686353] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 4824.686354] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 4824.686356] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ 4824.686357] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[ 4824.686358] [ T5224] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 4824.689592] [ T5224] amdgpu 0000:03:00.0: amdgpu: GPU reset(1) succeeded!
[ 4824.703339] [ T5224] amdgpu 0000:03:00.0: [drm] device wedged, but recovered through reset
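To see which process is named each time the ring times out, the saved dmesg output could be scanned with something like the sketch below. It only assumes the two message formats visible in the log above (the "ring ... timeout" line followed by the "Process ... pid ..." line). The function name `summarize_resets` is made up for illustration, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches e.g. "amdgpu: ring gfx_0.0.0 timeout, signaled seq=611934, emitted seq=611936"
TIMEOUT_RE = re.compile(
    r"amdgpu: ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)
# Matches e.g. "amdgpu: Process FreeCAD pid 3849 thread FreeCAD:cs0 pid 3853"
PROCESS_RE = re.compile(r"amdgpu: Process (.+?) pid (\d+) thread (.+?) pid (\d+)")


def summarize_resets(log_text):
    """Return one (ring, process_name) pair per ring timeout in the log.

    process_name is None if no "Process ..." line follows the timeout
    before the next timeout or the end of the log.
    """
    events = []
    pending_ring = None  # ring of a timeout still waiting for its process line
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            if pending_ring is not None:
                events.append((pending_ring, None))
            pending_ring = match.group(1)
            continue
        match = PROCESS_RE.search(line)
        if match and pending_ring is not None:
            events.append((pending_ring, match.group(1)))
            pending_ring = None
    if pending_ring is not None:
        events.append((pending_ring, None))
    return events


def count_by_process(events):
    """Tally how often each process name appears in the timeout events."""
    return Counter(process for _ring, process in events)
```

With the log saved to a file (for example via `dmesg > dmesg.txt`), passing the file's contents to `summarize_resets` and the result to `count_by_process` would show whether the timeouts cluster on one application or are spread across several.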
System specs:
CPU: 7800X3D
Motherboard: Asus 670E ProArt
RAM: 2x 32GiB DDR5 6000
GPU: AMD Radeon RX 7800 XT (ASRock Challenger)
I am using Tumbleweed with KDE. The behaviour is the same under both X11 and Wayland.
This disrupts a lot of my work. I have a couple of projects in FreeCAD which take up to 45 minutes to load (not exaggerating), and I would like to avoid having to wait close to an hour before I can work again. In addition, it is a security concern: the browser starts anew after a reset, leaving dangling login sessions on websites, and these cannot always be terminated.
It doesn’t only occur with FreeCAD, BTW. I also get these resets with Thunderbird at times.
Is there something I can do to make the resets go away?