Getting a lot of GPU resets

Since last week my GPU has been resetting frequently, and each time the whole desktop gets reinitialised.

I took a look at dmesg and can see that the GPU itself is resetting:

[ 4252.690429] [   T6406] BTRFS info (device dm-1): qgroup scan completed (inconsistency flag cleared)
[ 4819.233149] [   T5224] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
[ 4819.234412] [   T5224] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
[ 4819.234455] [   T5224] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ 4819.234456] [   T5224] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[ 4819.234457] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=611934, emitted seq=611936
[ 4819.234461] [   T5224] amdgpu 0000:03:00.0: amdgpu:  Process FreeCAD pid 3849 thread FreeCAD:cs0 pid 3853
[ 4819.234463] [   T5224] amdgpu 0000:03:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[ 4821.234763] [   T5224] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=RESET
[ 4821.234767] [   T5224] amdgpu 0000:03:00.0: amdgpu: failed to reset legacy queue
[ 4821.234769] [   T5224] amdgpu 0000:03:00.0: amdgpu: reset via MES failed and try pipe reset -110
[ 4821.234770] [   T5224] amdgpu 0000:03:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
[ 4821.234771] [   T5224] amdgpu 0000:03:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[ 4821.234773] [   T5224] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ 4823.388967] [   T5224] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 4823.388973] [   T5224] amdgpu 0000:03:00.0: amdgpu: failed to unmap legacy queue
[ 4823.580615] [   T5224] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 4823.684169] [   T5224] amdgpu 0000:03:00.0: amdgpu: MODE1 reset
[ 4823.684172] [   T5224] amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
[ 4823.684235] [   T5224] amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
[ 4824.185763] [   T5224] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 4824.185855] [   T5224] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 4824.185894] [   T5224] amdgpu 0000:03:00.0: amdgpu: VRAM is lost due to GPU reset!
[ 4824.185895] [   T5224] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[ 4824.263140] [   T5224] amdgpu 0000:03:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
[ 4824.507136] [   T5224] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 4824.507138] [   T5224] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 4824.507140] [   T5224] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[ 4824.507143] [   T5224] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x00505500 (80.85.0)
[ 4824.507146] [   T5224] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 4824.606685] [   T5224] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[ 4824.616183] [   T5224] amdgpu 0000:03:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x07002F00
[ 4824.686341] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 4824.686343] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 4824.686345] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 4824.686346] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 4824.686347] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 4824.686348] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 4824.686349] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 4824.686350] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 4824.686351] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 4824.686352] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 4824.686353] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 4824.686354] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 4824.686356] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ 4824.686357] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[ 4824.686358] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 4824.689592] [   T5224] amdgpu 0000:03:00.0: amdgpu: GPU reset(1) succeeded!
[ 4824.703339] [   T5224] amdgpu 0000:03:00.0: [drm] device wedged, but recovered through reset
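
To see how often this happens and which process gets blamed, the log can be filtered for the "ring ... timeout" line and the "Process ..." line that follows it. Below is a minimal sketch in Python, assuming the dmesg line format shown above. The function name `find_ring_timeouts` and the `SAMPLE_LOG` variable are made up for illustration, and the patterns may need adjusting for other kernel versions or for process names containing spaces.

```python
import re

# Matches the amdgpu ring-timeout line in the format seen in the log above.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*amdgpu: ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
# Matches the follow-up line naming the process that owned the hung job.
PROCESS_RE = re.compile(r"amdgpu:\s+Process (?P<name>\S+) pid (?P<pid>\d+)")


def find_ring_timeouts(log_text):
    """Return one dict per ring timeout, with the blamed process if logged."""
    events = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append({
                "time": float(m.group("ts")),
                "ring": m.group("ring"),
                "signaled": int(m.group("signaled")),
                "emitted": int(m.group("emitted")),
                "process": None,
                "pid": None,
            })
            continue
        m = PROCESS_RE.search(line)
        # Attach the process to the most recent timeout that has none yet.
        if m and events and events[-1]["process"] is None:
            events[-1]["process"] = m.group("name")
            events[-1]["pid"] = int(m.group("pid"))
    return events


# Two lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
[ 4819.234457] [   T5224] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=611934, emitted seq=611936
[ 4819.234461] [   T5224] amdgpu 0000:03:00.0: amdgpu:  Process FreeCAD pid 3849 thread FreeCAD:cs0 pid 3853
"""

for event in find_ring_timeouts(SAMPLE_LOG):
    print(event["time"], event["ring"], event["process"], event["pid"])
```

Feeding it the text of a full dmesg dump instead of `SAMPLE_LOG` should list every timeout in the current boot, which makes it easier to tell whether the resets cluster around one application or one ring.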

System specs:
CPU: 7800X3D
Motherboard: Asus 670E ProArt
RAM: 2x 32GiB DDR5 6000
GPU: AMD RX7800 XT (ASRock Challenger)

I am using Tumbleweed with KDE. It doesn’t matter if I use X11 or Wayland, the behaviour is the same.

This disrupts a lot of my work. I have a couple of projects in FreeCAD which take up to 45 minutes to load (not exaggerating), and I would like to avoid waiting close to an hour before I can work again. In addition, it is also a security threat, as the browser starts anew, leaving dangling login sessions on websites which cannot always be terminated.

It doesn’t only occur with FreeCAD, BTW. I also get these resets with Thunderbird at times.

Is there something I can do to make the resets go away?

If you boot an older kernel, or install and use a longterm kernel, is the behavior similarly bad?