AMDGPU errors and occasional crashes/hangs

broadstairs · May 9, 2021, 12:17pm

I am running Tumbleweed with KDE on an AMD Ryzen 5 3400G with Vega graphics. Recently quite a few errors have been showing in the journal, which I did not notice at first because they caused no visible issue. There were several like this over several days:


amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg.bin pid 1552 thread Xorg.bin:cs0 pid 1561)
amdgpu:   in page starting at address 0x800101686000 from client 27
amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
amdgpu:          MORE_FAULTS: 0x1
amdgpu:          WALKER_ERROR: 0x0
amdgpu:          PERMISSION_FAULTS: 0x3
amdgpu:          MAPPING_ERROR: 0x0
amdgpu:          RW: 0x0

Then on a couple of occasions I have had a system crash/hang, which made me look at the journal, where I saw those errors plus these new ones:


[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=428
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 
amdgpu 0000:2a:00.0: amdgpu: GPU reset begin!
amdgpu 0000:2a:00.0: amdgpu: MODE2 reset
amdgpu 0000:2a:00.0: amdgpu: GPU reset succeeded, trying to resume
amdgpu 0000:2a:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:2a:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:2a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not avail
amdgpu 0000:2a:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
amdgpu 0000:2a:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
amdgpu 0000:2a:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
amdgpu 0000:2a:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
amdgpu 0000:2a:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
amdgpu 0000:2a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
amdgpu 0000:2a:00.0: amdgpu: recover vram bo from shadow start
amdgpu 0000:2a:00.0: amdgpu: recover vram bo from shadow done
amdgpu 0000:2a:00.0: amdgpu: GPU reset(1) succeeded!
amdgpu 0000:2a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=
amdgpu 0000:2a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=18605
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmas
amdgpu 0000:2a:00.0: amdgpu: GPU reset begin!
amdgpu 0000:2a:00.0: amdgpu: MODE2 reset
amdgpu 0000:2a:00.0: amdgpu: GPU reset succeeded, trying to resume
amdgpu 0000:2a:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:2a:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:2a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not avail
amdgpu 0000:2a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 
[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_
amdgpu 0000:2a:00.0: amdgpu: GPU reset(3) failed
amdgpu 0000:2a:00.0: amdgpu: GPU reset end with ret = -110
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

at which point I had to hard reset the system to reboot it.
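Most of the dump above is one pair of messages repeating, so a per-message count can make the sequence easier to read. The following is only a sketch: `summarize_gpu_errors` is a made-up helper name, and reading the previous boot with `-b -1` assumes the journal is stored persistently.

```shell
#!/bin/sh
# Sketch: count how often each amdgpu/drm kernel message occurs.
# summarize_gpu_errors is a hypothetical helper name, not an existing tool.
summarize_gpu_errors() {
    # Keep only amdgpu and [drm... lines, then count identical messages,
    # most frequent first.
    grep -E 'amdgpu|\[drm' | sort | uniq -c | sort -rn
}

# Possible use after a hard reset (assumes a persistent journal):
#   -k        kernel messages only
#   -b -1     the boot before the current one
#   -o cat    print the message text without timestamps
#   journalctl -k -b -1 -o cat --no-pager | summarize_gpu_errors
```

Counting loses the ordering, so the raw log is still needed to see which timeout came first.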

Does anyone have any ideas as to what this could be, please?

Stuart

karlmistelberger · May 9, 2021, 12:50pm

No such issues here with current system:

3400G:~ # inxi -SCMmGz
System:    Kernel: 5.12.0-2-default x86_64 bits: 64 Console: tty pts/1 Distro: openSUSE Tumbleweed 20210507
Machine:   Type: Desktop Mobo: ASUSTeK model: PRIME B450-PLUS v: Rev X.0x serial: <filter> UEFI: American Megatrends v: 2409
           date: 12/02/2020
Memory:    RAM: total: 29.27 GiB used: 3.84 GiB (13.1%)
           Array-1: capacity: 128 GiB slots: 4 EC: None
           Device-1: DIMM_A1 size: No Module Installed
           Device-2: DIMM_A2 size: 16 GiB speed: 2133 MT/s
           Device-3: DIMM_B1 size: No Module Installed
           Device-4: DIMM_B2 size: 16 GiB speed: 2133 MT/s
CPU:       Info: Quad Core model: AMD Ryzen 5 3400G with Radeon Vega Graphics bits: 64 type: MT MCP cache: L2: 2 MiB
           Speed: 1246 MHz min/max: 1400/3700 MHz Core speeds (MHz): 1: 1246 2: 1337 3: 1255 4: 1313 5: 1255 6: 1258 7: 1356
           8: 1259
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Picasso driver: amdgpu v: kernel
           Display: server: X.Org 1.20.11 driver: loaded: amdgpu,ati unloaded: fbdev,modesetting,vesa
           resolution: 1920x1200~60Hz
           OpenGL: renderer: AMD Radeon Vega 11 Graphics (RAVEN DRM 3.40.0 5.12.0-2-default LLVM 12.0.0) v: 4.6 Mesa 21.0.2
3400G:~ #

Run all of the tests in the free download of MemTest86.

broadstairs · May 9, 2021, 4:13pm

Well, MemTest86 ran for about 1 hour 30 minutes before I stopped it, and it reported no errors. Just for completeness, I ran that command:


inxi -SCMmGz
System:    Kernel: 5.12.0-2-default x86_64 bits: 64 Desktop: KDE Plasma 5.21.5 Distro: openSUSE Tumbleweed 20210507 
Machine:   Type: Desktop Mobo: Micro-Star model: B450-A PRO MAX (MS-7B86) v: 4.0 serial: <filter> UEFI: American Megatrends 
           v: M.70 date: 06/17/2020 
Memory:    RAM: total: 29.31 GiB used: 2.52 GiB (8.6%) 
           Array-1: capacity: 128 GiB slots: 4 EC: None 
           Device-1: DIMM 0 size: 8 GiB speed: 2133 MT/s 
           Device-2: DIMM 1 size: 8 GiB speed: 2133 MT/s 
           Device-3: DIMM 0 size: 8 GiB speed: 2133 MT/s 
           Device-4: DIMM 1 size: 8 GiB speed: 2133 MT/s 
CPU:       Info: Quad Core model: AMD Ryzen 5 3400G with Radeon Vega Graphics bits: 64 type: MT MCP cache: L2: 2 MiB 
           Speed: 3700 MHz min/max: N/A Core speeds (MHz): 1: 3700 2: 3700 3: 3700 4: 3700 5: 3700 6: 1279 7: 3700 8: 3700 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Picasso driver: amdgpu v: kernel 
           Device-2: Logitech QuickCam E 3500 type: USB driver: snd-usb-audio,uvcvideo 
           Display: x11 server: X.org 1.20.11 driver: loaded: amdgpu,ati unloaded: fbdev,modesetting,vesa 
           resolution: <missing: xdpyinfo> 
           OpenGL: renderer: AMD Radeon Vega 11 Graphics (RAVEN DRM 3.40.0 5.12.0-2-default LLVM 12.0.0) v: 4.6 Mesa 21.0.2

Stuart

broadstairs · May 9, 2021, 4:21pm

Still seeing


[gfxhub0] retry page fault (src_id:0 ring:0 vmid:7 pasid:32772, for process plasmashell pid 2027 thread plasmashel:cs0 pid 2089)
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:   in page starting at address 0x800105089000 from client 27
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00741051
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          MORE_FAULTS: 0x1
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          WALKER_ERROR: 0x0
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          MAPPING_ERROR: 0x0
May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          RW: 0x1

on reboot after MemTest86.
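The indented lines in these fault dumps are the driver's own breakdown of the VM_L2_PROTECTION_FAULT_STATUS word, so the two words quoted in this thread (0x00101031 and 0x00741051) can be decoded by hand for comparison. The sketch below assumes a bit layout inferred from those two dumps rather than taken from documentation, and the helper name `decode_fault_status` and the `CLIENT_ID` label are made up for illustration.

```shell
#!/bin/sh
# Sketch: split an amdgpu VM_L2_PROTECTION_FAULT_STATUS word into fields.
# ASSUMPTION: the bit positions below are inferred from the two fault
# dumps in this thread; they are not taken from hardware documentation.
# decode_fault_status is a hypothetical helper name.
decode_fault_status() {
    status=$(( $1 ))   # accepts hex such as 0x00101031
    printf 'MORE_FAULTS: 0x%x\n'       $(( status & 0x1 ))
    printf 'WALKER_ERROR: 0x%x\n'      $(( (status >> 1) & 0x7 ))
    printf 'PERMISSION_FAULTS: 0x%x\n' $(( (status >> 4) & 0xf ))
    printf 'MAPPING_ERROR: 0x%x\n'     $(( (status >> 8) & 0x1 ))
    printf 'CLIENT_ID: 0x%x\n'         $(( (status >> 9) & 0x1ff ))
    printf 'RW: 0x%x\n'                $(( (status >> 18) & 0x1 ))
    printf 'VMID: %d\n'                $(( (status >> 20) & 0xf ))
}

decode_fault_status 0x00101031
decode_fault_status 0x00741051
```

Under that assumed layout, both words reproduce the values the driver printed. The two faults differ in PERMISSION_FAULTS (0x3 against 0x5), RW (0x0 against 0x1) and vmid (1 for Xorg.bin, 7 for plasmashell), while the client ID is 0x8 in both.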

Am I correct in assuming the fault could lie within the memory reserved for graphics, which I’m guessing MemTest86 does not touch?

Stuart

malcolmlewis · May 9, 2021, 4:47pm

Hi
Similar issue here: https://bugs.freedesktop.org/show_bug.cgi?id=104192

Likely a regression of some sort in amdgpu. Create a bug report against the kernel; see openSUSE:Submitting bug reports.

broadstairs · May 9, 2021, 7:02pm

OK, I’ll do that. The errors started on 26th April, and kernel-firmware-amdgpu 20210419 was installed the day before. I have no idea if that was the cause, but it seems a bit of a coincidence.
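A timing correlation like this can be checked against the package manager's own records. The sketch below assumes zypper's history log at /var/log/zypp/history uses a pipe-separated layout beginning with date, action, name and version; `pkg_install_dates` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list install events for one package from a zypper history log.
# ASSUMPTION: the log uses the pipe-separated layout
#   date time|action|name|version|arch|...
# pkg_install_dates is a hypothetical helper name.
pkg_install_dates() {
    # $1 = package name, $2 = history file (defaults to zypper's log)
    awk -F'|' -v pkg="$1" \
        '$2 == "install" && $3 == pkg { print $1, $4 }' \
        "${2:-/var/log/zypp/history}"
}

# Possible use (reading the log may need root):
#   pkg_install_dates kernel-firmware-amdgpu
# rpm can report the install time of the currently installed version:
#   rpm -q --last kernel-firmware-amdgpu
```

The dates it prints can then be compared with the timestamp of the first "retry page fault" line in the journal.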

Stuart

karlmistelberger · May 9, 2021, 8:22pm

Did you run all 13 tests including the hammer test? I found several systems successfully ran for hours, but eventually failed while running #13 hammer test.

broadstairs · May 9, 2021, 8:38pm

Yes it ran everything once and was up to test 5 on the second run.

Stuart

karlmistelberger · May 9, 2021, 9:31pm

Be patient and run it overnight. Sometimes replacing RAM modules helps, even if no errors are reported by memtest. I have never seen any of your errors on my system. To me it looks like a hardware problem. Disconnect all components and reconnect only what is needed. Temporarily omit optional components, such as all but one RAM module. Check temperatures, reset EFI to optimized defaults …

Svyatko · May 10, 2021, 5:18pm

, and update BIOS.

broadstairs · May 10, 2021, 6:23pm

The BIOS is the latest. Rather than run Memtest overnight, as I have 4 sticks of memory I’ll swap them over (in pairs) and see if the error shows up; I don’t believe Memtest can test shared graphics memory, as it’s reserved in the BIOS. Once swapped, I’ll run Memtest again. Currently the error only shows after booting up, and the system has not crashed since.

Stuart

broadstairs · May 11, 2021, 4:22pm

Well, I have swapped the memory round and re-run MemTest86 with no errors, and then ran the hammer test separately twice, again with no errors reported. On reboot I looked at the journal, and the errors I was getting do not show up. The only thing I can think of is that removing and reseating the memory has maybe fixed it.

I’ll watch it for a few days and see if the errors come back.

Stuart

Svyatko · May 11, 2021, 5:10pm

Possibly dirty contacts on the memory modules.
You can clean the contacts with a rubber (eraser) or alcohol.

broadstairs · May 11, 2021, 11:10pm

I spoke too soon. A few minutes after booting I saw the errors again, but only a few and none since. I have checked on the BIOS and there is an update, which I will load tomorrow to see what happens.

Stuart

karlmistelberger · May 12, 2021, 7:48am

Clean memory slots and camera lenses with 99.9% isopropanol. Use interdental brushes for thorough cleaning.

broadstairs · May 13, 2021, 11:47pm

On further investigation I have found reports elsewhere of issues like mine happening randomly, where adding iommu=pt to the kernel boot options fixed the problem for others. I tried this today, and so far there has been no recurrence of any errors since I booted with this option. Somewhere else it was suggested to add idle=nomwait, but I’ve not tried that yet.
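Before judging whether the option helps, it is worth confirming that it actually reached the running kernel. The sketch below reads /proc/cmdline and looks for the parameter as a whole word; `has_kernel_param` is a hypothetical helper name. The comments on making the option permanent assume a standard openSUSE GRUB2 setup.

```shell
#!/bin/sh
# Sketch: check whether a boot parameter is on the kernel command line.
# has_kernel_param is a hypothetical helper name.
has_kernel_param() {
    # $1 = parameter (e.g. iommu=pt), $2 = command line text.
    # Padding with spaces makes the match whole-word only.
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# /proc/cmdline holds the command line the running kernel was booted with.
if [ -r /proc/cmdline ] && has_kernel_param iommu=pt "$(cat /proc/cmdline)"; then
    echo "iommu=pt is active"
else
    echo "iommu=pt is not on the running kernel's command line"
fi

# To keep the option across reboots (assuming a standard GRUB2 setup):
#   1. add iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
#   2. grub2-mkconfig -o /boot/grub2/grub.cfg
```

The same function works for idle=nomwait if that option is tried later.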

Searching here, I found there was a fix to the kernel on 19th April 2021 which addressed some problems with amdgpu. My problems only started after this, around 26th April, which is probably when I installed that update. Is it possible that this update has a bearing on my problem? It is also interesting that I had the DRM option enabled in my browser to play DRM content; if I remember correctly, this was around the time I had the hang/crash. I subsequently turned off that browser option and it has not crashed since.

Anyway I’ll see how it goes over the next few days.

Stuart

bauermann · May 14, 2021, 3:24am

I have a Radeon Vega Picasso integrated GPU and recently started seeing these “retry page fault” errors (but not the GPU resets or the other errors in your original post), and also screen and keyboard freezes in some of the instances where these errors happen.

I’m running Ubuntu, but I dropped by to say that in my case the problem turned out to be the version of the linux-firmware package. I reverted from version 1.197 to version 1.190.5 and things are back to the previous level of stability.

So you might want to try downgrading your linux-firmware package as well.

Details of my issue are here, if you’re curious: https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/1928393

broadstairs · May 14, 2021, 10:27am

Thanks, that’s very interesting. I think I will now open a bug with openSUSE and see what they have to say. I will continue to monitor, and if the problem does reappear I will downgrade the firmware package.

Stuart

broadstairs · May 14, 2021, 11:42am

Because of packaging differences, I am unable to find an openSUSE package for kernel-firmware-all older than the April 2021 one, which I think introduced the issue. So for now I cannot try downgrading the firmware, unless someone has access to an older package they can point me to.

Still no errors showing since I added iommu=pt to kernel options on boot.

Stuart

karlmistelberger · May 14, 2021, 11:54am

Without ever tinkering I get:

3400G:~ # journalctl -b --grep iommu
-- Logs begin at Thu 2021-04-29 05:00:44 CEST, end at Fri 2021-05-14 11:50:28 CEST. --
May 13 04:15:39 3400G kernel: iommu: Default domain type: Passthrough
May 13 04:15:39 3400G kernel: pci 0000:00:00.2: AMD-Vi: Unable to read/write to IOMMU perf counter.
May 13 04:15:39 3400G kernel: pci 0000:00:01.0: Adding to iommu group 0
...
May 13 04:15:39 3400G kernel: pci 0000:08:00.6: Adding to iommu group 8
May 13 04:15:39 3400G kernel: pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
May 13 04:15:40 3400G kernel: AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
3400G:~ #
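The first kernel line in this output is the relevant one: this machine already reports Passthrough as the IOMMU default domain type without any boot option, which is presumably the mode that iommu=pt requests. Running the same query on the affected machine, with and without iommu=pt, would show whether the option changes anything there. A small sketch for pulling out just that value follows; `iommu_default_domain` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: pull the IOMMU default domain type out of kernel log text.
# iommu_default_domain is a hypothetical helper name.
iommu_default_domain() {
    # Print the word after "Default domain type:" from the first match.
    sed -n 's/.*iommu: Default domain type: \([A-Za-z]*\).*/\1/p' | head -n 1
}

# Possible use on a running system:
#   journalctl -k -b -o cat --no-pager | iommu_default_domain
```

If both machines print the same value, the disappearance of the errors after adding iommu=pt may be a coincidence rather than an effect of the option.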