Thread: AMDGPU errors and occasional crashes/hangs

  1. #1
    Join Date
    Jan 2016
    Location
    UK
    Posts
    754

    AMDGPU errors and occasional crashes/hangs

    I am running TW/KDE on an AMD Ryzen 5 3400G with Vega graphics. Recently quite a few errors have been showing up in the journal; I did not notice them at first because they caused no visible problems. There have been several like this over several days:-

    Code:
    amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg.bin pid 1552 thread Xorg.bin:cs0 pid 1561)
    amdgpu:   in page starting at address 0x800101686000 from client 27
    amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
    amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
    amdgpu:          MORE_FAULTS: 0x1
    amdgpu:          WALKER_ERROR: 0x0
    amdgpu:          PERMISSION_FAULTS: 0x3
    amdgpu:          MAPPING_ERROR: 0x0
    amdgpu:          RW: 0x0
    Then on a couple of occasions I have had a system crash/hang, which made me look at the journal, where I saw those errors plus these new ones:-

    Code:
    [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=428
    [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 
    amdgpu 0000:2a:00.0: amdgpu: GPU reset begin!
    amdgpu 0000:2a:00.0: amdgpu: MODE2 reset
    amdgpu 0000:2a:00.0: amdgpu: GPU reset succeeded, trying to resume
    amdgpu 0000:2a:00.0: amdgpu: RAS: optional ras ta ucode is not available
    amdgpu 0000:2a:00.0: amdgpu: RAP: optional rap ta ucode is not available
    amdgpu 0000:2a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not avail
    amdgpu 0000:2a:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
    amdgpu 0000:2a:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
    amdgpu 0000:2a:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
    amdgpu 0000:2a:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
    amdgpu 0000:2a:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
    amdgpu 0000:2a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
    amdgpu 0000:2a:00.0: amdgpu: recover vram bo from shadow start
    amdgpu 0000:2a:00.0: amdgpu: recover vram bo from shadow done
    amdgpu 0000:2a:00.0: amdgpu: GPU reset(1) succeeded!
    amdgpu 0000:2a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=
    amdgpu 0000:2a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=
    [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=18605
    [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmas
    amdgpu 0000:2a:00.0: amdgpu: GPU reset begin!
    amdgpu 0000:2a:00.0: amdgpu: MODE2 reset
    amdgpu 0000:2a:00.0: amdgpu: GPU reset succeeded, trying to resume
    amdgpu 0000:2a:00.0: amdgpu: RAS: optional ras ta ucode is not available
    amdgpu 0000:2a:00.0: amdgpu: RAP: optional rap ta ucode is not available
    amdgpu 0000:2a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not avail
    amdgpu 0000:2a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 
    [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_
    amdgpu 0000:2a:00.0: amdgpu: GPU reset(3) failed
    amdgpu 0000:2a:00.0: amdgpu: GPU reset end with ret = -110
    [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    amdgpu 0000:2a:00.0: amdgpu: couldn't schedule ib on ring <sdma0>
    [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
    [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
    at which point I had to hard reset the system to reboot it.
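
A long excerpt like the one above is easier to reason about once the repeated lines are counted. The following is only a rough sketch: the regular expressions are lifted from the message text quoted in this post, while the category names and the `summarise` helper are made up for illustration and are not part of any amdgpu tooling.

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted in this thread.
# The keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "ring_timeout": re.compile(r"\*ERROR\* ring (\S+) timeout"),
    "reset_failed": re.compile(r"GPU reset\(\d+\) failed"),
    "reset_ok": re.compile(r"GPU reset\(\d+\) succeeded"),
    "ib_sched_error": re.compile(r"Error scheduling IBs \((-?\d+)\)"),
    "page_fault": re.compile(r"retry page fault"),
}


def summarise(lines):
    """Count how often each known message type appears in the given lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                # Append the captured ring name or error code when there is one.
                detail = match.group(1) if pattern.groups else ""
                counts[f"{name}:{detail}" if detail else name] += 1
    return counts


# A few lines taken verbatim from the excerpt above.
SAMPLE = [
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=428",
    "amdgpu 0000:2a:00.0: amdgpu: GPU reset(1) succeeded!",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=18605",
    "amdgpu 0000:2a:00.0: amdgpu: GPU reset(3) failed",
    "[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)",
    "[drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)",
]

for key, count in sorted(summarise(SAMPLE).items()):
    print(f"{key}: {count}")
```

The same function could be fed the saved output of `journalctl -k` for the boot in question. In Linux errno numbering, -110 is ETIMEDOUT and -22 is EINVAL, which fits a reset that timed out followed by submissions rejected on the ring that never came back.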

    Anyone any ideas please as to what this could be?

    Stuart

  2. #2
    Join Date
    Jan 2014
    Location
    Erlangen
    Posts
    2,671
    Blog Entries
    1

    Re: AMDGPU errors and occasional crashes/hangs

    Quote Originally Posted by broadstairs View Post
    I am running TW/KDE on an AMD Ryzen 5 3400G with Vega graphics. Recently quite a few errors have been showing up in the journal; I did not notice them at first because they caused no visible problems. There have been several like this over several days:-
    ...
    Then on a couple of occasions I have had a system crash/hang, which made me look at the journal, where I saw those errors plus these new ones:-
    ...
    at which point I had to hard reset the system to reboot it.

    Anyone any ideas please as to what this could be?

    Stuart
    No such issues here with current system:
    Code:
    3400G:~ # inxi -SCMmGz 
    System:    Kernel: 5.12.0-2-default x86_64 bits: 64 Console: tty pts/1 Distro: openSUSE Tumbleweed 20210507  
    Machine:   Type: Desktop Mobo: ASUSTeK model: PRIME B450-PLUS v: Rev X.0x serial: <filter> UEFI: American Megatrends v: 2409  
               date: 12/02/2020  
    Memory:    RAM:total: 29.27 GiB used: 3.84 GiB (13.1%)  
               Array-1:capacity: 128 GiB slots: 4 EC: None  
               Device-1: DIMM_A1 size: No Module Installed  
               Device-2: DIMM_A2 size: 16 GiB speed: 2133 MT/s  
               Device-3: DIMM_B1 size: No Module Installed  
               Device-4: DIMM_B2 size: 16 GiB speed: 2133 MT/s  
    CPU:       Info: Quad Core model: AMD Ryzen 5 3400G with Radeon Vega Graphics bits: 64 type: MT MCP cache:L2: 2 MiB  
               Speed: 1246 MHz min/max: 1400/3700 MHz Core speeds (MHz):1: 1246 2: 1337 3: 1255 4: 1313 5: 1255 6: 1258 7: 1356  
               8: 1259  
    Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Picasso driver: amdgpu v: kernel  
               Display:server: X.Org 1.20.11 driver:loaded: amdgpu,ati unloaded: fbdev,modesetting,vesa  
               resolution: 1920x1200~60Hz  
               OpenGL:renderer: AMD Radeon Vega 11 Graphics (RAVEN DRM 3.40.0 5.12.0-2-default LLVM 12.0.0) v: 4.6 Mesa 21.0.2  
    3400G:~ #
    Run all of the tests in the free download: https://www.memtest86.com/download.htm
    AMD Athlon 4850e (2009), openSUSE 13.1, KDE 4, Intel i3-4130 (2014), i7-6700K (2016), i5-8250U (2018), AMD Ryzen 5 3400G (2020), openSUSE Tumbleweed, KDE Plasma 5

  3. #3
    Join Date
    Jan 2016
    Location
    UK
    Posts
    754

    Re: AMDGPU errors and occasional crashes/hangs

    Well, MemTest86 ran for about 1 hour 30 minutes before I stopped it, with no errors. Just for completeness, I ran that command:

    Code:
    inxi -SCMmGz
    System:    Kernel: 5.12.0-2-default x86_64 bits: 64 Desktop: KDE Plasma 5.21.5 Distro: openSUSE Tumbleweed 20210507 
    Machine:   Type: Desktop Mobo: Micro-Star model: B450-A PRO MAX (MS-7B86) v: 4.0 serial: <filter> UEFI: American Megatrends 
               v: M.70 date: 06/17/2020 
    Memory:    RAM: total: 29.31 GiB used: 2.52 GiB (8.6%) 
               Array-1: capacity: 128 GiB slots: 4 EC: None 
               Device-1: DIMM 0 size: 8 GiB speed: 2133 MT/s 
               Device-2: DIMM 1 size: 8 GiB speed: 2133 MT/s 
               Device-3: DIMM 0 size: 8 GiB speed: 2133 MT/s 
               Device-4: DIMM 1 size: 8 GiB speed: 2133 MT/s 
    CPU:       Info: Quad Core model: AMD Ryzen 5 3400G with Radeon Vega Graphics bits: 64 type: MT MCP cache: L2: 2 MiB 
               Speed: 3700 MHz min/max: N/A Core speeds (MHz): 1: 3700 2: 3700 3: 3700 4: 3700 5: 3700 6: 1279 7: 3700 8: 3700 
    Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Picasso driver: amdgpu v: kernel 
               Device-2: Logitech QuickCam E 3500 type: USB driver: snd-usb-audio,uvcvideo 
               Display: x11 server: X.org 1.20.11 driver: loaded: amdgpu,ati unloaded: fbdev,modesetting,vesa 
               resolution: <missing: xdpyinfo> 
               OpenGL: renderer: AMD Radeon Vega 11 Graphics (RAVEN DRM 3.40.0 5.12.0-2-default LLVM 12.0.0) v: 4.6 Mesa 21.0.2
    Stuart

  4. #4
    Join Date
    Jan 2016
    Location
    UK
    Posts
    754

    Re: AMDGPU errors and occasional crashes/hangs

    Still seeing

    Code:
    [gfxhub0] retry page fault (src_id:0 ring:0 vmid:7 pasid:32772, for process plasmashell pid 2027 thread plasmashel:cs0 pid 2089)
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:   in page starting at address 0x800105089000 from client 27
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00741051
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          MORE_FAULTS: 0x1
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          WALKER_ERROR: 0x0
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          MAPPING_ERROR: 0x0
    May 09 15:09:57 Tumbleweed.Crowhill kernel: amdgpu 0000:2a:00.0: amdgpu:          RW: 0x1
    on reboot after MemTest86.

    Am I correct in assuming the fault could be within the memory reserved for the graphics, which I'm guessing MemTest86 does not touch?
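
For comparing the two VM_L2_PROTECTION_FAULT_STATUS values posted so far, a small decoder can help. This is a sketch under a stated assumption: the bit positions below are what I understand the GFX9 register layout used by amdgpu to be, and they have only been cross-checked against the two status values and the decoded fields printed in this thread. It does not answer where the faulting page lives.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS (GFX9 / Raven-class parts).
# Each entry is (shift, width). Verified here only against the two values
# quoted in this thread, not against every kernel version.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # the "Faulty UTCL2 client ID" number
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


# The two raw values reported in the journal excerpts of this thread.
for raw in (0x00101031, 0x00741051):
    fields = decode_fault_status(raw)
    print(f"0x{raw:08x}: " + ", ".join(f"{k}=0x{v:x}" for k, v in fields.items()))
```

Both values decode to client ID 0x8 with MORE_FAULTS set and no walker or mapping error; they differ in PERMISSION_FAULTS (0x3 versus 0x5), RW (0x0 versus 0x1) and VMID (1 versus 7), which matches the fields and the vmid that the kernel printed in each case.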

    Stuart

  5. #5
    Join Date
    Jun 2008
    Location
    Podunk
    Posts
    31,277
    Blog Entries
    15

    Re: AMDGPU errors and occasional crashes/hangs

    Hi
    Similar issue here: https://bugs.freedesktop.org/show_bug.cgi?id=104192

    Likely a regression of some sort in amdgpu; create a bug report against the kernel (see openSUSE:Submitting bug reports).
    Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
    SUSE SLE, openSUSE Leap/Tumbleweed (x86_64) | GNOME DE

  6. #6
    Join Date
    Jan 2016
    Location
    UK
    Posts
    754

    Re: AMDGPU errors and occasional crashes/hangs

    Quote Originally Posted by malcolmlewis View Post
    Hi
    Similar issue here: https://bugs.freedesktop.org/show_bug.cgi?id=104192

    Likely a regression of some sort in amdgpu; create a bug report against the kernel (see openSUSE:Submitting bug reports).
    OK, I'll do that. The errors started on 26th April, and kernel-firmware-amdgpu 20210419 was installed the previous day. I have no idea whether that was the cause, but it seems a bit of a coincidence.
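
One way to check that kind of timing more systematically is to list what was installed shortly before the first error. The sketch below assumes the pipe-separated layout of the zypp history log (normally `/var/log/zypp/history`, with timestamp, action, package name and version as the first four fields). The `installs_before` helper is hypothetical, and in the sample lines only the package name, version and date come from this post; the time of day, the second package and the empty trailing fields are placeholders.

```python
from datetime import datetime, timedelta


def installs_before(history_lines, first_error, days=3, name_filter=""):
    """Return (time, name, version) for installs in the window before first_error."""
    window_start = first_error - timedelta(days=days)
    hits = []
    for line in history_lines:
        # Skip comments and blank lines in the history log.
        if line.startswith("#") or not line.strip():
            continue
        parts = line.rstrip("\n").split("|")
        if len(parts) < 4 or parts[1] != "install":
            continue
        when = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S")
        if window_start <= when < first_error and name_filter in parts[2]:
            hits.append((when, parts[2], parts[3]))
    return hits


# Placeholder history lines: only the amdgpu package name, its version and
# the install date (the day before 26th April) are taken from this thread.
SAMPLE_HISTORY = [
    "# hypothetical excerpt",
    "2021-04-10 12:00:00|install|example-package|1.0|noarch||placeholder-repo||",
    "2021-04-25 12:00:00|install|kernel-firmware-amdgpu|20210419|noarch||placeholder-repo||",
]

for when, name, version in installs_before(
    SAMPLE_HISTORY, datetime(2021, 4, 26), days=3, name_filter="amdgpu"
):
    print(when.date(), name, version)
```

A match like this only shows that the update and the first error are close in time. Booting with the previous firmware package, if it is still available, would be a more direct test.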

    Stuart

  7. #7
    Join Date
    Jan 2014
    Location
    Erlangen
    Posts
    2,671
    Blog Entries
    1

    Re: AMDGPU errors and occasional crashes/hangs

    Quote Originally Posted by broadstairs View Post
    Well, MemTest86 ran for about 1 hour 30 minutes before I stopped it, with no errors.
    Did you run all 13 tests including the hammer test? I found several systems successfully ran for hours, but eventually failed while running #13 hammer test.

  8. #8
    Join Date
    Jan 2016
    Location
    UK
    Posts
    754

    Re: AMDGPU errors and occasional crashes/hangs

    Quote Originally Posted by karlmistelberger View Post
    Did you run all 13 tests including the hammer test? I found several systems successfully ran for hours, but eventually failed while running #13 hammer test.
    Yes it ran everything once and was up to test 5 on the second run.

    Stuart

  9. #9
    Join Date
    Jan 2014
    Location
    Erlangen
    Posts
    2,671
    Blog Entries
    1

    Re: AMDGPU errors and occasional crashes/hangs

    Quote Originally Posted by broadstairs View Post
    Yes it ran everything once and was up to test 5 on the second run.

    Stuart
    Be patient and run it overnight. Sometimes replacing RAM modules helps, even if no errors are reported by memtest. I have never seen any of your errors on my system. To me it looks like a hardware problem. Disconnect all components and reconnect only what is needed. Temporarily omit optional components, such as all but one RAM module. Check temperatures, reset the UEFI settings to optimized defaults ...

  10. #10

    Re: AMDGPU errors and occasional crashes/hangs

    Quote Originally Posted by karlmistelberger View Post
    Be patient and run it overnight. Sometimes replacing RAM modules helps, even if no errors are reported by memtest. I have never seen any of your errors on my system. To me it looks like a hardware problem. Disconnect all components and reconnect only what is needed. Temporarily omit optional components, such as all but one RAM module. Check temperatures, reset the UEFI settings to optimized defaults ...
    , and update BIOS.
