amdgpu is overheating the graphics card and crashing the system (Radeon XFX R9 390)

First some background on the relevant hardware and software: My OS is openSUSE Tumbleweed x64, Kernel 5.8.10, Mesa 20.1.8, xf86-video-amdgpu 19.1.0, amdgpu module in use. My graphics card is a Radeon XFX R9 390.

Starting roughly one or two months ago, my video card has been causing the system to crash due to overheating. Once the edge sensor reads around 90°C I start seeing little flashing squares corrupting my screen. Soon afterward, if I don’t quickly reduce the load, the system crashes and reboots on its own, then refuses to start up and reach POST for several minutes (I get 3 PC-speaker beeps and the machine won’t boot once powered on). This makes working with anything that stresses the GPU a danger, including most 3D engines. The only workaround I found is using the command:

echo low | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

By forcing the performance level to the lowest frequencies I’m able to safely run most games, but of course they run extremely slowly, so this is not a real solution. It does, however, confirm that the way GPU / VRAM frequencies are set is causing the overheating and the system crash.
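For reference, the workaround above can be wrapped in a small helper. This is only a minimal sketch: the function names and the GPU_SYSFS variable are made up for illustration, and it assumes the card is card0 as in the command above. Writing to the file needs root.

```shell
# Base sysfs directory of the GPU; card0 matches the path used above.
# GPU_SYSFS is a hypothetical override variable for this sketch.
GPU_SYSFS="${GPU_SYSFS:-/sys/class/drm/card0/device}"

# Print the currently forced performance level (auto, low, high, manual).
get_perf_level() {
    cat "$GPU_SYSFS/power_dpm_force_performance_level"
}

# Set the level, accepting only the basic values; anything else is rejected
# with a non-zero return code and nothing is written.
set_perf_level() {
    case "$1" in
        auto|low|high|manual) ;;
        *) echo "unsupported level: $1" >&2; return 1 ;;
    esac
    echo "$1" > "$GPU_SYSFS/power_dpm_force_performance_level"
}
```

Run as root, `set_perf_level low` before starting a heavy application applies the workaround, and `set_perf_level auto` hands control back to the driver afterwards.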

I’m filing a bug report with FreeDesktop as I believe this is a driver issue rather than a hardware failure: it didn’t happen until several weeks ago, and I remember that a year ago my card used to reach 94°C at times without any graphical corruption or crashing. I can also verify that the two GPU fans are working well, though it takes a very long time for them to come on at full power (which I know is also controlled by the power management module). I get the impression the default parameters might not be configured correctly for the latest versions of the modules, causing the card to get overclocked and reach a dangerous temperature very quickly.

Please let me know what further info I can provide to help pin down where this issue resides, such as which part of the driver is overestimating the safe frequencies of my graphics card model and pushing it too far. Also, is there a way to tell amdgpu to cap the clocks at a particular (lower) frequency to make sure it doesn’t reach the point where it gets overly hot? Just as importantly, how do I tell it to start the fans at full speed sooner?
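One possible way to cap the clocks, sketched under assumptions: amdgpu lists the available core clock levels in pp_dpm_sclk, next to the power_dpm_force_performance_level file, and once the performance level is set to manual, writing a list of level indices to that file restricts the driver to those levels. The helper name levels_up_to is hypothetical, and whether this interface behaves well on this particular card has not been tested here.

```shell
# Read the contents of pp_dpm_sclk on stdin (lines such as "2: 727Mhz *")
# and print the indices of all levels whose clock is at most $1 MHz,
# separated by spaces.
levels_up_to() {
    awk -v cap="$1" '
        { idx = $1; sub(/:$/, "", idx) }                   # "2:" -> "2"
        ($2 + 0) <= cap { out = out (out == "" ? "" : " ") idx }
        END { print out }
    '
}

# Possible use, as root (750 MHz is an arbitrary example cap):
#   cd /sys/class/drm/card0/device
#   echo manual > power_dpm_force_performance_level
#   levels="$(levels_up_to 750 < pp_dpm_sclk)"
#   echo "$levels" > pp_dpm_sclk
```

Writing `auto` back to power_dpm_force_performance_level should undo the restriction.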

I assume the GPU card has a fan??? If so does it work???

Hi
Any options set, assuming you are using the amdgpu driver as opposed to radeon?


/sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX_DEFAULT

Two fans: one is on all the time, the other only comes on at very hot temperatures (by design). The only noticeable fan issue is that the card must heat up a lot (past 90°C) before I start hearing the fans going loud… they are controlled by the driver (amdgpu pwm), meaning the module waits an overly long time before fully turning them on.

mircea@linux-qz0r:~> cat /etc/default/grub | grep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="splash=silent quiet radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 mitigations=auto"
mircea@linux-qz0r:~> /sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] [1002:67b1] (rev 80)
        Subsystem: XFX Pine Group Inc. Device [1682:9390]
        Kernel driver in use: amdgpu
        Kernel modules: radeon, amdgpu

Hi
Can you blacklist the radeon module, and only have amdgpu.si_support=1 with cik set to 0? After you blacklist, run mkinitrd and see how that goes for the moment. Can you also ensure xf86-video-ati is not installed (add a lock if necessary)?

Some important news: It appears this may in fact be an issue with amdgpu specifically. I booted my system on the radeon module by temporarily removing the kernel parameters “radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1”. The console output and renamed temperature sensor confirmed the switch.

mircea@linux-qz0r:~> /sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] [1002:67b1] (rev 80)
        Subsystem: XFX Pine Group Inc. Device [1682:9390]
        Kernel driver in use: radeon
        Kernel modules: radeon, amdgpu

Versus (on amdgpu):

mircea@linux-qz0r:~> /sbin/lspci -nnk | egrep -A3 "VGA|Display|3D"
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] [1002:67b1] (rev 80)
        Subsystem: XFX Pine Group Inc. Device [1682:9390]
        Kernel driver in use: amdgpu
        Kernel modules: radeon, amdgpu
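To compare the two boots without reading the whole listing, the bound driver can be pulled out of the lspci output. A minimal sketch; the function name gpu_driver_in_use is invented for this example, and it only assumes the output layout shown above.

```shell
# Read `lspci -nnk` output on stdin and print the bound kernel driver of
# every VGA/Display/3D controller, one per line.
gpu_driver_in_use() {
    awk '
        # A device header starts in column one with the bus address.
        /^[0-9a-f]/ { is_gpu = ($0 ~ /VGA|Display|3D/) }
        # Indented detail lines belong to the most recent header.
        is_gpu && /Kernel driver in use:/ { print $NF }
    '
}

# Possible use:
#   /sbin/lspci -nnk | gpu_driver_in_use
```

With the two listings above this would print radeon for the first boot and amdgpu for the second.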

I then proceeded to open up Blender and load up an Eevee scene, one that I know would overheat my GPU within seconds if the viewport was set to rendered view. This time however there were no issues! I got some stretched vertices likely caused by another unrelated bug, but no crashes or the square glitches caused by overheating as I moved the view around.

Watching the sensors in a console explains why: the GPU was never allowed to exceed 84°C on radeon, unlike the 94°C it reaches on amdgpu. That is precisely the safe temperature I noticed for my card, since the square glitches start occurring from 88°C. Just as importantly, the optional fan on the card started spinning soon after 80°C; on amdgpu it doesn’t spin until over 90°C, which is extremely high.

Already I can see something out of the ordinary in the outputs. Here’s a snapshot of the “watch sensors” command while on the radeon module (under load by Blender):

Every 2.0s: sensors                                                                                      linux-qz0r: Fri Oct  2 22:47:34 2020

k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.32 V
Vsoc:          1.09 V
Tctl:         +61.2°C
Tdie:         +61.2°C
Tccd1:        +52.5°C
Icore:         7.00 A
Isoc:          8.00 A

nvme-pci-0100
Adapter: PCI adapter
Composite:    +54.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +54.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +54.9°C  (low  = -273.1°C, high = +65261.8°C)

radeon-pci-0a00
Adapter: PCI adapter
temp1:        +82.0°C  (crit = +120.0°C, hyst = +90.0°C)

Now here’s what the same command looks like while on amdgpu:

Every 2.0s: sensors                                                                                      linux-qz0r: Fri Oct  2 23:14:22 2020

k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.32 V
Vsoc:          1.09 V
Tctl:         +44.8°C
Tdie:         +44.8°C
Tccd1:        +46.2°C
Icore:        10.00 A
Isoc:          8.00 A

nvme-pci-0100
Adapter: PCI adapter
Composite:    +47.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +49.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0a00
Adapter: PCI adapter
vddgfx:        1.04 V
fan1:             N/A  (min =    0 RPM, max = 6000 RPM)
edge:         +58.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       59.15 W  (cap = 208.00 W)

Notice the GPU sensor (named temp1 on radeon and edge on amdgpu): (crit = +120.0°C, hyst = +90.0°C) in the first, (crit = +104000.0°C, hyst = -273.1°C) in the latter. The temperature limits in the second version look like broken values! Could this be a clue?
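A quick way to spot such limits in any sensors listing is sketched below. The function name and the 150°C plausibility threshold (MAX_PLAUSIBLE_C) are assumptions made for this example, not values taken from the driver. As a guess only: 104000 would be 104°C if the number were in millidegrees, but nothing in this thread confirms that.

```shell
# Read `sensors` output on stdin and report every critical limit that is
# higher than MAX_PLAUSIBLE_C (default 150), as "chip: label crit=value".
flag_bad_limits() {
    awk -v max="${MAX_PLAUSIBLE_C:-150}" '
        /^[^ :]+$/    { chip = $0 }                           # chip header, e.g. amdgpu-pci-0a00
        /^[^ ][^:]*:/ { label = $0; sub(/:.*/, "", label) }   # sensor label, e.g. edge
        /crit = / {
            value = $0
            sub(/.*crit = [+]?/, "", value)   # keep the text starting at the number
            crit = value + 0                  # conversion stops at the degree sign
            if (crit > max) print chip ": " label " crit=" crit
        }
    '
}

# Possible use:
#   sensors | flag_bad_limits
```

On the two listings above, only the amdgpu edge sensor would be reported; the radeon limit of 120°C passes.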

Until that is solved I can use the legacy radeon module as a workaround if need be. However, I don’t wish to do so for too long: it’s an older, slower driver, and its lack of performance improvements and outdated architecture will likely show in modern games. amdgpu is the normal driver even if it’s still not enabled by default on GCN 1.0 / 2.0 cards, and apart from this issue it’s working perfectly.

Just ruled out DisplayCore as well since I know it’s a usual suspect: booting with the additional parameter “amdgpu.dc=0” does not affect the issue; the card will still overheat on amdgpu but not radeon.

One of the amdgpu maintainers needs help with some openSUSE specific information in figuring this out. If anyone knows the answer please reply so I can forward it to him.

https://gitlab.freedesktop.org/drm/amd/-/issues/1315#note_642829

TLDR: Please confirm which of the following patches are included in the latest official kernel of openSUSE Tumbleweed, presently snapshot 20200930 / kernel 5.8.10:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f7b2e34b4afb8d712913dc199d3292ea9e078637
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a84ae87fe6123f9521eea48e405f8ad74e2b8ad
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d98299885c9ea140c1108545186593deba36c4ac
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f87812284172a9809820d10143b573d833cd3f75
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7ce016e71a8e8db239d0113e06a47fdf60fd8ea3
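One way to check this, sketched under assumptions: stable kernel backports normally quote the upstream commit id in their commit message, so searching a kernel history for the ids above shows which ones are there. The function name check_commits_in_log and the clone path in the usage comment are hypothetical, and this only covers the upstream stable tree; patches that openSUSE adds on top would have to be checked separately in the openSUSE kernel sources.

```shell
# Read a log (for example `git log` output) on stdin and, for every commit
# id given as an argument, print "<id> present" or "<id> absent".
check_commits_in_log() {
    log_file="$(mktemp)" || return 1
    cat > "$log_file"              # buffer stdin so it can be searched once per id
    for id in "$@"; do
        if grep -qF "$id" "$log_file"; then
            echo "$id present"
        else
            echo "$id absent"
        fi
    done
    rm -f "$log_file"
}

# Possible use, assuming a clone of the stable kernel tree in ~/linux-stable:
#   git -C ~/linux-stable log v5.8.10 | check_commits_in_log \
#       f7b2e34b4afb8d712913dc199d3292ea9e078637 \
#       5a84ae87fe6123f9521eea48e405f8ad74e2b8ad
```

The remaining ids from the list above can be appended as further arguments.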

Radeon has multiple meanings, as does amdgpu. It might be worth trying both kernel drivers along with the upstream default DDX driver, Modesetting. The Modesetting DDX can be enabled either via /etc/X11/xorg.con* containing ‘Driver “modesetting”’ in ‘Section “Device”’, or by uninstalling the DDX(es) xf86-video-ati and/or xf86-video-amdgpu.

Since this seemed relevant, here’s the output of “cat /sys/kernel/debug/dri/0/amdgpu_pm_info”, which shows the clock-related video card settings. I ran the command while stressing the GPU with Blender in case that reveals more info.

Clock Gating Flags Mask: 0x0
        Graphics Medium Grain Clock Gating: Off
        Graphics Medium Grain memory Light Sleep: Off
        Graphics Coarse Grain Clock Gating: Off
        Graphics Coarse Grain memory Light Sleep: Off
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: Off
        Graphics Run List Controller Light Sleep: Off
        Graphics 3D Coarse Grain Clock Gating: Off
        Graphics 3D Coarse Grain memory Light Sleep: Off
        Memory Controller Light Sleep: Off
        Memory Controller Medium Grain Clock Gating: Off
        System Direct Memory Access Light Sleep: Off
        System Direct Memory Access Medium Grain Clock Gating: Off
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: Off
        Unified Video Decoder Medium Grain Clock Gating: Off
        Video Compression Engine Medium Grain Clock Gating: Off
        Host Data Path Light Sleep: Off
        Host Data Path Medium Grain Clock Gating: Off
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: Off
        Rom Medium Grain Clock Gating: Off
        Data Fabric Medium Grain Clock Gating: Off
        Address Translation Hub Medium Grain Clock Gating: Off
        Address Translation Hub Light Sleep: Off

GFX Clocks and Power:
        1500 MHz (MCLK)
        999 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        150 MHz (PSTATE_MCLK)
        1206 mV (VDDGFX)
        206.147 W (average GPU)

GPU Temperature: 88 C
GPU Load: 100 %
MEM Load: 12 %

UVD: Disabled

VCE: Disabled 
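To follow just the interesting numbers from this file while a load is running, the output can be condensed to one line. A minimal sketch; the function name pm_summary and the output format are made up here, and it assumes the file layout shown above.

```shell
# Read amdgpu_pm_info on stdin and print core clock, memory clock,
# temperature, load and average power on a single line.
pm_summary() {
    awk '
        /\(SCLK\)/         { sclk  = $1 }   # "(PSTATE_SCLK)" does not match
        /\(MCLK\)/         { mclk  = $1 }
        /average GPU/      { power = $1 }
        /^GPU Temperature:/ { temp = $3 }
        /^GPU Load:/        { load = $3 }
        END {
            printf "sclk=%s mclk=%s temp=%s load=%s power=%s\n",
                   sclk, mclk, temp, load, power
        }
    '
}

# Possible use, as root:
#   watch -n 2 'cat /sys/kernel/debug/dri/0/amdgpu_pm_info | pm_summary'
# (pm_summary has to be defined in the shell that watch starts, for example
# by putting it into a small script file.)
```

For the dump above, the line shows the card drawing about 206 W at 88 C, close to the 208 W cap reported by sensors earlier.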


Some nice news: I was able to improve the situation on the hardware side, by removing the radiator and applying new thermal paste to the GPU. The old paste had dried out and solidified in places… on top of that, some radiator screws were worryingly loose and I tightened them. A positive outcome is already noticeable: idle temperature seems to be a tiny bit lower, the card takes longer to heat up, and although the “edge” sensor will still cap out at 94°C I no longer seem to get the square corruption and system crash right away. I only ran a brief test and sustained heating may still cause issues, but new thermal paste definitely helped for now.

It remains arguable whether amdgpu is still at fault for not picking up on the problem and doing something to prevent it: letting the card heat up to 94°C still feels like too much in my opinion, even if it’s probably allowed to reach this temperature by design. That is especially true if the driver doesn’t have a way to detect when this heat is about to cause a system failure, putting other cards with worn thermal paste or lowered heat tolerance in danger. What do you think?

I do believe the amdgpu PWM module has at least one problem: it takes far too long to turn on the secondary fan. Even after the sensor reached the maximum temperature mentioned, I only heard the back fan come on briefly for a few seconds after the card had stayed at that temperature for a while. What decides when the fans will run at full power, and could the driver be tweaked to make this happen a little sooner for good measure?

Hi
Making progress :) Yes, you need to tweak the fancontrol configuration generated by pwmconfig (/etc/fancontrol) to your requirements. Have a read of the fancontrol man page as well…
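Before setting up fancontrol, the fan can also be driven by hand through the same hwmon files that fancontrol uses. This is a different, manual approach, shown only as a sketch: it assumes the amdgpu hwmon directory of card0 provides pwm1 and pwm1_enable (1 for manual, 2 for automatic control), and the function names and the HWMON_DIR variable are invented for this example. In manual mode the driver no longer adjusts the fan on its own, so automatic control should be restored afterwards.

```shell
# Print the first hwmon directory of card0 that has a pwm1 file.
find_gpu_hwmon() {
    for d in /sys/class/drm/card0/device/hwmon/hwmon*; do
        if [ -e "$d/pwm1" ]; then echo "$d"; return 0; fi
    done
    return 1
}

# Convert a percentage (clamped to 0..100) to the 0..255 pwm scale.
percent_to_pwm() {
    p="$1"
    if [ "$p" -lt 0 ]; then p=0; fi
    if [ "$p" -gt 100 ]; then p=100; fi
    echo $(( p * 255 / 100 ))
}

# Switch to manual control and set the fan duty cycle to $1 percent.
set_fan_percent() {
    dir="${HWMON_DIR:-$(find_gpu_hwmon)}"
    [ -n "$dir" ] || { echo "no pwm1 file found" >&2; return 1; }
    echo 1 > "$dir/pwm1_enable"            # 1 = manual fan control
    percent_to_pwm "$1" > "$dir/pwm1"
}

# Hand fan control back to the driver.
restore_auto_fan() {
    dir="${HWMON_DIR:-$(find_gpu_hwmon)}"
    [ -n "$dir" ] || return 1
    echo 2 > "$dir/pwm1_enable"            # 2 = automatic fan control
}
```

Run as root, `set_fan_percent 80` before a heavy workload and `restore_auto_fan` afterwards would be a typical sequence.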

Updating the video card BIOS may also help.

How to apply thermal compound: Make sure to spread a thin layer evenly over the entire surface. Tighten the screws evenly.

https://www.youtube.com/watch?v=p7zlEQUBFYU

How to apply thermal compound:

https://youtu.be/bCh2oURSuZc?t=190

https://www.youtube.com/watch?v=NlqKWqoibJI