First, some background on the relevant hardware and software: my OS is openSUSE Tumbleweed x64 with kernel 5.8.10, Mesa 20.1.8, and xf86-video-amdgpu 19.1.0, and the amdgpu kernel module is in use. My graphics card is an XFX Radeon R9 390.
For roughly the past one or two months, my video card has been crashing the system, apparently due to overheating. Once the edge sensor reads around 90°C, I start seeing small flashing squares corrupting the screen. If I don’t quickly reduce the load, the system soon crashes and reboots on its own, then fails to reach POST for several minutes (I get 3 PC-speaker beeps and the machine won’t boot when powered on). This makes anything that stresses the GPU risky, including most 3D engines. The only workaround I have found is this command:
echo low | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
By forcing the performance level to the lowest frequencies, I’m able to safely run most games, but of course they run extremely slowly, so this is not a real solution. It does, however, indicate that the way the GPU / VRAM frequencies are set is what leads to the overheating and the system crash.
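For anyone who wants to reproduce the workaround, here is a minimal sketch that wraps it in two helper functions so the level can be checked and set back to auto afterwards. The names get_level and set_level are hypothetical, the script assumes card0 is the R9 390 (override GPU_DEV if it is not), and writing the file needs root. The amdgpu driver documents more levels than the four accepted here; the list is kept short on purpose.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and card0 is an assumption.
GPU_DEV="${GPU_DEV:-/sys/class/drm/card0/device}"

# Print the performance level currently in effect.
get_level() {
    cat "$GPU_DEV/power_dpm_force_performance_level"
}

# Write a new level. Must run as root on a real system.
# Only a small subset of the documented levels is accepted here.
set_level() {
    case "$1" in
        auto|low|high|manual) ;;
        *) echo "unsupported level: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "$1" > "$GPU_DEV/power_dpm_force_performance_level"
}
```

After sourcing the file in a root shell, set_level low applies the workaround and set_level auto hands control back to the driver once the heavy workload is closed.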
I’m filing this bug report with FreeDesktop because I believe this is a driver issue rather than a hardware failure: it didn’t happen until several weeks ago, and I remember that a year ago my card used to reach 94°C at times without any graphical corruption or crashing. I can also verify that both GPU fans are working well, though it takes a very long time for them to reach full speed (which I know is also controlled by the power management module). I get the impression that the default parameters are not configured appropriately in the latest versions of the modules, causing the card to be overclocked and reach a dangerous temperature very quickly.
Please let me know what additional information I can provide to help narrow down where this issue resides, such as which part of the driver is overestimating the safe frequencies for my graphics card model and pushing it too far. Also, is there a way to tell amdgpu to cap the clocks at a particular (lower) frequency so that the card doesn’t get overly hot? Just as importantly, how do I tell it to ramp the fans up to full speed sooner?
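In case it helps, below is a rough sketch of how a sensor snapshot could be collected while reproducing the problem. It reads the standard amdgpu sysfs files (temp1_input and fan1_input under hwmon, pp_dpm_sclk, pp_dpm_mclk, power_dpm_force_performance_level). The function names millideg_to_c and gpu_snapshot are hypothetical, card0 is assumed to be the R9 390, and any file that is missing on a given card is simply skipped.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and card0 is an assumption.
GPU_DEV="${GPU_DEV:-/sys/class/drm/card0/device}"

# hwmon reports millidegrees Celsius; convert to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print one snapshot of temperature, fan speed, active clocks and level.
gpu_snapshot() {
    for hw in "$GPU_DEV"/hwmon/hwmon*; do
        if [ -r "$hw/temp1_input" ]; then
            echo "temp: $(millideg_to_c "$(cat "$hw/temp1_input")") C"
        fi
        if [ -r "$hw/fan1_input" ]; then
            echo "fan: $(cat "$hw/fan1_input") RPM"
        fi
    done
    # In pp_dpm_sclk / pp_dpm_mclk the active state is the line marked '*'.
    if [ -r "$GPU_DEV/pp_dpm_sclk" ]; then
        echo "sclk: $(grep '\*' "$GPU_DEV/pp_dpm_sclk")"
    fi
    if [ -r "$GPU_DEV/pp_dpm_mclk" ]; then
        echo "mclk: $(grep '\*' "$GPU_DEV/pp_dpm_mclk")"
    fi
    if [ -r "$GPU_DEV/power_dpm_force_performance_level" ]; then
        echo "level: $(cat "$GPU_DEV/power_dpm_force_performance_level")"
    fi
}
```

Calling gpu_snapshot once per second in a loop while a 3D load is running, and redirecting the output to a file, would give a log of clocks and temperature leading up to the corruption.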