Intermittent grey screen crashing with audio stutter [AMDGPU]

Hello all,

I have had this problem for a while, and it has recently become noticeable enough that I need real help with it. Whenever I do anything 3D related (Source engine games or the Hammer editor), there is a chance of a random crash: the system locks up with a grey screen, the last played audio loops for a couple of seconds, then the video signal is lost and all my fans run at 100%. I put this in the hardware section because no matter where I looked, I could NOT find any indication in any logs that this is software related. I checked journalctl, /var/crash/, dmesg, and even boot.msg. None of them showed any evidence of a crash, only the normal boot process with the AMDGPU driver loading and video modes being set just fine. Because of this, I have gone through my system part by part to see what the issue could be.

CPU: I have a Pinnacle Ridge Ryzen processor (Ryzen 5 2600X) running at stock speed. Over the last couple of months, however, I noticed that the stock cooler was not cutting it: in 3D applications such as games, temps rose to 70-80 degrees C. I recently replaced the stock AM4 cooler with a Scythe Fuma 2 and have had great temperatures since (idle in the 30s, and the highest I have seen while gaming is about 60 degrees C). I do not believe overheating of this component could cause a full system crash of this caliber.

RAM: Corsair Vengeance RAM rated for 2400 MHz CL14. Ran it in memtest86 for 3 hours. 4 passes. Zero errors.

GPU: I have had a Vega 56 (Vega 10 architecture, AMDGPU) for the past two years. On my Windows install (on a separate drive) it crashed on me a couple of times, but that was due to a faulty PSU/overclock, and while the crashes were similar (on Windows it gave me a red screen for a minute or two), it actually came back to the desktop saying that Radeon Software had to recover. Other than that, this card has been fine overall. There is no visible artifacting while displaying 3D content and the cooler keeps the card at around 60 degrees C while gaming. I even set the BIOS (the card has a BIOS switch) to silent mode to see if it made a difference, and it did not. I have not overclocked the card since using it on Windows, and other than the crashing it has never given any errors in the logs as far as I can tell. It is connected to my monitor via HDMI, which I know KDE has a hard time with, because when I connected it to another monitor via DisplayPort the monitor could be shut off without giving me a lost signal and requiring a restart. I have also re-enabled AMD PSP in my BIOS, after disabling it for security reasons, to see if that was causing any conflicts with the video drivers. I will update this post if this causes a crash too. I am using the normal AMDGPU driver that came with kernel 5.3.18-lp152.72-default.

PSU: I went from a 600 watt bronze rated PSU to an 850 watt gold rated one to ensure the longevity of my machine in case I upgrade. I do not think any errors could come from it, as it appears to be functioning normally. As mentioned, NONE of my components are overclocked.

SSD: I have openSUSE Leap installed on a Samsung 970 EVO drive with a btrfs file system. I have not seen any disk related errors and read/write speeds are normal. Temps appear to be around 30-40 degrees C at all times. The only thing worth mentioning is that the screw holding it in place is SLIGHTLY longer than a normal M.2 screw, but I do not believe this could cause errors because the drive is quite secure, and I am seeing nothing in the logs relating to drive or filesystem errors.

I have covered pretty much everything I have found about this issue. The only other option I can think of is a BIOS flash, but my motherboard's manufacturer (ASRock) does not recommend updating unless I have a 3000 series CPU or newer. I have tried getting help from the docs to no avail, which is why I’m asking about it here. What else is there to do about an error of this extent? In my several years of repairing and troubleshooting computers, I would never have guessed that my own personal computer would give me the hardest ongoing issue I have ever dealt with. Please help.

In case it’s a kernel bug, following these instructions about cmdline options may put clues into dmesg that you wouldn’t otherwise see:

  • Always add drm.debug=0xe to the kernel command line to get details in kernel log
  • When attaching the dmesg also make sure that it is complete and contains everything from the first boot messages. If there’s too much in dmesg so that the kernel dmesg buffer overflows and overwrites early boot messages, you can extend the dmesg buffer by adding log_buf_len=4M on the kernel cmdline. Increase the size even more if that’s not enough.
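
To confirm after a reboot that both options actually reached the kernel, something like the following rough Python sketch could be used. It only reads /proc/cmdline; the function names and the default list of wanted parameters are my own choices for illustration, not part of any tool.

```python
def missing_kernel_params(cmdline, wanted=("drm.debug", "log_buf_len")):
    """Return the wanted parameter names that do not appear on a kernel command line.

    `cmdline` is the text of /proc/cmdline. Only the name before '=' is
    compared, so any value (0xe, 4M, 16M, ...) counts as present.
    """
    present = {token.split("=", 1)[0] for token in cmdline.split()}
    return [name for name in wanted if name not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

Calling missing_kernel_params(read_cmdline()) on the affected machine should return an empty list once both options are active. If a name is still listed, the bootloader configuration was probably not regenerated after editing.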

There is another X graphics driver (Modesetting DIX) you could try instead of the Amdgpu DDX driver. See OP of this primer for what and how. If it was to solve your problem, it would likely best be reported on the mailing list for AMD driver developers to see if and where a bug report might best be placed (upstream vs openSUSE).

Kernel options that might be worth trying are here.

  1. Update BIOS. Can you update BIOS in graphics card?
  2. Check RAM speed in BIOS. Ryzen 5 2600X uses 2933 MHz by default (up to…).
  3. Use UPS. Proper grounding may also help.
  4. Clean contacts (especially for RAM modules) with alcohol.
  5. Detach unneeded hardware. Reattach all cables.
  6. Try to use Leap 15.3.
  7. Try to use drivers from AMD: https://www.amd.com/en/support/graphics/radeon-rx-vega-series/radeon-rx-vega-series/radeon-rx-vega-56
  8. Possibly you need to change thermal grease and thermal pads for your graphics card.

I hope the additional power cables are attached to the graphics card.

Currently experimenting with kernel parameters that I have noted people with similar experiences on vega cards have tried (found here) and I have not experienced any problems so far. If I receive another crash I will disable bapm and then look into the modesetting driver along with providing output for the crashlog with the cmdline options. Thanks for your input!

And to Svyatko, I am currently using a UPS, and have cleaned my system thoroughly. Thank you for your input as well! I’m not sure if the drivers from AMD’s website would be too different from those that come with the kernel though. Would switching to AMDGPU-PRO make a difference?

Did you swap the PSU before the problem first surfaced, or after? If after, and you still have the 600 available, try it. A higher advertised rating doesn’t mean the 850 can’t have a defect that only surfaces as demand increases, or at random. Excess voltage can cause overheating of nearly any electronic component. It only takes one little component seeing an out of range voltage spike or drop to crash a PC. That could happen in the PSU or on the motherboard. What are the feedback ratings of those two power supplies?

Absence of any clues in any logs often suggests a hardware rather than software issue.
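
Before concluding the logs are clean, it can be worth filtering the kernel log of the boot that crashed, for example the output of journalctl -k -b -1 saved to a file (this only works if the journal is kept across reboots). The Python sketch below is just one way to do that. The two patterns are assumptions chosen for illustration and are not a complete list of what the amdgpu driver can print.

```python
import re

# Assumed, illustrative patterns: lines that mention the GPU stack ...
GPU_PATTERN = re.compile(r"amdgpu|\[drm", re.IGNORECASE)
# ... and that also contain a word hinting at trouble.
PROBLEM_PATTERN = re.compile(r"error|timeout|reset|fail", re.IGNORECASE)


def suspicious_gpu_lines(lines):
    """Return the log lines that match both the GPU and the problem pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if GPU_PATTERN.search(line) and PROBLEM_PATTERN.search(line)
    ]
```

With the log saved as a text file, passing the open file object to suspicious_gpu_lines() returns the matching lines. An empty result from the crashed boot is consistent with the hardware suspicion, although a hard lockup can also stop the kernel from writing anything at all.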

The meaning of all Svyatko wrote may not be clear. If a graphics card has a power plug, be sure to use it, and that it is receiving the required voltage while in use.

You need the AMD OpenCL drivers to use OpenCL with an AMD graphics chip.
Installing the AMD drivers may also give you newer firmware. You can also take that firmware from the AMD driver package without installing the drivers themselves.
For graphics you can choose between Mesa 3D and the AMD drivers.