Graphics card suddenly causes boot crash with mce error

Something strange and unsettling happened to me today. I woke up to my screen no longer powering back on after moving the mouse, which is not an entirely unique occurrence. I restarted and was surprised to see that right before the login screen the monitor would power itself off, and this time I was unable to do a clean shutdown by pressing the power button. It soon became apparent that the computer would stay frozen for roughly a minute, then restart itself and repeat the cycle. After one restart I was able to catch the following error message in the console:

https://i.imgur.com/zNK01Vs.jpg

I realized it must be hardware related, since I hadn’t installed any updates or changed the system configuration for over a week, and this didn’t happen yesterday on the exact same system. To confirm, I reproduced it by booting a live image: exact same behavior there. I pulled out the memory modules and tried them in sets, disconnected all hard drives, tried two different screens (HDMI and DisplayPort cables), booted two kernels (5.14 and 5.15), tried radeon vs amdgpu, and reset the CMOS via the pins… in the end the only thing that worked was removing my video card and plugging in an older one.

What makes this extremely bizarre is that I get an image right up until Linux boots: I can enter the BIOS just fine and see GRUB, and there are no GPU freezes or graphical corruption… this seems to be all Linux detecting an error and freaking out over it. All the error messages are prefixed with “mce” and, oddly enough, reference a CPU issue. The rest of my hardware works just fine, so it’s not the processor, thank god.

Does anyone know what could break in a video card that would make Linux do this? I saw a reference to an mcelog command for these errors, but like I said, the machine becomes completely inoperable after that’s printed, so I can’t issue any commands. If you can suggest further tests I’ll take a look, but please list everything I could test up front, as I don’t feel comfortable plugging and pulling the video card so often and risking damage to my motherboard (I did it twice today). If this is a hardware issue that can’t be solved from the kernel, I have no choice but to spend a large sum of money I didn’t want to spend… I figured I’d ask for help here first so I know I tried everything else.

AFAIU nvidia/nouveau drivers are only loaded after X starts (or concomitantly). Before that your GPU is using fbdev and/or another intermediate driver. That’s why you can only use, say, 4K resolution, after the desktop starts, not in BIOS or GRUB or splash screen.

So perhaps it is a driver problem (it may have been corrupted somehow) or, more probably, a hardware failure. I’d try the card in another system, with a live CD or another driver, before discarding it, however.

Which video cards?
Sounds like a video card hardware failure.

Hi
Maybe a firmware update is available?


fwupdtool get-devices
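
The check above can be taken one step further with the regular fwupd client. This is only a sketch, assuming fwupd is installed and a remote such as LVFS is configured; the wrapper name check_fw_updates is made up for illustration, and a graphics card will only be listed if fwupd has a plugin that supports it, which may well not be the case here.

```shell
#!/bin/sh
# check_fw_updates: hypothetical wrapper around standard fwupdmgr subcommands.
check_fw_updates() {
    # Bail out early if the fwupd client is not installed.
    if ! command -v fwupdmgr >/dev/null 2>&1; then
        echo "fwupdmgr not found; install fwupd first" >&2
        return 1
    fi
    fwupdmgr get-devices    # list the hardware fwupd can see
    fwupdmgr refresh        # fetch current metadata from the configured remotes
    fwupdmgr get-updates    # show any updates offered for those devices
}

# Intended use:
#   check_fw_updates
```

If the card does not appear in the device list, fwupd cannot update it and any firmware would have to come from the card vendor instead.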

Bank 5 is RAM, maybe run memtest?
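
To check which bank the kernel actually reported, the machine-check lines from the failed boot can sometimes be recovered afterwards, once the machine is running on the fallback card. A rough sketch, assuming a systemd journal with persistent storage (a hard freeze may also keep the lines from ever reaching disk); filter_mce is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# filter_mce: hypothetical helper that keeps only machine-check lines
# from a kernel log supplied on stdin.
filter_mce() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iE 'mce|machine check|hardware error' || true
}

# Intended use (assumes systemd with a persistent journal):
#   journalctl -k -b -1 | filter_mce     # kernel messages of the previous boot
```

An empty result does not prove there was no error, only that nothing was written to the journal before the freeze.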

If you now put the other graphics card back, does the error reproduce?

I haven’t had time to run memtest fully yet. I did, however, try booting with the RAM modules two at a time, and got the same issue with any combination. With another video card I have no problems; the problem only occurs when plugging in the one I had been using.

Hi
Can you say which card works and which doesn’t? A firmware update may be available for the card in question.

The previous card I’m falling back on is old enough that I forgot what it is. The one with the issue is an XFX AMD Radeon™ R9 390X.

Like I said, I don’t suspect a software problem: it worked well for at least a week after the latest updates, and I hadn’t changed any configuration when the issue began. I presume it must be a physical issue with the card, but seemingly one that just throws the Linux kernel or drivers off without breaking anything else. I tried swapping cards twice; the broken one always fails to boot while the fallback always works flawlessly.

Hi
So it needs external power. Is that all OK (as in your power supply)?

The old (broken) card has two additional power connectors, a 6-pin plus an 8-pin… the older (fallback) one has only a single 6-pin connector and works fine. The PSU cables are configurable (6-pin or 8-pin), so last time I tried swapping which cable goes into which connector; it had no effect, so I don’t suspect a bad socket. A new card is supposed to arrive soon. I needed an upgrade anyway, so I’ll see how it goes.

Hi
What’s your new card? I suspect you just need to rock on with what works. Do you have a secondary PCIe slot for the non-working card? You could plug both in and see whether it works as an offload device (assuming you have spare power).

I’m waiting for someone to deliver a ROG-STRIX-RX570-O4G soon. I think there is an extra PCIe slot it could fit in. I’d be surprised if the main slot had any issues, as it’s also where I’m running the older card; the new card should confirm this.

Hi
What I mean is, it may still be usable (if you want) as a non-primary GPU (I use an RX550 as primary and a GT1030SC as secondary, for its GPU cores). I would see if there is a firmware update for the card; it may help.