mce_notify_irq: 2 callbacks suppressed followed by system forced reboot - Hardware Error

I’ve had this problem where my laptop reboots itself when playing games. At first I thought it was an overheating issue, because it seems to happen when multiple applications are running in the background, but they hardly use any of the CPU. After checking the logs and a few Google searches, I found mentions of this from as far back as kernel 4.11. Maybe it’s a long-standing kernel bug? I’m wondering if it’s a power distribution problem where the integrated GPU draws too much power and the CPU dies.

Laptop is a Lenovo Z13.


lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 PRO 6860Z with Radeon Graphics
    CPU family:          25
    Model:               68
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            1
    Frequency boost:     enabled
    CPU max MHz:         4768.0000
    CPU min MHz:         400.0000



Anyway, before the forced reboot, “journalctl -b -1” shows the following. The “MCE #” notifications repeat continuously without issue until it eventually snaps with “callbacks suppressed”, followed by the hardware errors below.


mcelog[1200]: MCE 3
mcelog[1200]: CPU 0 BANK 18
mcelog[1200]: MISC d01a000001000000 ADDR 1f7cee380
mcelog[1200]: TIME 1661086364 Sun Aug 21 21:52:44 2022
mcelog[1200]: STATUS dc204000000c011b MCGSTATUS 0
mcelog[1200]: MCGCAP 119 APICID 0 SOCKETID 0
mcelog[1200]: MICROCODE a404102
mcelog[1200]: CPUID Vendor AMD Family 25 Model 4 Step 1
kernel: mce_notify_irq: 2 callbacks suppressed
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 15: dc204000000c011b
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ff8014c0 MISC d01a000001000000 SYND 1ff0a240700 IPID 9600050f00 
kernel: mce: [Hardware Error]: PROCESSOR 2:a40f41 TIME 1661086364 SOCKET 0 APIC 0 microcode a404102
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 16: dc204000000c011b
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ff801400 MISC d01a000001000000 SYND 1ff0a240700 IPID 9600150f00 
kernel: mce: [Hardware Error]: PROCESSOR 2:a40f41 TIME 1661086364 SOCKET 0 APIC 0 microcode a404102
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: dc204000000c011b
kernel: mce: [Hardware Error]: TSC 0 ADDR 1f7cecb00 MISC d01a000001000000 SYND 1ff0a240701 IPID 9600250f00 
kernel: mce: [Hardware Error]: PROCESSOR 2:a40f41 TIME 1661086364 SOCKET 0 APIC 0 microcode a404102
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 18: dc204000000c011b
kernel: mce: [Hardware Error]: TSC 0 ADDR 1f7cee380 MISC d01a000001000000 SYND 1ff0a240701 IPID 9600350f00 
kernel: mce: [Hardware Error]: PROCESSOR 2:a40f41 TIME 1661086364 SOCKET 0 APIC 0 microcode a404102
mcelog[1200]: Hardware event. This is not a software error.
mcelog[1200]: MCE 0
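
For anyone trying to read the STATUS value in the log above, here is a minimal Python sketch that splits dc204000000c011b into its flag bits. It assumes the bit positions defined in the Linux kernel's arch/x86/include/asm/mce.h (including the AMD-specific CECC, UECC, DEFERRED, POISON, TCC and SYNDV bits) and takes the extended error code from bits 21:16. The function name decode_mca_status is made up for illustration, and this is not a replacement for a proper decoder.

```python
# Bit positions assumed from the Linux kernel header arch/x86/include/asm/mce.h.
STATUS_BITS = {
    "VAL": 63,       # register contains a valid error
    "OVER": 62,      # an earlier error in this bank was overwritten
    "UC": 61,        # uncorrected error
    "EN": 60,        # error reporting enabled
    "MISCV": 59,     # MISC register is valid
    "ADDRV": 58,     # ADDR register is valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt (AMD)
    "SYNDV": 53,     # syndrome register is valid (AMD)
    "CECC": 46,      # correctable ECC error (AMD)
    "UECC": 45,      # uncorrectable ECC error (AMD)
    "DEFERRED": 44,  # deferred error (AMD)
    "POISON": 43,    # poisoned data consumed (AMD)
}


def decode_mca_status(status: int) -> dict:
    """Return the flag bits and error codes of an MCA STATUS value."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    decoded["error_code"] = status & 0xFFFF            # bits 15:0
    decoded["error_code_ext"] = (status >> 16) & 0x3F  # bits 21:16 (assumed width)
    return decoded


if __name__ == "__main__":
    decoded = decode_mca_status(0xDC204000000C011B)
    set_flags = [name for name, value in decoded.items() if value is True]
    print("flags set:", ", ".join(set_flags))
    print(f"error code: {decoded['error_code']:#06x}, "
          f"extended: {decoded['error_code_ext']:#x}")
```

Under those assumptions, the value has VAL, OVER, EN, MISCV, ADDRV, SYNDV and CECC set, while UC, PCC and UECC are clear. That would suggest the logged events are corrected ECC errors, with OVER indicating that at least one earlier error in the same bank was overwritten. It does not by itself explain the forced reboot.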

This thread, https://forum.manjaro.org/t/system-auto-rebooted-mce-hardware-error-in-dmesg-related-to-cpu/89580 , appears to describe the same issue on other high-spec Ryzen CPUs (5900X, 5950X).

It might be worth trying mce options from https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt .

Which ones and what would it achieve?