"Machine Check", Freeze

Hi!

I’ve been using OS Leap 15.2 for quite a while on an AMD Ryzen 7 1700 Eight-Core Processor.

It will sometimes report a “Machine Check” and reboot. Other times it will just “freeze”. When it “freezes”, even an LED on the keyboard will not respond. E.g., if I press “Caps Lock”, the associated LED will not light.

I have noticed these other things:

  1. WRT “Machine Check”, not always, but often, the same bank of memory is identified in the system log message. But, memtest86 has not found any errors.

  2. Leap 15.2 is treating the CPU as though it has 16 cores, rather than 8. I believe AMD claims that each core is, in some sense, multi-threaded, and I wonder, is that why it’s being treated as though it has 16 cores? Beyond that I’m concerned, could some issue with treating an 8 core CPU as though it has 16 cores, cause the problem? Perhaps some conflict, with accessing the same bank of memory? Is there some way to restrict the way the OS defines a core, to be only a physical core?

  3. Since the CPU is AMD, I’ve been somewhat surprised to see an RPM during updates, labeled as having “Intel” microcode. Could that somehow cause the problem?

Any helpful insights would be greatly appreciated!

Toes

For a few weeks now I have had an “AMD Ryzen 7 PRO 4750U with Radeon Graphics” laptop, running Tumbleweed. It is an eight core processor, each core multi-threaded, so reported as 16 CPUs.

For how long did memtest86 run? Let it run for a long period, over night. Sometimes errors only show up after a long time.
And what is the system log message?

It is completely normal that 16 cores are reported.
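
For anyone who wants to see the 8-versus-16 split on their own machine, here is a minimal sketch. It assumes a Linux system that exposes the usual sysfs topology files (`thread_siblings_list`); the function names are made up for illustration. Logical CPUs that share a core report the same sibling list, so counting distinct lists gives the number of physical cores.

```python
import glob

# Assumed sysfs location of the per-CPU SMT sibling lists on Linux.
SIBLINGS_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def read_sibling_lists(pattern=SIBLINGS_GLOB):
    """Return the raw sibling-list string of every online logical CPU."""
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists


def count_physical_cores(sibling_lists):
    """Logical CPUs on the same core share one sibling list, so count distinct lists."""
    return len({entry.strip() for entry in sibling_lists})


if __name__ == "__main__":
    siblings = read_sibling_lists()
    print("logical CPUs  :", len(siblings))
    print("physical cores:", count_physical_cores(siblings))
```

With SMT enabled on an eight-core CPU this should show 16 logical CPUs and 8 physical cores; with SMT disabled in the BIOS, both numbers should be 8.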

I’ve seen that on my system too, and it (slightly) surprised me as well. “zypper search -i intel” shows that the following packages are installed:

S | Name                  | Summary                                                     | Type
--+-----------------------+-------------------------------------------------------------+--------
i | kernel-firmware-intel | Kernel firmware files for Intel-platform device drivers     | package
i | libdrm_intel1         | Userspace interface for Kernel DRM services for Intel chips | package
i | libdrm_intel1-32bit   | Userspace interface for Kernel DRM services for Intel chips | package
i | ucode-intel           | Microcode Updates for Intel x86/x86-64 CPUs                 | package

Trying to remove libdrm_intel1 shows that 31 packages and 5 patterns would be removed, including x11, so that’s a no go. Only ucode-intel seems to be able to be removed without removing other packages. But I would not bother.

PCU,

Thanks for responding. These are the logged, machine check, messages:

Feb 27 15:57:01 dev kernel: mce: [Hardware Error]: Machine check events logged
Feb 27 15:57:01 dev kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
Feb 27 15:57:01 dev kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffaa000ae6 MISC d012000101000000 SYND 4d000000 IPID 500b000000000 
Feb 27 15:57:01 dev kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1614463016 SOCKET 0 APIC 3 microcode 8001126

The shown bank of memory, is the bank that is most often, not always identified, when there is a machine check. I’ve let memtest run for a good while, quite a few iterations. But I’ll let it run longer.
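
A note on reading those lines: as far as I know, the “Bank” in an MCE message is a machine-check bank, i.e. a set of error-reporting registers belonging to one unit inside the CPU, not a bank of RAM. Below is a minimal sketch that pulls the CPU, bank and status value out of the second log line and decodes only the architectural flag bits 57 to 63 of the status word, plus the `TIME` field. The function names are invented for illustration, and the vendor-specific bits are deliberately left undecoded; mcelog or rasdaemon are the proper tools for a full decode.

```python
import re
from datetime import datetime, timezone

# Architectural flag bits in the top of a machine-check STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}

MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")


def parse_mce_line(line):
    """Return (cpu, bank, status) from an 'mce: [Hardware Error]' line, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), int(match.group(3), 16)


def decode_status(status):
    """List the names of the flag bits (63 down to 57) that are set."""
    return [STATUS_FLAGS[bit] for bit in range(63, 56, -1) if (status >> bit) & 1]


def mce_time_utc(seconds):
    """Convert the TIME field (seconds since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    line = "mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108"
    cpu, bank, status = parse_mce_line(line)
    print("CPU", cpu, "bank", bank, "flags", decode_status(status))
    print("low 16 bits (error code):", hex(status & 0xFFFF))
    print("time:", mce_time_utc(1614463016).isoformat())
```

For the value in the log the set flags are VAL, UC, EN, MISCV, ADDRV and PCC, so the CPU reported an uncorrected error with its context marked corrupt, which fits the reboot.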

As far as the intel microcode, sorry, I’m often in too much of a hurry to accomplish something:

# rpm  -e --test libdrm_intel1-2.4.100-lp152.1.4.x86_64
error: Failed dependencies:
            libdrm_intel.so.1()(64bit) is needed by (installed) Mesa-dri-19.3.4-lp152.27.1.x86_64

which in turn clearly points to the video:

# qfl Mesa-dri-19.3.4-lp152.27.1.x86_64
/usr/lib64/dri/i915_dri.so
/usr/lib64/dri/i965_dri.so
/usr/lib64/dri/iris_dri.so
/usr/lib64/dri/kms_swrast_dri.so
/usr/lib64/dri/r200_dri.so
/usr/lib64/dri/r300_dri.so
/usr/lib64/dri/r600_dri.so
/usr/lib64/dri/radeon_dri.so
/usr/lib64/dri/radeonsi_dri.so
/usr/lib64/dri/swrast_dri.so
/usr/lib64/dri/virtio_gpu_dri.so
/usr/lib64/dri/vmwgfx_dri.so

I thought that in some version of SUSE, there used to be a visual dependency tool built into YAST. But if so, I don’t know what happened to it.

Toes

Let memtest run overnight, actually for more than 11 hours, no errors.

Check if mitigations=none instead of auto on Grub’s linu… line makes a difference. Change it temporarily by using the E key on your Grub menu selection. Also you could try some of these AMD-specific boot options.
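
After a temporary edit in the Grub menu it is easy to lose track of what the kernel was actually booted with. A minimal sketch for checking, assuming the standard `/proc/cmdline` file; the function name is made up, and the simple whitespace split does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            booted = parse_cmdline(handle.read())
    except OSError:
        booted = {}
    # Shows which mitigations= value, if any, this boot was started with.
    print("mitigations =", booted.get("mitigations", "(not set on the command line)"))
```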

================================================================================================
SORRY, I HAD A FREEZE AS I WAS EDITING THE MESSAGE, AND I BELIEVE THERE ARE LIMITS, TIME WISE, TO EDIT.
STAAART AGAIN!

mrmazda,

Thanks for your response.

I’ve sometimes said that ‘I wish there was a “Thesaurusizing” Search Engine which would “automagically” show results found using synonyms of the search terms entered’. Before I started this thread, I searched trying to find something that might help. I didn’t find anything. I did not even find similar complaints.

Some time later, I stumbled across things that let me change the search terms and find a variety of similar complaints. It may be that I just didn’t happen to find complaints relating to newer generations of CPU’s, but so far, I believe that what I’ve seen in search results seems to be with somewhat older CPU’s which have “multi-threaded cores”.

Also, although I may not have done a good job expressing it, AFAIK the freezes have not involved the “machine checks”. Generally the system will reboot in response to catching the machine checks. In case it’s an actual problem, not just something that’s caught and completely handled, I’d rather not just suppress the machine check response.

The various machine check complaints which I’ve found posted elsewhere by other folks, often involve the same location in memory that’s usually involved in what I’ve been seeing.

Strictly speaking, I guess it’s not impossible that there are a whole bunch of memory sticks that were manufactured with an error at the same location.

The machine check complaints that other folks have posted involving various Linux “distros.”, also seem to involve the same core as with the machine checks I’m seeing.

All in all, for now, I’ve disabled SMT for the CPU with the BIOS settings, to see what that might change.

Yet I’m still interested in hearing about other possibilities concerning the machine checks.

Then too, I have nothing but vagaries when it comes to the freezes. Sometimes I’ve seen all sorts of vague log messages concerning the windowing arena, in the general time frame of the freeze, but seemingly nothing very concrete. Especially since for years and years, I feel I’ve seen vague windowing related log messages, I believe stretching all the way back to when I was purchasing SUSE, not openSUSE, the “Professional” version of SUSE, in box, at a store. But any issues associated with the messages, didn’t, until somewhat recently, seem to cause any significant problems.

Any help on the machine checks or freezes, would be greatly appreciated.

toes

  • Is there more than one memory module installed and would it be possible to take one out and first use one module and later the other, to see if the freezes can be correlated with one module?
  • What is the output of “inxi -m”? (run as root, shows memory information).
  • Have you checked BIOS settings? Have memory settings been changed? Take note of current settings and switch settings to default.

Just some things that you might try.

https://en.wikipedia.org/wiki/Machine-check_exception
https://en.wikipedia.org/wiki/Machine-check_exception#Programs_to_decode_Intel_and_AMD_MCEs

https://www.kernel.org/doc/Documentation/x86/x86_64/machinecheck
https://wiki.archlinux.org/index.php/Machine-check_exception
https://www.cnx-software.com/2019/07/17/machine-check-exception-mce-errors-linux/
https://askubuntu.com/questions/605369/mce-hardware-error-machine-check-events-logged-appears-in-syslog-what-sho

https://software.opensuse.org/package/mcelog
https://software.opensuse.org/package/rasdaemon

  1. Possible memory fault:

Detach the memory module, clean its contacts with a rubber eraser, then attach it again.
Try installing the memory modules in different slots.
Check the mobo’s manual for which slots are preferable. Often you need the 4th slot (counting from the processor) for 1 memory module and slots 2 + 4 for 2 modules.

  2. Possible BIOS fault: update BIOS.
    Use default BIOS settings.

  3. Possible processor fault: update BIOS, try to use TW, change CPU.
    The first Ryzens had (maybe still have) some faults. Some processors were replaced, some could get new microcode.

  4. Possible power fault: use UPS, change PSU, detach all additional cards.

Some tests (Testmem5): https://testmem.tz.ru/soft.htm .

PCU,

Thanks for your response.

There are two memory modules. But I believe that the motherboard wishes the modules to be used in pairs. The motherboard has 4 memory slots, and if we number them 1 through 4, the tabs, etc., on # 1 and # 3 are the same color, likewise with # 2 and # 4. Also:

 # inxi -m
Memory:    RAM: total: 15.58 GiB used: 831.3 MiB (5.2%)
           Array-1: capacity: 64 GiB slots: 4 EC: None
           Device-1: DIMM 0 size: No Module Installed
           Device-2: DIMM 1 size: 8 GiB speed: 2400 MT/s
           Device-3: DIMM 0 size: No Module Installed
           Device-4: DIMM 1 size: 8 GiB speed: 2400 MT/s

Between the time there were no machine checks and the checks started, there were no BIOS setting changes, and I don’t “overclock”.

When I downloaded memtest86 through YAST, and booted it, it didn’t respond to the USB keyboard I’m using. So it ran with only CPU #0. Running multiple times overnight, it found no errors. I went to the memtest website. If I’m thinking correctly, what I downloaded from there seems to be version 9, about 4 versions newer. I set it up on a USB SSD, and booted it. I was able to have it test memory using all the CPU’s, let it run overnight, and it reported some memory errors. That’s with SMT disabled. When I ran it a second time overnight, it found no errors.

If some configuration can be changed, or code or microcode updated, and that fixes or prevents the problem, then OK. But some years ago when I bought the parts and put the machine together, even though I realize that it’s probably always sort of a “who knows?” situation, I splurged on some “protection plans” at the store where I bought the parts, and I believe they will expire before too long. So first and foremost, I’m trying to determine if there is any hardware which needs to be replaced, so I can do that before the plans expire.

svyatko,

Thanks for your response.

That’s quite a collection of information!

I’ve been tempted to make better use of the “mce” stuff. But I’ve been busy and perhaps didn’t make it enough of a priority. Thanks for mentioning it; that sort of gave me a push. I’ll make it more of a priority. Also thanks to mrmazda for mentioning GRUB configuration issues, and configurable “mce” options.

I’ve been tempted to re-seat the memory sticks just in case there are any related issues. I could be confused, but I thought that the older “Soldercoat” pins are what tended to oxidize, that the newer gold coat stuff, tended not to do so. But I guess it wouldn’t hurt to try to clean the pins, just in case. Swapping the memory sticks could be interesting too.

AFAIK, the memory sticks are in the preferred slots. I have been tempted to try the other slots; thanks for the push on that as well.

As to the BIOS, there are updates, although it seems that there’s a collection of cascaded updates, so I admit since I’ve been busy, I’ve been putting it off. Then too, it doesn’t seem that it should affect my situation, if I’m understanding correctly. Also, yeeeeeeears ago, if someone was running MS-Windows instead of Linux, it was one thing when old versions of MS-Windows effectively ran on top of MS-DOS, which in turn effectively ran on top of the BIOS. But I’m under the distinct impression that these days, when running Linux, after the boot is finished, Linux drivers/etc., have taken over from the BIOS. So it’s largely a matter of configuration. I could be missing it, but I didn’t think I was seeing anything in the reasons for the BIOS updates that should have an impact here.

I didn’t think that there was anything about problems with the CPU I’m using, when I looked on the AMD website. But of course, I could have missed it. Also, given what you’re saying about some Ryzens being replaced, and that the CPU in use shows up as “model: 0x1, stepping: 0x1” I do have to wonder if it can be replaced under some warranty/protection-plan.

As to the PSU and using a UPS, I have a PSU tester, and the machine is on a UPS. When the PSU tester is connected to the PSU, which in turn is connected to a UPS, I’ve watched repeatedly as something happens that is likely to cause a surge, such as the furnace starting. The tester detects no power problems.

I probably should mention that I’m using a Gigabyte motherboard and this: https://www.youtube.com/watch?v=I-K-Qnyu6sA claims that there might be a problem with some Ryzen 7 CPU’s combined with motherboards having stuff provided by Gigabyte.

I should also mention that I have no extra cards installed.

Hello,

I do not think that this is a Linux (kernel) problem.
What MB (BIOS) do you have?
Do you use XMP (A-XMP)?

another_roadrunner,

Thanks for your response.

The MB is a Gigabyte AB350M. No XMP, no overclocking of any form.

Hello,

I would try to stick with Agesa 1.0.0.4 and BIOS default settings.
It seems that this problem appeared (again) after AMD launched the Ryzen XT CPUs (BIOS issues).
I had this problem too and it seems that Agesa 1.2.0.0 fixed it. So far …

If it helps you to know that you are not alone:
https://community.amd.com/t5/general-discussions/ryzen-instability-mce-bea0000000000108-what-do-do-next/td-p/73269
https://bugzilla.kernel.org/show_bug.cgi?id=206903
https://ask.fedoraproject.org/t/fedora-is-very-unstable-on-my-computer-help/9230

https://www.phoronix.com/scan.php?page=article&item=new-ryzen-fixed&num=1

Some time ago I installed the 1.0.0.4 BIOS update for the motherboard which I’ve been using all along. For each BIOS update, there appear to be multiple designations. And there is a BIOS update which seems to be designated something such as 1004, but another which is 1.0.0.4. So it can seem a little confusing. Buuuut, I’ve been using full blown SMT with the 1.0.0.4 BIOS update for quite some time with ZERO machine checks.

As far as the machine ( or at least the GUI environment ) freezing, I’m getting the impression that after some collection of OS updates, it can get worse, or better. So naturally that gives the impression that what seem like somewhat vague complaints in some of the GUI related logging files, relate to a software problem.

One of these days I’m going to try to get into the machine remotely while it seems frozen, to try to see if it is the GUI environment, or the OS as a whole, which is frozen.
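
A minimal sketch of that remote check, to be run from a second machine on the same network. It assumes the frozen machine normally has an SSH server listening on port 22; the address in the example is hypothetical and the function name is made up. A completed TCP connection is handled by the kernel's network stack, so a `True` result would suggest the kernel is still alive and only the GUI (or user space) is stuck, while a timeout would point towards the whole OS being frozen.

```python
import socket


def port_answers(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical LAN address of the machine that appears frozen.
    frozen_host = "192.168.1.50"
    if port_answers(frozen_host):
        print("TCP port 22 still answers: the kernel is probably alive")
    else:
        print("no answer on port 22: the whole system may be hung")
```

If the port answers, logging in over SSH and looking at the journal while the screen is still frozen would be the natural next step.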

Thanks for all the help!!!