ECC errors, only for Linux

I have an AMD Athlon dual core desktop that I primarily use to run OS/2. Neither FreeDOS, OS/2 nor the BIOS POST memory test reports any memory problem. I was able to install both 11.3 and Leap 42.3 on this machine without any problems during installation. However, Leap on my hard drive is now reporting ECC errors; the same thing had happened with the 11.3 install. I’m also getting messages about write errors on fd0; there’s nothing in the floppy drive, nor should there be.

Could Linux be doing something that stresses the memory more than other systems? How can I capture the log in order to post the exact error text? Thanks.

I think a specialized testing tool like memtest86+ or memtest86 will yield better results than guesswork on Linux RAM usage. Get a bootable medium with memtest on it and let it run. The newer versions do support ECC RAM and should flag ECC errors during the stress test.

Memtest used to be a boot option on openSUSE install discs but I am not sure if that is still the case. Just boot the install medium and check if it is still shipped. If not, download a bootable medium.
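If you end up downloading an image, a minimal sketch for writing it to a USB stick (the image name memtest86plus.img and the device /dev/sdX are placeholders; double-check the device name first, because dd will overwrite whatever it points at):

sudo dd if=memtest86plus.img of=/dev/sdX bs=4M status=progress && sync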

I installed memtest86+, which puts a memtest option on the grub menu. When last I looked it had run 14 hours without an error. Every time I looked it said that it was running with SMP disabled and was using only core 0.

Perhaps it would be more effective if you posted the exact message from the journal.

memtest86 tests RAM; it does not test hard drives for ECC errors. For that I’d recommend checking your hard drive’s SMART status using a tool like smartctl.
smartctl is packaged in smartmontools, so you’d need to install that package first:

zypper in smartmontools 
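Once it is installed, you can list the drives that smartctl can see (a quick sketch; the output format varies a bit between systems):

smartctl --scan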

If SMART is on, you can get information about a particular drive by running (assuming /dev/sda is the HDD reporting ECC errors; if not, replace /dev/sda with the appropriate device):

smartctl --info /dev/sda 
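If SMART turns out to be disabled on the drive, it can usually be switched on first (a sketch, again using /dev/sda as the example device):

smartctl --smart=on /dev/sda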

you can check the self-test capabilities of the drive that’s been reporting issues

smartctl -c /dev/sda
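To actually exercise the drive rather than just list its capabilities, you could start a short self-test and read the result back a few minutes later (a sketch; substitute the right device as above):

smartctl -t short /dev/sda
smartctl -l selftest /dev/sda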

Hard drives do tend to die off, and usually give a lot of CRC and ECC errors near the end of their life (I’ve had 3 drives die on me).
If your hard drive is dying you’ll usually see a large Reallocated Sectors Count in SMART.
If your hard disk is dying your only choice is to back up your data to a working device (a different HDD or a USB/DVD device).
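To eyeball the Reallocated Sectors Count (and a couple of related attributes) mentioned above, a quick sketch; attribute names can differ slightly between drive vendors:

smartctl -A /dev/sda | grep -i -E 'reallocated|pending|crc'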

@shmuelmetz:
I don’t really want to add to I_A’s reply, but I will:

  • For a summary view of a drive’s health, you should be using (with the user “root”) “smartctl --health /dev/sda” or “/dev/sdb” or “/dev/sdc” and so on (see the sketch after this list for checking all drives in one go).
  • For a complete view of a drive’s health, you should be using “smartctl --health --all /dev/sda”.
  • Ditto: “smartctl --info --all /dev/sda”.
  • Ditto: “smartctl --capabilities --all /dev/sda” or “smartctl -c -a /dev/sda”.
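Following on from the first point, a small loop (a sketch, assuming the drives show up as /dev/sda, /dev/sdb and so on) will run the summary health check across all of them in one go:

 # for d in /dev/sd[a-z]; do smartctl --health "$d"; done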

Which makes sense – a memory test application which uses SMP is conceivable but possibly quite complicated …

I would suggest that the journal entries related to ECC memory messages be checked:

 # journalctl | grep -Ev 'SECCOMP|ECC_Uncorr_Error_Count|Hardware_ECC_Recovered' | grep 'ECC'

The memtest ran for 24 hours; also, I’m concerned as to whether tests on a single core are good enough. I will extract the journal data with

journalctl | grep -Ev 'SECCOMP|ECC_Uncorr_Error_Count|Hardware_ECC_Recovered' | grep 'ECC'

Do I also need to look for cache errors?

BTW, how do I enable HTML tags so I can put things in a code block? Thanks.

The ‘grep’ filter I’ve proposed should pull out any memory ECC errors – the “-Ev” part only filters out the SECCOMP and disk-related ECC messages.
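If you also want to see the cache-related lines, a broader filter (a sketch; the exact strings depend on the kernel and mcelog versions) would be something along the lines of:

 # journalctl -k | grep -E 'Hardware Error|EDAC|mce:'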

Assuming that you’re using the Web-Browser interface to this Forum, and not the News feed, in the middle row of the rich-text formatting buttons, the button with a ‘#’ will wrap code tags around the selected text -- floating the mouse over the buttons will pop up help-text balloons.

I have not been getting messages about hard drive errors, only cache and DRAM.

… although not often, I have found that some bad RAM is able to duck through the memory tests without tossing up an error, somehow.

The most effective way I have found of testing memory is to pull a memory module, run for awhile, and see if the problem goes away.

That only works, of course, if you have more than one memory module, and if you do not have one of the “new, improved” laptops without removable, replaceable memory sticks.

If the problem remains, put the removed stick back in and pull the other one (or one of the other ones, if more than two) and repeat until either the problem disappears or you have tried all sticks.

Or, you could alternatively pull all but one stick, and then try one at a time that way; probably a better method if each stick is enough RAM to operate by itself.

AND
Code tags are inserted in the message you are writing on the forums when you click on the Number Sign/Pound Sign (middle row, 3rd from right in the icons just above the editing window).

You might also try cleaning the contacts on the memory modules; this can be done with an eraser. And blow out the slots.

Yes but, not every eraser!!! You have to be extremely careful to choose an eraser which does not contain any particles such as sand.

  • Be very careful to choose a soft, non-gritty, eraser!!!

If you ignore this advice, the penalty is that you’ll remove the very thin plating (it may even be gold) on the memory card contacts.

Once again, very carefully, and make sure that you use an extremely clean (oil-free) compressed air supply.

  • A typical workshop compressed air supply is not suited for this sort of thing.
  • Some laboratory compressed air supplies are suitable for this sort of thing.
  • For field repairs, use either a (Field Service) compressed air can «available from electronics suppliers» or, a contact cleaner (usually a spray can).

A further Field Service tip, mainly applicable to ECC memory modules but, can sometimes also help for “bad” non-ECC memory:

  • If you have more than one memory module and, no spares to hand, try swapping the modules between the slots.

Used to work wonderfully for 12-bit, 14-bit and 16-bit non-VM operating systems (the OS was normally/usually/always in low memory) but, with VM operating systems this may no longer be the case.

Regardless, if the module with the “bad” cells is physically moved to another slot, then the chances are that the operating system will be able to avoid placing user application code on those “bad” memory locations; operating systems in general usually have problems working around “bad” memory locations in their own space, but they usually do a pretty good job of managing user application code in these situations.
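If a genuinely bad page has been pinned down, Linux can also be told to fence it off at boot with the memmap= kernel parameter. Purely as a hypothetical sketch, assuming, for illustration only, a bad cell somewhere in the 4 KiB page starting at physical address 0x2a7cb000:

 memmap=4K$0x2a7cb000

added to the kernel command line (note that the ‘$’ has to be escaped if you place this in /etc/default/grub before re-running grub2-mkconfig).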

My basic issue is not that there are single-bit errors, but that they are preventing openSUSE from initializing the desktop, while not causing problems for other operating systems.

The grep gave me messages like

Jan 20 21:27:35 linux-swrl kernel: EDAC amd64: DRAM ECC enabled.
Jan 20 21:32:12 linux-swrl mcelog[1010]:   Northbridge RAM Chipkill ECC error
Jan 20 21:32:12 linux-swrl mcelog[1010]:   Chipkill ECC syndrome = d131
Jan 20 21:34:36 linux-swrl mcelog[1010]:   Northbridge RAM Chipkill ECC error
Jan 20 21:34:36 linux-swrl mcelog[1010]:   Chipkill ECC syndrome = 11c1
Jan 20 21:35:51 linux-swrl mcelog[1010]:   Northbridge RAM Chipkill ECC error
Jan 20 21:35:51 linux-swrl mcelog[1010]:   Chipkill ECC syndrome = 1602
Jan 20 21:36:29 linux-swrl mcelog[1010]:   Northbridge RAM Chipkill ECC error

but not the messages showing the relevant addresses or the messages causing Linux to go into an emergency virtual terminal instead of initializing the desktop. Also, it does not show any messages relevant to the cache. I’ll try editing the entire file and see if I can spot them. Thanks.

The journalctl output has messages for recoverable DRAM errors like

Jan 20 21:32:06 linux-swrl kernel: mce: [Hardware Error]: Machine check events logged
Jan 20 21:32:12 linux-swrl mcelog[1010]: Hardware event. This is not a software error.
Jan 20 21:32:12 linux-swrl mcelog[1010]: MCE 0
Jan 20 21:32:12 linux-swrl mcelog[1010]: CPU 0 4 northbridge
Jan 20 21:32:12 linux-swrl mcelog[1010]: MISC c00a016f00000000 ADDR 2a7cb510
Jan 20 21:32:12 linux-swrl mcelog[1010]: TIME 1516501926 Sat Jan 20 21:32:06 2018
Jan 20 21:32:12 linux-swrl mcelog[1010]:   Northbridge RAM Chipkill ECC error
Jan 20 21:32:12 linux-swrl mcelog[1010]:   Chipkill ECC syndrome = d131
Jan 20 21:32:12 linux-swrl mcelog[1010]:        bit40 = error found by scrub
Jan 20 21:32:12 linux-swrl mcelog[1010]:        bit46 = corrected ecc error
Jan 20 21:32:12 linux-swrl mcelog[1010]:        bit59 = misc error valid
Jan 20 21:32:12 linux-swrl mcelog[1010]:        bit62 = error overflow (multiple errors)
Jan 20 21:32:12 linux-swrl mcelog[1010]:   bus error 'local node response, request didn't time out
Jan 20 21:32:12 linux-swrl mcelog[1010]:              generic read mem transaction
Jan 20 21:32:12 linux-swrl mcelog[1010]:              memory access, level generic'
Jan 20 21:32:12 linux-swrl mcelog[1010]: STATUS dc18c100d1080a13 MCGSTATUS 0
Jan 20 21:32:12 linux-swrl mcelog[1010]: MCGCAP 105 APICID 0 SOCKETID 0
Jan 20 21:32:12 linux-swrl mcelog[1010]: CPUID Vendor AMD Family 15 Model 107

with the addresses 1091fbf0, 14e73230, 1f40bce0, 23175460, 27c7c560 and 2a7cb510 and messages for recoverable L3 cache errors like

Jan 20 22:35:32 linux-swrl kernel: [Hardware Error]: Corrected error, no action required.
Jan 20 22:35:33 linux-swrl kernel: [Hardware Error]: CPU:0 (f:6b:1) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc60c10011080a13
Jan 20 22:35:33 linux-swrl kernel: [Hardware Error]: Error Addr: 0x000000006eb55e90
Jan 20 22:35:33 linux-swrl kernel: [Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Jan 20 22:35:34 linux-swrl kernel: EDAC MC0: 1 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x6eb55 offset:0xe90 grain:0 syndrome:0x11c1)
Jan 20 22:35:34 linux-swrl kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Jan 20 22:35:35 linux-swrl kernel: [Hardware Error]: Corrected error, no action required.
Jan 20 22:35:35 linux-swrl kernel: [Hardware Error]: CPU:0 (f:6b:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c18c100d1080a13
Jan 20 22:35:35 linux-swrl kernel: [Hardware Error]: Error Addr: 0x000000002a7cb510
Jan 20 22:35:35 linux-swrl kernel: [Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Jan 20 22:35:35 linux-swrl kernel: EDAC MC0: 1 CE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x2a7cb offset:0x510 grain:0 syndrome:0xd131)

Jan 20 22:35:39 linux-swrl kernel: [Hardware Error]: Corrected error, no action required.
Jan 20 22:35:39 linux-swrl kernel: [Hardware Error]: CPU:0 (f:6b:1) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc01410016080a13
Jan 20 22:35:39 linux-swrl kernel: [Hardware Error]: Error Addr: 0x0000000027c7c560
Jan 20 22:35:39 linux-swrl kernel: [Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Jan 20 22:35:39 linux-swrl kernel: EDAC MC0: 1 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x27c7c offset:0x560 grain:0 syndrome:0x1602)
Jan 20 22:35:39 linux-swrl kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

These are not, however, my primary concern. When I looked at the journalctl output I saw that the journal daemon had crashed. There are multiple directories containing core.systemd-journal.0.* files, but they all have one of two names

core.systemd-journal.0.fec885af64924e25a7e095abc5e9f069.470.1518567876000000.xz
core.systemd-journal.0.fec885af64924e25a7e095abc5e9f069.11275.1518568511000000.xz

This is clearly what was keeping openSUSE from initializing normally.

Are there steps I should take before submitting a ticket? Thanks.

Submitting a ticket???

You may also have an “enough free disk space” issue; you may have to clean up “/tmp/” and “/var/tmp/” plus, carefully, everything else in “/var/”.
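A quick way to check both the free space and how much the journal itself is using (a sketch):

 # df -h / /var
 # journalctl --disk-usage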

With respect to the core dumps, with the user “root”, “coredumpctl list” will list **** ALL **** the core dumps in the systemd journal. (Calling “coredumpctl list” as a ‘normal’ user will only list those core dumps in the systemd journal related to that user.)
To purge the core dumps in the systemd journal, with the user “root” you’ll have to “vacuum” the journal with the command “journalctl --vacuum-[OPTION]=<value>”.
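For example (a sketch; 11275 is one of the PIDs visible in the core file names above, and 200M is only an illustrative size limit):

 # coredumpctl list
 # coredumpctl info 11275
 # coredumpctl dump 11275 --output=journald.core
 # journalctl --vacuum-size=200M

The extracted journald.core file is the sort of thing you could attach to a bug report.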

I have plenty of disk space. If I create a bug ticket, should I attach all of the dumps or only the most recent?

Feb 13 19:24:36 linux-swrl systemd-journald[11275]: Journal started

Feb 13 19:24:34 linux-swrl systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!

Feb 13 19:24:36 linux-swrl systemd[1]: Starting Flush Journal to Persistent Storage...

Feb 13 19:24:36 linux-swrl systemd[1]: Started Flush Journal to Persistent Storage.

Feb 13 19:24:36 linux-swrl systemd-coredump[11265]: Detected coredump of the journal daemon itself, diverted to /var/lib/systemd/coredump/core.systemd-journal.0.fec885af64924e25a7e095abc5e9f069.470.1518567876000000.xz.

Nah, de-make-up cotton pads and alcohol. Erasers are too risky IME.