Page 1 of 3
Results 1 to 10 of 24

Thread: ECC errors, only for Linux

  1. #1
    Join Date
    Jan 2018
    Location
    Annandale, VA
    Posts
    234

    ECC errors, only for Linux

    I have an AMD Athlon dual core desktop that I primarily use to run OS/2. Neither FreeDOS, OS/2 nor the BIOS POST memory test reports any memory problem. I was able to install both 11.3 and LEAP 42.3 on this machine without any problems during installation. However, LEAP on my hard drive is now reporting ECC errors; the same thing had happened with the 11.3 install. I'm also getting messages about write errors on fd0; there's nothing in the floppy drive, nor should there be.

    Could Linux be doing something that stresses the memory more than other systems? How can I capture the log in order to post the exact error text? Thanks.

  2. #2

    Re: ECC errors, only for Linux

    I think a specialized testing tool like memtest86+ or memtest86 will yield better results than guesswork on Linux RAM usage. Get a bootable medium with memtest on it and let it run. The newer versions do support ECC RAM and should flag ECC errors during the stress test.

    Memtest used to be a boot option on openSUSE install discs, but I am not sure if that is still the case. Just boot the install medium and check if it is still shipped. If not, download a bootable medium.

  3. #3
    Join Date
    Jan 2018
    Location
    Annandale, VA
    Posts
    234

    Re: ECC errors, only for Linux

    I installed memtest86+, which puts a memtest option on the grub menu. When last I looked it had run 14 hours without an error. Every time I looked it said that it was running with SMP disabled and was using only core 0.

  4. #4
    Join Date
    Feb 2016
    Location
    Berlin
    Posts
    357

    Re: ECC errors, only for Linux

    It would perhaps be more effective if you posted the exact message from the journal.
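
    One way to capture the text for posting, as a minimal sketch: it assumes the machine logs to the systemd journal and that the command is run as root. The function name and the default file name are made up for illustration. `-k` restricts the output to kernel messages and `-b` to the current boot.

```shell
# Save the kernel messages of the current boot to a text file.
# "kernel-log.txt" is a hypothetical default; pass another path as $1.
capture_kernel_log() {
    out="${1:-kernel-log.txt}"
    journalctl --no-pager -k -b > "$out"
}
```

    Calling `capture_kernel_log /tmp/kernel-log.txt` leaves a plain-text file whose relevant lines can be pasted into a reply.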

  5. #5
    Join Date
    Sep 2008
    Posts
    2,997

    Re: ECC errors, only for Linux

    memtest86 tests RAM; it does not test hard drives for ECC errors. For that, I'd recommend checking your hard drive's SMART status using a tool like smartctl.
    smartctl is packaged in smartmontools, so you'd need to install that package first:
    Code:
    zypper in smartmontools
    If SMART is on, you can get information about a particular drive by running the following (assuming /dev/sda is the HDD reporting ECC errors; if not, replace /dev/sda with the appropriate device):
    Code:
    smartctl --info /dev/sda
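    You can run a short self-test on the drive that's been reporting issues and, once it has finished, read the result from the self-test log ("smartctl -c" only lists the drive's capabilities):
    Code:
    smartctl -t short /dev/sda
    smartctl -l selftest /dev/sda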

    Hard drives do tend to die off, and they usually give a lot of CRC and ECC errors near the end of their life (I've had 3 drives die on me).
    If your hard drive is dying, you'll usually see a large Reallocated Sectors Count in SMART.
    If your hard disk is dying, your only choice is to back up your data to a working device (a different HDD or a USB/DVD device).
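
    To pull just the reallocated-sector figure out of the attribute table, here is a small sketch. It assumes the drive reports the attribute under the name Reallocated_Sector_Ct and that the raw value is the tenth column of the `smartctl -A` table, which is the usual layout for ATA drives but is not guaranteed for every vendor. The function name is made up.

```shell
# Read `smartctl -A` output on stdin and print the raw value of the
# Reallocated_Sector_Ct attribute (column 10 of the attribute table).
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage, as root (/dev/sda is an assumption, as in the post above):
#   smartctl -A /dev/sda | reallocated_sectors
```

    A value that is large, or that keeps growing between runs, is the warning sign described above.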

  6. #6
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    4,567

    Re: ECC errors, only for Linux

    @shmuelmetz:
    I don't really want to add to I_A's reply but, I will:
    • For a summary view of a drive's health, you should be using (as the user "root") "smartctl --health /dev/sda" or "/dev/sdb" or "/dev/sdc" and so on.
    • For a complete view of a drive's health, you should be using "smartctl --health --all /dev/sda".
    • Ditto: "smartctl --info --all /dev/sda".
    • Ditto: "smartctl --capabilities --all /dev/sda" or "smartctl -c -a /dev/sda".

  7. #7
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    4,567

    Re: ECC errors, only for Linux

    Originally posted by shmuelmetz:
    "I installed memtest86+, which puts a memtest option on the grub menu. When last I looked it had run 14 hours without an error. Every time I looked it said that it was running with SMP disabled and was using only core 0."
    Which makes sense -- a memory test application which uses SMP is conceivable but possibly quite complicated …

    I would suggest that the journal entries related to ECC memory messages be checked:
    Code:
     # journalctl | grep -Ev 'SECCOMP|ECC_Uncorr_Error_Count|Hardware_ECC_Recovered' | grep 'ECC'
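
    The same two-stage filter can be wrapped in a function so that it can be tried on any text, not only on live journal output. This is only a sketch: the function name is made up, and the patterns are exactly the ones from the command above.

```shell
# Stage 1 drops lines mentioning SECCOMP or the two SMART disk attributes;
# stage 2 keeps only the remaining lines that contain "ECC" (case-sensitive).
ecc_filter() {
    grep -Ev 'SECCOMP|ECC_Uncorr_Error_Count|Hardware_ECC_Recovered' | grep 'ECC'
}

# Usage, as root:
#   journalctl --no-pager | ecc_filter
```

    Because the second grep is case-sensitive and matches only "ECC", lines that describe an error without that string are not shown.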

  8. #8
    Join Date
    Jan 2018
    Location
    Annandale, VA
    Posts
    234

    Re: ECC errors, only for Linux

    The memtest ran for 24 hours, but I'm concerned as to whether tests on a single core are good enough. I will extract the journal data with

    journalctl | grep -Ev 'SECCOMP|ECC_Uncorr_Error_Count|Hardware_ECC_Recovered' | grep 'ECC'

    Do I also need to look for cache errors?

    BTW, how do I enable HTML tags so I can put things in a code block? Thanks.

  9. #9
    Join Date
    Feb 2010
    Location
    Germany
    Posts
    4,567

    Re: ECC errors, only for Linux

    Originally posted by shmuelmetz:
    "Do I also need to look for cache errors?"
    The 'grep' filter I've proposed should pull out any memory ECC errors -- the "-Ev" stage only removes SECCOMP lines and disk-related ECC messages.

    Originally posted by shmuelmetz:
    "BTW, how do I enable HTML tags so I can put things in a code block? Thanks."
    Assuming that you're using the Web-Browser interface to this Forum, and not the News feed: in the middle row of the rich-text formatting buttons, the button with a '#' will wrap [CODE] tags around the selected text -- hovering the mouse over the buttons will pop up help text balloons.

  10. #10
    Join Date
    Jan 2018
    Location
    Annandale, VA
    Posts
    234

    Re: ECC errors, only for Linux

    I have not been getting messages about hard drive errors, only cache and DRAM.
