find bad DIMM in a 32-DIMM box?

Dear openSUSE users: I am looking for some advice… Does anyone have experience with finding a single bad DIMM? I have 32 DIMMs in my 4-socket Magny-Cours Opteron computer, and am getting memory errors about once a day. They are always corrected, but it’s been getting worse, so I have to locate and replace that DIMM before I get crashes. (Kingston memory, openSUSE 12.3 x64) Is there a good method under openSUSE, or is memtest86 useful for this?

Thank You,
patti

Edit: This is the type of console notification I receive:

Message from syslogd@OS121-TY3 at Oct 12 20:41:56 ...
 kernel:[283394.737807] [Hardware Error]: CPU:30        MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c0441005d080a13

Message from syslogd@OS121-TY3 at Oct 12 20:41:56 ...
 kernel:[283394.737820] [Hardware Error]:       MC4_ADDR: 0x00000029fe928820

Message from syslogd@OS121-TY3 at Oct 12 20:41:56 ...
 kernel:[283394.737824] [Hardware Error]: Northbridge Error (node 5): DRAM ECC error detected on the NB.

Message from syslogd@OS121-TY3 at Oct 12 20:41:56 ...
 kernel:[283394.737845] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
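
The node number in these messages already narrows the search, so it can be worth tallying it across many errors before pulling any hardware. Below is a minimal Python sketch that parses messages of this shape, counts errors per reported node, and decodes a few bits of the status word. The helper names (`summarize`, `status_flags`, `guess_node`) are made up for illustration. The bit positions used are the standard MCA status bits plus the AMD ECC bits at 46:45 as I understand them. `guess_node` assumes 256 GiB split evenly and contiguously over 8 nodes (two per Magny-Cours socket) with no interleaving and no memory hole, which is only a rough sanity check; the node the kernel prints is the value to trust.

```python
import re
from collections import Counter

# Sample text copied from the console messages above (syslogd headers dropped).
LOG = """\
kernel:[283394.737807] [Hardware Error]: CPU:30 MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c0441005d080a13
kernel:[283394.737820] [Hardware Error]: MC4_ADDR: 0x00000029fe928820
kernel:[283394.737824] [Hardware Error]: Northbridge Error (node 5): DRAM ECC error detected on the NB.
"""

NODE_RE = re.compile(r"Northbridge Error \(node (\d+)\)")
ADDR_RE = re.compile(r"MC4_ADDR: (0x[0-9a-fA-F]+)")
STATUS_RE = re.compile(r"MC4_STATUS\S*: (0x[0-9a-fA-F]+)")


def summarize(text):
    """Return (errors per node, list of addresses, list of status words)."""
    nodes = Counter(int(n) for n in NODE_RE.findall(text))
    addrs = [int(a, 16) for a in ADDR_RE.findall(text)]
    statuses = [int(s, 16) for s in STATUS_RE.findall(text)]
    return nodes, addrs, statuses


def status_flags(status):
    """Decode a few MCi_STATUS bits (bit positions are my assumption)."""
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_ecc": ((status >> 45) & 0x3) == 2,
    }


def guess_node(addr, total_gib=256, node_count=8):
    """Hypothetical even-split guess; ignores interleaving and memory holes."""
    per_node = (total_gib * 2**30) // node_count
    return addr // per_node


if __name__ == "__main__":
    nodes, addrs, statuses = summarize(LOG)
    print("errors per node:", dict(nodes))
    for a in addrs:
        print(f"address {a:#x} is about {a / 2**30:.1f} GiB, guessed node {guess_node(a)}")
    for s in statuses:
        print(status_flags(s))
```

If every error over several days reports the same node, then with 32 DIMMs spread over 8 nodes (again an assumption about how this board is populated) the suspects shrink to the handful of DIMMs attached to that node.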

Hi
Disable the ECC correction in your BIOS and run memtest on it… You need to disable it, else it will fix the error since it’s ECC RAM…

Thanks, Malcom - I didn’t think to disable ECC. I guess I’ll give that a shot.

Patti

PattiMichelle wrote:

>> Disable the ECC correction in your BIOS and run memtest on it… You need
>> to disable it, else it will fix the error since it’s ECC RAM…
>
> Thanks, Malcom - I didn’t think to disable ECC. I guess I’ll give that
> a shot.
>

One trick that might help with that intermittent failure is to turn a hair
dryer on the memory bank(s) while running memtest. I was always running
into a problem where it took a relatively long time before the error would
begin to show up, because the whole thing cooled down every time I shut down
to pull/change a module. Just get it noticeably warmer than usual - no need
to desolder the chips ;-)

Another trick is to see if you can get the failure with only 4 GB of RAM
plugged in - memtest can take a loooong time to run with the whole 32 GB in
the array!
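
Testing with fewer modules installed generalizes to halving: test one half of the suspect DIMMs, keep whichever half still fails, and repeat. Here is a small Python sketch of that idea. It assumes exactly one bad DIMM, that the fault follows the DIMM rather than the slot, and that each test run reproduces the error reliably. The `fails` callback is hypothetical and stands in for "install only these DIMMs and run memtest". The board's DIMM population rules may also limit which subsets can actually be installed.

```python
import math


def rounds_needed(dimm_count):
    """Worst-case number of halving rounds to isolate one bad DIMM."""
    return math.ceil(math.log2(dimm_count))


def bisect_bad_dimm(dimms, fails):
    """Narrow `dimms` down to the one bad module.

    `fails(subset)` is a hypothetical callback that returns True when a
    memtest run with only `subset` installed shows errors.
    """
    candidates = list(dimms)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep the half that reproduces the error; otherwise the bad
        # module must be in the other half.
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    # Simulated example: pretend DIMM 13 of 32 is the bad one.
    print(rounds_needed(32))
    print(bisect_bad_dimm(range(32), lambda subset: 13 in subset))
```

For 32 DIMMs that is at most 5 test rounds, and fewer if the error messages have already pointed at one node's modules.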

Thanks for the suggestions. I turned ECC off in BIOS. I have 256GB (four banks of 64GB, one bank per socket) on this Magny-Cours machine. I tried two versions of memtest (including v5, which wouldn’t work), but the openSUSE 12.3 default memtest seems to be running. If I tell it to probe the memory, it quickly throws errors and then screens full of garbage. It’s running a test right now, but I don’t really think memtest supports 48 CPUs - so it’s not going to help me locate the bad memory stick, will it?

Hah! I remember when I kept an old '386 computer running by blowing an inverted aeroduster can (cold freeze) onto a memory stick to keep it cool. Unfortunately, this Supermicro 2U mobo has a plastic lid over passive finned heatsinks that rely on a rack of case fans for cooling, so it won’t give me access to the individual memory sticks while the system is on.

It would help to know if memtest is actually testing everything… It seems to cycle only through 64GB (even though it displays maxmem = 256G) - maybe it’s only testing one CPU’s bank of memory? Maybe it’s not doing all 4 sockets? Guess I’ll find out in a few hours.

Patti

Ok, Memtest v5 seems to be testing all memory, even though it won’t do SMP mode… (I think it only supports 2 sockets in SMP)