advanced technical question - system freeze - diagnose hardware - memtest

I’ve been getting some system freezes when running blender on a new system (using about 1/2 my 8GB memory, AMD Phenom II x6 1090T + default AMD CPU cooler, ECS A880LM-M motherboard, Corsair 1333 DDR3, 2x4GB). Power supply voltages (measured off a MOLEX plug) are 12.05VDC and 5.2VDC. I have tried resetting the BIOS to its defaults.

I’m trying to narrow down the problem with the MEMTEST v4.0. It gets through all tests fine until test #7 (block move) and starts throwing a ton of errors, and about 2/3 of the way through that test it says “unexpected interrupt, halting CPU0.” All the listings say “28dbbd11” (this varies from run to run, except the ‘Stack:’ listing seems to count up from that number).

I’ve tried unplugging one memory module and then swapping them as well as their positions. Same result always: block moves throw errors, followed by halting CPU0. To me this suggests the motherboard or the Phenom II. But I don’t know how to test these. Can someone give the next step in diagnosis? I did find this link, but I’m not sure how it differentiates between my error and something specific to either the MOBO or CPU:
http://www.sharkyforums.com/showthread.php?p=2737270

THANKS!
PattiMichelle
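
As a quick sanity check on the supply voltages quoted in the opening post, here is a minimal sketch that compares a reading against the ±5% tolerance the ATX specification allows on the +12 V and +5 V rails. The function names are made up for illustration, and a multimeter reading on an idle MOLEX plug says little about how the rail behaves under load.

```python
# ATX allows +/-5% on the +12 V and +5 V rails.
TOLERANCE = 0.05


def rail_in_spec(nominal: float, measured: float, tolerance: float = TOLERANCE) -> bool:
    """Return True if the measured voltage is within nominal +/- tolerance."""
    return abs(measured - nominal) <= nominal * tolerance


def rail_report(readings: dict[float, float]) -> dict[float, bool]:
    """Map each nominal rail voltage to whether its reading is in spec."""
    return {nominal: rail_in_spec(nominal, measured)
            for nominal, measured in readings.items()}


if __name__ == "__main__":
    # Readings from the opening post: nominal rail -> measured value.
    readings = {12.0: 12.05, 5.0: 5.2}
    for nominal, ok in rail_report(readings).items():
        status = "in spec" if ok else "OUT OF SPEC"
        print(f"+{nominal:g} V rail: {readings[nominal]} V, {status}")
```

Both readings fall inside the allowed window (11.4 to 12.6 V and 4.75 to 5.25 V), although 5.2 V sits near the upper edge.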

It is frustrating to get such a problem. Let me say that the usual cause of this is bad or incompatible memory. A new motherboard from ASUS or GIGABYTE is another choice and perhaps less costly than buying new memory. For memory incompatibility, a BIOS update could fix the issue OR, if memory errors strike during the flash, brick the motherboard, which is bad. While anything is possible, it is doubtful the CPU is bad. I would check ECS’s list of compatible memory and consider buying from it. If purchasing new memory is out of the question, consider a BIOS update if a newer one exists, then reset the motherboard BIOS to its safest or most compatible settings (something you could even try right now). In the end, you have to take some action. The testing says memory, plain and simple, but incompatible memory is always possible.

Thank You,

Thank you very much for your time and expertise! I ran IntelBurnTest for some time with no problems (under Vista x64) - so it’s likely, as you say, an inadequately designed MOBO. I checked the memory specs as far as I could (DDR3 1333MHz) and all seems well… I suppose I’ll try a BIOS update, then get an ASUS motherboard. I did notice that the manual mentions:

Supports DDR3 1600(OC)/1333/1066 DDR SDRAM with dual-channel architecture
Accommodates two unbuffered DIMMs

I have:
Corsair CMX4GX3M1A1333C9 XMS3 4GB DDR3 RAM. This Corsair CMX4GX3M1A1333C9 XMS3 4GB DDR3 RAM runs at 1333MHz at CAS latency of 9-9-9-24

I noticed the BIOS reported this CAS latency correctly. I was thinking that I may need “dual channel” chips rather than two of these (don’t really understand this part) - but it didn’t matter whether I used one or both of these chips when I ran MEMTEST, and it didn’t matter whether I ran them “ganged” (128bit) or “unganged” (64bit) as set in BIOS.

Could this be a MEMTEST problem? (i.e., a Win memory-access code vs Linux memory-access code?)
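
For readers wondering what the 9-9-9-24 timings above mean in absolute terms, the following is a small worked calculation. It relies only on the general DDR convention that the data rate is twice the I/O clock; the function names are illustrative and nothing here comes from the ECS or Corsair documentation.

```python
def cycle_time_ns(data_rate_mts: float) -> float:
    """Clock period in ns. DDR transfers twice per clock, so clock = rate / 2."""
    clock_mhz = data_rate_mts / 2.0
    return 1000.0 / clock_mhz


def timing_ns(cycles: int, data_rate_mts: float) -> float:
    """Convert a timing given in clock cycles (e.g. CL9) to nanoseconds."""
    return cycles * cycle_time_ns(data_rate_mts)


def peak_bandwidth_gbs(data_rate_mts: float, bus_bits: int = 64) -> float:
    """Theoretical peak bandwidth in GB/s for a bus of the given width."""
    return data_rate_mts * 1e6 * (bus_bits / 8) / 1e9


if __name__ == "__main__":
    rate = 1333  # DDR3-1333, as in the post above
    for name, cycles in zip(("CL", "tRCD", "tRP", "tRAS"), (9, 9, 9, 24)):
        print(f"{name:5s} {cycles:2d} cycles = {timing_ns(cycles, rate):.2f} ns")
    print(f"64-bit channel:  {peak_bandwidth_gbs(rate, 64):.2f} GB/s")
    print(f"128-bit (ganged): {peak_bandwidth_gbs(rate, 128):.2f} GB/s")
```

CL9 at DDR3-1333 works out to roughly 13.5 ns. One 128-bit ganged channel and two 64-bit unganged channels have the same theoretical peak, which fits the observation that the setting made no difference to the errors.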

Dual channel refers to how the motherboard chipset addresses the memory, and it determines whether memory should be installed in pairs or in threes, as on Intel i7 9xx CPU chipsets. Dual channel memory would be cheaper than needing to buy three memory modules at a time. Most motherboards are dual channel, which is more popular. You need to look at the right type, like DDR2 (older) or DDR3 (newer), and the speed, like 1600 or 1333 MHz, and try to pair like memory together. As for ECS motherboards, I have used more than one. Most worked OK, but a few had odd issues, more so than with any other brand, though MSI might be just slightly behind in the bad department. Any company can produce a defective product, but generally I have had the best luck with ASUS and Gigabyte. I think ASUS is generally faster while Gigabyte might be more reliable, but a lot depends on the motherboard chipset, it would seem. As for MEMTEST, I doubt it is the problem unless it is the only program that thinks there is a problem, and even then one would wonder if there might indeed be an issue there.

Thank You,

There is nothing OS specific about memtest. It’s a standalone program that runs without an OS, hence the need to reboot when it’s terminated.

Thank you again - a little clarification - I bought two DDR3 1333MHz chips (rather than a pair sold as “dual channel”). I’ve been unable to find a definitive statement that you need to buy memory in pairs with part numbers that refer to dual channel. The MOBO website and manual are **very** light on details here.

(But I guess this is actually beside the point for my problem since the same error occurs when the chips are plugged in individually or as a pair…)

Again, dual channel refers to how the motherboard chipset uses the memory, and it requires that memory be installed in pairs of modules. Memory, of course, can carry a dual channel rating, meaning it has been tested for use in that manner, found to work, and is thus sold as dual channel memory. In the end, you don’t know for sure if it’s memory related or motherboard related. Since you can only replace one or the other, as they are not repairable, you have to make a choice about how to overcome the problem. This is one case where a computer shop can help. Even at our local Fry’s, for a mere $25 and the purchase of new memory, one can have it tried out with an existing motherboard to see if it will work. If not, one could exchange the memory for a new motherboard and try again, I guess for an added $25, to see if the motherboard were bad. Since it is just a matter of time before I could use the new memory in a different computer, I often just buy more memory, though I might upgrade from 1333 MHz to 1600 MHz at the same time. If the new memory works no better, it is time for a new motherboard.

Thank You,

Thank you - I was not sure if “dual channel” memory was physically different from “single channel” and hence not interchangeable - but there appears from your comments to be little real difference. I think I’m going to start with a new Gigabyte motherboard.

I wish you good luck and ask that you return and tell us of your success. Gigabyte is a good choice and very stable, that is for sure. I went back through the documents for the last eight motherboards I have owned and Gigabyte wins 5 to 3 (two were ASUS and one MSI). All the Gigabyte boards have worked like a champ for me, though I admit to upgrading often and selling what I don’t use anymore. On the other hand, I think I have had a good enough sampling to say Gigabyte works well. My number one fastest PC is an ASUS, but the Intel P67 chipset is most likely the reason.

Thank You,

it’s very difficult to diagnose these kinds of problems, but in my experience the most likely culprit is heat… the cpu is overheating and leading to a checksum/parity error…which brings everything to a halt.

before i did anything drastic, i’d try to reduce the cpu frequency say… 10-15% and try the same tests… this would be a “go-nogo” indicator, if you installed lmsensors and monitored the cpu temp while crunching with blender, you could get some idea if the cpu is overheating.

it may be as easy as breaking the thermal contact between cpu and heatsink and performing maintenance (clean and re-apply thermal paste).

libsensors … my error… it’s been a long day. :)
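
For anyone who wants to follow the temperature suggestion above without installing anything extra, here is a minimal sketch that reads the Linux kernel's hwmon sysfs files directly. It assumes the relevant sensor driver is loaded and exposes `temp*_input` files (values in millidegrees Celsius); which file corresponds to the CPU varies by board, so the paths it prints need to be interpreted by hand.

```python
import glob


def millideg_to_c(raw: str) -> float:
    """hwmon temp*_input files hold millidegrees Celsius as text."""
    return int(raw.strip()) / 1000.0


def read_temps(pattern: str = "/sys/class/hwmon/hwmon*/temp*_input") -> dict[str, float]:
    """Return {sysfs path: temperature in C} for every readable sensor."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                temps[path] = millideg_to_c(fh.read())
        except (OSError, ValueError):
            continue  # sensor not readable or not reporting; skip it
    return temps


def hottest(temps: dict[str, float]):
    """Highest reading, or None if no sensors were found."""
    return max(temps.values()) if temps else None


if __name__ == "__main__":
    temps = read_temps()
    for path, value in temps.items():
        print(f"{value:6.1f} C  {path}")
    print("hottest:", hottest(temps))
```

Saved under a hypothetical name such as `temps.py`, it can be run repeatedly with `watch -n 2 python3 temps.py` while blender is crunching, to see whether the freeze coincides with a climbing reading.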

This is excellent advice, I’d certainly try this before changing the motherboard.

Regarding dual-channel, AFAIK there is no special requirement for that except that the memory cards should have the same speed and timings, and some memory controllers check if the memory cards are identical to guarantee performance, while rare others even work with different chips (different speed/timings) but at the lowest common denominator. In practice it became habitual to recommend two identical memory cards, same manufacturer and all, and that’s what most mobos expect today.
I think I read somewhere about triple-channel memory controllers (three identical memory cards), but I’m not sure if it already exists or is just some technology report.
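
To make the "lowest common denominator" idea above concrete, here is a rough sketch of how two mismatched modules could be run at shared settings: take the slower data rate, then, for each timing, keep the longer requirement in nanoseconds and round it up to whole clock cycles at that rate. This is a simplification for illustration only. Real memory controllers read each module's SPD data, and the `Module` class and `common_settings` function are invented names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    speed_mts: int   # rated data rate, e.g. 1333 for DDR3-1333
    timings: tuple   # (CL, tRCD, tRP, tRAS) in clock cycles at that rate


def cycle_ns(speed_mts: float) -> float:
    """Clock period in ns; the clock runs at half the data rate."""
    return 2000.0 / speed_mts


def common_settings(a: Module, b: Module) -> Module:
    """Slowest speed, plus timings that satisfy both modules at that speed."""
    speed = min(a.speed_mts, b.speed_mts)
    period = cycle_ns(speed)
    timings = []
    for cycles_a, cycles_b in zip(a.timings, b.timings):
        # Each module needs at least this much time, in ns, for this parameter.
        need_ns = max(cycles_a * cycle_ns(a.speed_mts),
                      cycles_b * cycle_ns(b.speed_mts))
        # Round up to whole cycles; the epsilon guards against float noise.
        timings.append(math.ceil(need_ns / period - 1e-9))
    return Module(speed, tuple(timings))


if __name__ == "__main__":
    corsair = Module(1333, (9, 9, 9, 24))   # the modules from this thread
    slower = Module(1066, (7, 7, 7, 20))    # hypothetical second module
    print(common_settings(corsair, slower))
```

Two identical modules simply come back unchanged, which is why matched pairs are the easy case.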

I think I’ve seen triple channel too - I’m not sure it’s not sales-hype from tech stores. I bought a higher-end CPU cooler and will try that before swapping the MOBO (waiting for it to arrive). The interesting thing (to me) is that memtest86 always fails exactly at the start of the “block move” portion of the first test pass (I think that’s test #7). Heat tends to be more random than that. Still, one is often surprised. That’s why I was concerned about the memory chips (dual vs single channel) but on the other hand, it doesn’t matter whether I have one or two chips installed, “block move” always throws errors and dies. Hopefully it’s simply a badly-designed mobo (instead of a fried CPU).

I’ll check out libsensors. Thanks!!!

It just occurred to me that what’s really, truly weird about all this is that it’s been happily crunching away for days now on smaller simulations (using less than, say 1/3 of available RAM). What got me started with MEMTEST was that large simulations (say 2/3’s of RAM) would hang the system. This sort of goes along with the observation that Block Move portion of the memory test failed, doesn’t it? So it’s not exactly like my system’s flaky, it’s more like it cannot seem to do some things that I think it ought to be able to do (does that make sense?). Isn’t memory supposed to be ‘flat’?

You may have some addresses in RAM that are “bad”. Not necessarily that the memory cell is bad, but that the system (CPU, memory controller, RAM) cannot use those locations reliably. Your small jobs might not touch the bad addresses.

For example I have a motherboard where the first SIMM slot is bad. Memory in that slot is good only up to a certain address and then it starts spewing errors in memtest. The system can boot an OS, but at some point the OS starts doing strange things.
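
The situation described above, memory that behaves up to a certain address and misbehaves beyond it, can be illustrated with a toy model. The sketch below is loosely inspired by the idea of a block move test (fill with a pattern, move blocks around, verify), but it is not memtest's code: `FlakyMemory`, the stuck-at-zero fault, the 32-bit pattern and the block size are all assumptions made for the example.

```python
WORD_MASK = 0xFFFFFFFF


class FlakyMemory(list):
    """Toy RAM: words at index >= bad_from lose bit 0 on every write."""

    def __init__(self, n_words, bad_from=None):
        super().__init__([0] * n_words)
        self.bad_from = n_words if bad_from is None else bad_from

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        touched = range(*key.indices(len(self))) if isinstance(key, slice) else [key]
        for i in touched:
            if i >= self.bad_from:
                super().__setitem__(i, self[i] & ~1)  # simulated stuck-at-0 bit


def fill_pattern(n_words):
    """Rotating one-bit pattern, inverted on every other word."""
    words, bit = [], 1
    for i in range(n_words):
        words.append(bit if i % 2 == 0 else bit ^ WORD_MASK)
        bit = ((bit << 1) | (bit >> 31)) & WORD_MASK
    return words


def swap_halves(mem, start, block_words):
    """Move the two halves of one block past each other."""
    half = block_words // 2
    low = mem[start:start + half]
    high = mem[start + half:start + 2 * half]
    mem[start:start + half] = high
    mem[start + half:start + 2 * half] = low


def block_move_test(mem, n_words, block_words=64):
    """Exercise the first n_words of mem; return indices that read back wrong."""
    expected = fill_pattern(n_words)
    mem[0:n_words] = expected
    for start in range(0, n_words - block_words + 1, block_words):
        swap_halves(mem, start, block_words)
        swap_halves(expected, start, block_words)  # same moves on a clean copy
    return [i for i in range(n_words) if mem[i] != expected[i]]


if __name__ == "__main__":
    ram = FlakyMemory(1024, bad_from=768)  # good for the first 3/4 only
    print("small job errors:", len(block_move_test(ram, 256)))
    print("large job errors:", len(block_move_test(ram, 1024)))
```

A run confined to the low quarter of this toy memory comes back clean, while a run over the whole range reports errors only at or above the first bad word, which mirrors small jobs surviving and large ones failing.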

But that wouldn’t happen after the OP swapped memories/tried each in turn/tested new memories. I’d guess a memory controller problem; these things may occur only with some hardware (chipset) types or configurations. Now, if the processor has an embedded controller, it may cause the problem with this specific m/b but not with another. This may even lead to a recall or a refund.

Software memory testers are not very reliable IMO and as @ken_yap said, the fact that it fails to read an address doesn’t mean that the module is bad. If you have doubts about your RAM, you should test it on another mainboard or bring it to someone who has a hardware RAM tester. He will tell you within minutes if the RAM is bad. Also don’t tell us that your BIOS has a memory remap feature which is disabled!

Apart from blender and the strange results you get from MEMTEST, what else doesn’t work on this machine? I suggest running conky on your desktop to monitor everything, including all CPUs and memory usage, top CPU and memory processes, and CPU and mainboard temperature.

brunomcl wrote:
> I think I read somewhere about triple-channel memory controllers (three
> identical memory cards), but I’m not sure if it already exists or is
> just some technology report.

Yes it exists in some high-end Intel chipsets. Works as advertised on
Xeons at least.

In my BIOS I saw a setting for allowing memory “holes” - I got the impression it’s possible to map around such bad memory? If that were the case, then I guess you would have to let the BIOS do a full memory test at each boot.

I think the memory remap is enabled - I’m not sure exactly what that is now that I think about it, though it sounds like a HDD’s ability to take bad sectors out of usage… wouldn’t memtest respect that? (I assume I would have to let the full BIOS memory test run at startup to have that work)

The only problem I have is when blender tries to use a lot of memory (say, over ~50% of the available 8GB), and, of course, the memtest block move failure. Everything else seems fine under Linux and Win2k3Server - it was just the system hangs on these big blender runs that got me to try memtest anyway. Blender can truly eat memory. One of the reasons I went ahead and got another motherboard (earlier in this thread) was so I could go beyond 8GB (current board has only 2 slots / 8GB max).

Now that I remember, there are two memtest errors found - the block move errors reported - but then after throwing a bunch of those, it gets an “unexpected interrupt - shutting down CPU0” error and everything freezes… this is apparently my ‘system hang’ when using blender… that “unexpected interrupt.”