OpenSUSE 11.4/12.1: syslogd error on cache

Hi all!

Over the last few days my desktop (AMD Phenom II X2 550 BE at stock speeds/settings) running OpenSUSE 11.4 (and since yesterday 12.1) has started displaying errors out of nowhere. The computer was sitting idle, I hadn’t installed any updates and hadn’t changed any hardware. The messages look like this:

Message from syslogd@BartSUSE121 at Jan 4 21:56:23 ...
kernel: [  300.701060] [Hardware Error]: Data Cache Error: during L1 linefill from L2.

Message from syslogd@BartSUSE121 at Jan 4 21:56:23 ...
kernel: [  300.701068] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD

Message from syslogd@BartSUSE121 at Jan 4 21:56:23 ...
kernel: [  300.701079] [Hardware Error]: MC1_STATUS[Over|CE|-|-|-]: 0xd000000000000171

Message from syslogd@BartSUSE121 at Jan 4 21:56:23 ...
kernel: [  300.701085] [Hardware Error]: Instruction Cache Error: Copyback Parity/Victim error.

Message from syslogd@BartSUSE121 at Jan 4 21:56:23 ...
kernel: [  300.701091] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: EV
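
For what it's worth, the MC1_STATUS value can be decoded by hand. The sketch below is only an illustration, not the kernel's own decoder. It assumes the usual x86 machine-check register layout (status flags in bits 57 to 63, a 16-bit error code in the low bits, memory-hierarchy errors encoded as 0000 0001 RRRR TTLL). The function name `decode_status` and the abbreviation tables are made up for this example.

```python
# Assumed layout of an x86 machine-check status register (MCi_STATUS).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error; clear means corrected ("CE" in the log)
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # extra info register is valid
    58: "ADDRV",  # address register is valid
    57: "PCC",    # processor context corrupt
}

# Assumed abbreviations for the memory-hierarchy error code 0000 0001 RRRR TTLL.
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]  # RRRR
TX = ["INSN", "DATA", "GEN"]                                           # TT
LEVEL = ["L0", "L1", "L2", "LG"]                                       # LL


def decode_status(status):
    """Split a 64-bit status value into flag names and error-code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    if (code & 0xFF00) == 0x0100:  # memory-hierarchy (cache) error
        rrrr = (code >> 4) & 0xF
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "?"
        result["tx"] = TX[tt] if tt < len(TX) else "?"
        result["level"] = LEVEL[ll]
    return result


if __name__ == "__main__":
    print(decode_status(0xD000000000000171))
```

For the value in the log above this gives the flags VAL, OVER and EN with UC clear (a corrected error), and the error code 0x0171 decodes to L1 / INSN / EV, which matches the last "cache level" line.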

These messages sometimes appear every minute or so, sometimes continuously. Apart from this the system runs without problems: no crashes, not even when stressing the CPU with a multithreaded model. I had the problem under 11.4 and now also under 12.1. Windows 7 (dual boot) also runs fine, and the computer survives half an hour of prime95 without problems.
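
To put a number on "every minute or so", the messages can be tallied from a saved copy of the log. This is a minimal sketch that assumes the lines look like the ones quoted above; the function name `tally_hardware_errors` and the way categories are cut out of the message text are choices made for this example only.

```python
import re
from collections import Counter

# Matches "kernel: [  300.701060] [Hardware Error]: <message>".
# The opening bracket before the timestamp is optional.
HW_ERR = re.compile(r"kernel:\s*\[?\s*(\d+\.\d+)\]\s*\[Hardware Error\]:\s*(.+)")


def tally_hardware_errors(lines):
    """Count [Hardware Error] lines per category (text before the first ':' or '[')."""
    counts = Counter()
    for line in lines:
        match = HW_ERR.search(line)
        if not match:
            continue  # e.g. the "Message from syslogd@..." header lines
        message = match.group(2).strip()
        category = message.split(":", 1)[0].split("[", 1)[0].strip()
        counts[category] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Message from syslogd@BartSUSE121 at Jan 4 21:56:23 ...",
        "kernel: [  300.701060] [Hardware Error]: Data Cache Error: during L1 linefill from L2.",
        "kernel: [  300.701079] [Hardware Error]: MC1_STATUS[Over|CE|-|-|-]: 0xd000000000000171",
    ]
    for category, count in tally_hardware_errors(sample).items():
        print(category, count)
```

Feeding it the whole log file (one line per list entry) shows at a glance whether the data-cache or the instruction-cache errors dominate.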

From what I have found so far, these problems are mostly related to memory (?) or overheated processors. I ran memtest86 for a few hours without problems. The CPU is quite cool, ~25deg when idle and ~30 when stressed, so heat shouldn’t be the problem either.

Does anybody have an idea where the problem might be or how to solve it? Hardware? OpenSUSE? …?

Thanks, Bart

On 01/05/2012 09:46 AM, julietbravo wrote:
>
> Hardware? OpenSUSE? …?

if i understand what you wrote correctly (that you saw these errors in
11.4 and still in 12.1) then i vote hardware and suspect disk failure
is in your near future…

i would suggest you not write to that disk again (which means do not
boot from it) until after you have a good backup of all the data on it
that you wish to keep (use a live CD to copy off important data, photos,
emails, movies, settings, configs, etc etc etc)…and then, almost all the
suggestions given in this very fresh thread are good for you too:
http://forums.opensuse.org/showthread.php?t=470635


DD http://tinyurl.com/DD-Caveat
openSUSE®, the “German Engineered Automobiles” of operating systems!

Thanks for your fast reply. I indeed get the same errors on my old installation (11.4) and a fresh installation of 12.1 but both with the same HD.

I’ll install OSuse12.1 this evening on a spare HD to see if the problems still persist and let you know the results…

Couldn’t wait; quickly detached my old HD and installed OSuse12.1 on a spare HD but still the same errors.

Forget about this post. I see which hardware is concerned.

I don’t think it’s disk failure. Run memtest, and let it run for at least 24 hours, not just a few hours.

I suddenly see where you’re from :). Welcome!! RAM costs next to nothing here these days. If you need the machine, I would put different RAM in it and then see what happens. Have you seen the NL subforums too?

The messages refer to the caches of your CPU !!!. L1 and L2 are the Level-1 and Level-2 caches of your processor. I don’t know whether that means the CPU is failing, a BIOS setting is wrong, or a boot option for openSUSE is missing. Going to dig into that.

Thanks (bedankt :)) Knurpht…

I read on the Archlinux forum that this could be related to overheating of the CPU, but that is clearly not the case here… The fact that the problems popped up out of nowhere (without changing anything!) doesn’t seem like good news to me; I hope my CPU isn’t baked :open_mouth: It’s difficult to test, though, since I don’t have a spare AM3 CPU lying around…

Just read the same post. And some bug reports. None found though that indicated a degrading condition of RAM or CPU. Take care of your backups :D. FWIW it might even be the motherboard that’s causing it, or the combo of motherboard and CPU (and RAM). If the system runs stable, also for a couple of days, I’d report a bug. If it is a bug, the messages might even be incorrect.

On 01/05/2012 01:16 PM, Knurpht wrote:
> The messages refer to the caches of your CPU !!!. L1 and L2 are the
> Level-1 and Level-2 caches of your processor.

oh!! good catch…i was FAR off base…

very happy you popped in and gave correcting advice!!

perhaps this would be a good time to see if this apparent problem is a
new result of the systemv to systemd change…

@JB, suggest you boot and on first green screen push F5, then select
systemv (rather than systemd which is the default for 12.1)–and, i
wouldn’t blame you at all for getting a second opinion prior to trying…



I think there was one case where it was related to a problematic CPU, but he could at least reproduce it on Windows7.
All my code & data is safely stored using/on git/Huygens, so at least there I’m safe… Is there an easy method to temporarily disable these messages? It’s a bit difficult to use the system now with the constant stream of error messages.

I now have a clean system running on a spare HD so I can safely experiment there, I’ll give it a try!

Messages were already there on 11.4, where systemd was not installed by default.

I’m starting to get the feeling that it is indeed a hardware problem; the problems in OSuse persist, and for the last few days I’ve been running Ubuntu (sorry…), which keeps crashing, after which the computer gets stuck at the BIOS screen half of the time. Guess I’ll be running some more hardware checks over the next few days :X

I hope you don’t run into my situation from a couple of months ago. First a broken video card (BIOS beep codes), so I replaced the video card; then what seemed to be RAM errors, so I replaced the RAM. In both cases the replacement seemed to be the solution for the first couple of days. Then the machine appeared to have lost a disk (used for temporary backups). In the end it was the motherboard… The video card is back in use, and the RAM as well (though not in my machine :)).

A couple of things you still can do:

  • Disassemble machine completely. Clean all connectors and reassemble.
  • Check BIOS battery, or rather replace it by a new one.
  • Have memtest run for at least 24 hours.

Your story sounds pretty familiar; after one of the latest crashes one HD is no longer visible, not even in the BIOS :open_mouth: So it might as well be a problem with the motherboard. The main problem for now is that I’m not looking forward to buying a new {CPU / motherboard}, only to find out that the problem is actually in the {motherboard / CPU}. If I’m going to buy both of them, I strongly prefer an Intel CPU with a corresponding new motherboard.

I’ve already disassembled the machine and cleaned it from top to bottom, but nothing was seriously ‘dirty’ or dusty.
I’ll replace the BIOS battery to see if that helps.
I’ve been running memtest for 14 hours without problems, will try 24h as well.

I’ll see if I can borrow a spare motherboard + CPU here at the university to ensure that the problem is related to my current motherboard and/or CPU.

Me too! I am having the same error messages. I also know that my RAID controller’s BIOS has been messed up. I got this server with failed drives, and whenever I booted it would say “3ware BIOS not installed”. 3ware is my RAID controller. I popped in new drives and the controller recognized both drives but stated that one was “Not in use”. It told me that my controller had a Degraded Disk Array and that I should rebuild the array. Could these problems be related? I mean, I’m seeing disk failure, RAID controller failure and crazy error messages from the OS. I am going to try to rebuild the array; it says to try with the HD that is in the “not in use” slot and then try with another drive if that does not work. I will try to post a picture of the message. But yes, DEFINITELY BACK UP OUR DATA FOLKS!

It probably is.

Solution:

Disable the defective core via BIOS.
Look at the MCn_STATUS number.
0 is the first, 1 the second etc.
On my PC it was also 1, meaning the second processor.

The problem was caused by aged and cracked thermal paste, which had become ineffective.
After replacement of the paste and disabling the defective core, the system runs smoothly again, just a bit slower ofc.
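
If the BIOS has no option for disabling a single core, Linux can also take a core offline at runtime through sysfs. Note that this is a different technique from the BIOS switch described above and does not survive a reboot. A minimal sketch, assuming a Linux system, root privileges and the usual `/sys/devices/system/cpu/cpuN/online` control file; the function name `set_core_online` is made up for this example.

```python
from pathlib import Path


def set_core_online(cpu_index, online, sysfs_root="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpuN/online and return the value read back afterwards."""
    control = Path(sysfs_root) / f"cpu{cpu_index}" / "online"
    if not control.exists():
        # cpu0 often has no 'online' file because it cannot be taken offline.
        raise FileNotFoundError(f"no hotplug control file at {control}")
    control.write_text("1" if online else "0")
    return control.read_text().strip()


if __name__ == "__main__":
    # Takes the second core (cpu1) offline; needs root on a real system.
    print(set_core_online(1, online=False))
```

Calling `set_core_online(1, online=True)` brings the core back, which makes it easy to check whether the errors return with it.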

Just to confirm that it was the defective processor core, I’ll add:
Had the problem on 11.4, and after install of 12.2 the problem reappeared after re-activating core 1.
So it’s really a hardware problem.

Not a big issue, now it’s an X3 processor, no longer an X4…
Probably no need to replace the processor, as defective cores get disabled in the factory, too.
In fact, if you have fewer than 6 cores, there may well be a few disabled cores buried in your CPU, since such chips then get marketed as X2, X4 and the like.