Frequent 'Dazed and confused' in logs - Possible hard disk problem?

Hi,

I have openSUSE 12.2 running on my server. The server uses 2 x 1TB Seagate SATA drives under fake RAID (Silicon Image driver).

I'm getting the following in the logs. Is it serious, and if so, how should I go about fixing it?

May 17 14:59:56 master kernel: [672051.182115] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 17 14:59:56 master kernel: [672051.182121] ata3.00: BMDMA2 stat 0x80e0009
May 17 14:59:56 master kernel: [672051.182126] ata3.00: failed command: READ DMA EXT
May 17 14:59:56 master kernel: [672051.182134] ata3.00: cmd 25/00:00:40:d7:78/00:01:31:00:00/e0 tag 0 dma 131072 in
May 17 14:59:56 master kernel: [672051.182136]          res 50/00:00:3f:d8:78/00:00:31:00:00/f0 Emask 0x20 (host bus error)
May 17 14:59:56 master kernel: [672051.182140] ata3.00: status: { DRDY }
May 17 14:59:56 master kernel: [672051.183005] [sched_delayed] sched: RT throttling activated
May 17 14:59:56 master kernel: [672049.863006] NMI: PCI system error (SERR) for reason b0 on CPU 0.
May 17 14:59:56 master kernel: [672049.863006] Dazed and confused, but trying to continue

Cheers

Nigel

The “Dazed and confused, but trying to continue” NMI error indicates a hardware problem, but it's hard to say whether it is due to a hard drive problem. Searching for the issue turns up bad video cards, bad memory and even a bad CPU as causes. A hard drive may be one as well, but nothing I found was conclusive. I don’t know how old the PC is, but I would do a complete cleaning: get rid of all dust in the heat sinks, power supply and fans, check that every cooling fan is working, and unplug and replug each connector and memory module, which remakes all the mechanical connections and reduces the resistance on each one. The older the PC, the bigger such problems can be. If the error continues after such an effort, you somehow need to rule out the hard drive, then perhaps the memory, and finally the motherboard and CPU. I must admit that in the past, when the cause could not be determined, I have simply purchased a new motherboard, CPU and memory and dropped them into the existing system.

Thank You,

Thanks James,

The server is about 7 years old. It went down earlier in the year. I suspect the cause was a failed RAID card, which I replaced. I built the OS fresh and it has been running ‘ok’ since February, apart from these log messages. I will do the clean as recommended, as the machine was very dusty. I will let you know if that cures it.

cheers

Nigel

On 05/18/2013 05:06 AM, Bignige wrote:

> The server is about 7 years old. It went down earlier in the year…i
> suspect the cause was a failed raid card, which I replaced. I built the
> os fresh and it has been running ‘ok’ since February…just these log
> messages. I will do the clean as recommended as the machine was very
> dusty. I will let you know if that cures it.

Cleaning the machine is a good step, but I am not sure it will help. I have had
those kinds of errors due to problems with the SATA cables. You might try
replugging them, but they may have reached their end of life. Apparently, most
such cables are only good for a limited number of plug/unplug cycles. You may
need new ones.

I have now cleaned the machine and re-seated the memory, RAID card, SATA cables and all other cables. It seemed that the messages had ceased, but today, while running a file sync with unison, I had more errors in the logs:


May 23 19:27:16 master kernel: [94534.230017] NMI: PCI system error (SERR) for reason b1 on CPU 0.
May 23 19:27:16 master kernel: [94534.230017] Dazed and confused, but trying to continue
May 23 19:27:16 master kernel: [94535.549055] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 23 19:27:16 master kernel: [94535.549060] ata3.00: BMDMA2 stat 0x80e0009
May 23 19:27:16 master kernel: [94535.549065] ata3.00: failed command: READ DMA EXT
May 23 19:27:16 master kernel: [94535.549073] ata3.00: cmd 25/00:00:68:4a:75/00:01:62:00:00/e0 tag 0 dma 131072 in
May 23 19:27:16 master kernel: [94535.549075]          res 50/00:00:67:4b:75/00:00:62:00:00/f0 Emask 0x20 (host bus error)
May 23 19:27:16 master kernel: [94535.549079] ata3.00: status: { DRDY }
May 23 19:27:16 master kernel: [94535.558609] ata3.00: configured for UDMA/100
May 23 19:27:16 master kernel: [94535.558636] ata3: EH complete
May 23 19:27:16 master kernel: [94535.559765] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 23 19:27:16 master kernel: [94535.559770] ata3.00: BMDMA2 stat 0x10e0009
May 23 19:27:16 master kernel: [94535.559775] ata3.00: failed command: READ DMA EXT
May 23 19:27:16 master kernel: [94535.559783] ata3.00: cmd 25/00:00:68:4a:75/00:01:62:00:00/e0 tag 0 dma 131072 in
May 23 19:27:16 master kernel: [94535.559785]          res 50/00:00:67:4b:75/00:00:62:00:00/f0 Emask 0x20 (host bus error)
May 23 19:27:16 master kernel: [94535.559788] ata3.00: status: { DRDY }
May 23 19:27:16 master kernel: [94535.568493] ata3.00: configured for UDMA/100
May 23 19:27:16 master kernel: [94535.568520] ata3: EH complete
May 23 19:28:15 master kernel: [94595.126479] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 23 19:28:15 master kernel: [94595.126485] ata3.00: BMDMA2 stat 0x80e0009
May 23 19:28:15 master kernel: [94595.126490] ata3.00: failed command: READ DMA EXT
May 23 19:28:15 master kernel: [94595.126498] ata3.00: cmd 25/00:00:e0:c3:65/00:01:3c:00:00/e0 tag 0 dma 131072 in
May 23 19:28:15 master kernel: [94595.126500]          res 50/00:00:df:c4:65/00:00:3c:00:00/f0 Emask 0x20 (host bus error)
May 23 19:28:15 master kernel: [94595.126504] ata3.00: status: { DRDY }
May 23 19:28:15 master kernel: [94595.135480] ata3.00: configured for UDMA/100
May 23 19:28:15 master kernel: [94595.135508] ata3: EH complete
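
For anyone reading these logs later, the `cmd` and `res` lines can be decoded to see which sectors were involved. The sketch below assumes the field order libata's error handler is generally understood to print (command or status, feature, sector count, three low LBA bytes, then the five high-order "hob" bytes, then the device register); `decode_taskfile` is a hypothetical helper name, not part of any tool.

```python
import re

# Assumed field order (libata error-handler printout):
#   cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# On a "res" line the first byte is the status register instead of the command.
TASKFILE = re.compile(r"\b(cmd|res) ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_taskfile(line):
    """Return the LBA (and sector count or status) from a cmd/res log line."""
    match = TASKFILE.search(line)
    if match is None:
        return None
    kind = match.group(1)
    regs = [int(x, 16) for x in re.split(r"[/:]", match.group(2))]
    (first, _feature, nsect, lbal, lbam, lbah,
     _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = regs
    # 48-bit LBA: the hob bytes are the high-order half.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    result = {"kind": kind, "lba": lba}
    if kind == "cmd":
        result["command"] = first
        # In LBA48 commands a sector count of 0 means 65536 sectors.
        result["sectors"] = ((hob_nsect << 8) | nsect) or 65536
    else:
        result["status"] = first
        # Bit 0 of the status register is the drive's own ERR flag.
        result["drive_err_bit"] = bool(first & 0x01)
    return result


if __name__ == "__main__":
    sample = [
        "ata3.00: cmd 25/00:00:40:d7:78/00:01:31:00:00/e0 tag 0 dma 131072 in",
        "         res 50/00:00:3f:d8:78/00:00:31:00:00/f0 Emask 0x20 (host bus error)",
        "ata3.00: cmd 25/00:00:68:4a:75/00:01:62:00:00/e0 tag 0 dma 131072 in",
        "ata3.00: cmd 25/00:00:e0:c3:65/00:01:3c:00:00/e0 tag 0 dma 131072 in",
    ]
    for entry in sample:
        print(decode_taskfile(entry))
```

Decoded this way, the three failed reads in this thread start at LBAs 0x3178d740, 0x62754a68 and 0x3c65c3e0, each for 256 sectors (256 x 512 = 131072 bytes, matching `dma 131072 in`), and the result status 0x50 does not have the drive's ERR bit set. Failures scattered across the disk with no drive-reported error would fit the "host bus error" label better than a single bad sector, though that is only a hint, not proof.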

Could there be a problem with the hard drive configuration? I found this in dmesg: "ata5.00: limited to UDMA/33 due to 40-wire cable", but it looks like this relates to a different drive.

I will change the SATA cables to see if that helps. If it's not the cables, is it likely to be the SATA RAID card, the motherboard, or something else?

P.S. The drives in question are hot-swap drives.

Nigel

I changed the SATA cables and moved the drives to different bays, but the problem persists.

So that leaves either a hard drive problem or a problem with the disk controller or motherboard. If you have the option of using a non-RAID hard drive to see whether it gets errors, that might be one thing to try. I do wonder if using a newer kernel might also tell you something?

openSUSE and Installing New Linux Kernel Versions - Blogs - openSUSE Forums

Thank You,

The hard drives and controller are new; the motherboard is old. I have a gut feeling that it may be a bug in the kernel, or at least a rogue error related to the way the controller is talking to the rest of the system. Yes, it's a vague conclusion with no evidence, but I feel the hardware is sound.

The server is a live production system, so it's hard to test using a non-RAID setup. Would upgrading the kernel be a major job? I have not done that before.

Thanks

Nigel

On 2013-05-24, Bignige <Bignige@no-mx.forums.opensuse.org> wrote:
> The server is a live production system so its hard to test using a non
> raid setup. Would upgrading the kernel be a major job? I have not done
> that before.

The binary way: http://download.opensuse.org/repositories/Kernel:/stable/standard/
Jim’s way: http://forums.opensuse.org/blogs/jdmcdaniel3/s-k-c-suse-automated-kernel-compiler-version-2-78-34/
The manual way: https://www.kernel.org/

Thanks. I will try a later kernel and report back.

Well, I upgraded the kernel from 3.4.6-2.10-desktop to 3.4.33-2.24-desktop. The "Dazed and confused" errors have ceased and there are no other worrying errors in the logs. So I guess that 3.4.6-2.10 either has a bug or just doesn't like my hardware.

Problem solved :)

Thanks for your help, guys.
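
One way to check that the errors really have stopped, rather than just scrolling through the log, is to count the relevant messages per day. This is a minimal sketch, assuming syslog lines in the same "Mon DD HH:MM:SS host kernel: ..." form as those posted above; `count_by_day` and the pattern names are made up for illustration, and the log path in the comment is an assumption.

```python
import re
from collections import Counter

# Messages of interest, taken from the log excerpts in this thread.
PATTERNS = {
    "nmi_serr": re.compile(r"NMI: PCI system error \(SERR\)"),
    "dazed": re.compile(r"Dazed and confused"),
    "host_bus": re.compile(r"\(host bus error\)"),
}


def count_by_day(lines):
    """Count matching kernel messages, keyed by (day, pattern name)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        day = " ".join(parts[:2])  # e.g. "May 17"
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(day, name)] += 1
    return counts


if __name__ == "__main__":
    # Assumption: on a real system you would read the syslog file instead,
    # e.g. open("/var/log/messages"), if that is where kernel messages go.
    sample = [
        "May 17 14:59:56 master kernel: [672051.182136]          res 50/00:00:3f:d8:78/00:00:31:00:00/f0 Emask 0x20 (host bus error)",
        "May 17 14:59:56 master kernel: [672049.863006] NMI: PCI system error (SERR) for reason b0 on CPU 0.",
        "May 17 14:59:56 master kernel: [672049.863006] Dazed and confused, but trying to continue",
        "May 23 19:27:16 master kernel: [94534.230017] Dazed and confused, but trying to continue",
    ]
    for (day, name), n in sorted(count_by_day(sample).items()):
        print(day, name, n)
```

If the counts stay at zero for the days after the upgrade, including days with heavy disk activity such as a unison sync, that is reasonable evidence the new kernel made the difference.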

Nigel

So it is a curious thing for an apparent hardware problem to be associated with a bum kernel, but of course, anything can have bugs, that is for sure. Happy to hear of your success through kernel updates. For more info on kernel upgrades, have a look at my blog on the subject you can find here:

openSUSE and Installing New Linux Kernel Versions - Blogs - openSUSE Forums

Thank You,