I’m using SMART to monitor the HDDs and I’m getting a series of messages
from one of the drives on a RAID-5:
<snip>
Nov 5 12:52:31 zanshin smartd[4814]: Device: /dev/sdb, 9 Currently unreadable (pending) sectors
Nov 5 12:52:31 zanshin smartd[4814]: Device: /dev/sdb, 8 Offline uncorrectable sectors
</snip>
Does this mean the HDD is on the way out or can I take everything off-line,
re-partition the disk, fsck etc. and flag the offending sectors as bad and
continue using it?
In theory the disks are still under warranty - do SMART messages like this
constitute a failed disk?
Not necessarily. With 10.3, I used to get tons of these messages on both of my systems, until I turned off the notifications. I have since upgraded to 11, and both hard drives are still going strong, and I no longer get these notifications.
Note though, I was not using RAID, so your situation may be different.
I would maybe use another hard drive diagnostic program, even one in Windows, to see if it says the same thing or not. But I would not panic just yet.
> Hi
> Run the raid tool mdadm --detail /dev/md(n) where n is the number.
>
About 10 minutes after the original post I had problems with the RAID - I’ve
run mdadm --detail and I’ve got /dev/sde1 (4th disk in the 4 disk array,
which has never reported a problem with smartd) out and rebuilding -
currently at 25%. I’ll just keep monitoring I suppose.
This a.m. I backed up everything, rebuilt the RAID and restored hoping I’d
get rid of the messages. Maybe I should have fsck’d the fs in the RAID
before rebuilding :-/
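For keeping an eye on a rebuild like the one above, a minimal sketch that pulls the recovery percentage out of text in the style of /proc/mdstat may help. The sample text, device order and block counts below are made up for illustration, and `rebuild_progress` is a hypothetical helper name, not part of mdadm.

```python
import re

# Sample text in the style of /proc/mdstat; device names and block counts
# here are invented for illustration.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[4] sdd1[2] sdc1[1] sdb1[0]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [=====>...............]  recovery = 25.0% (122095984/488383936) finish=98.5min speed=61945K/sec

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%")


def rebuild_progress(mdstat_text):
    """Return {array_name: (operation, percent)} for arrays being rebuilt."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            # A new "mdN :" line starts the block for that array.
            current = m.group(1)
            continue
        p = PROGRESS_RE.search(line)
        if p and current:
            progress[current] = (p.group(1), float(p.group(2)))
    return progress


if __name__ == "__main__":
    # On a live system you would read the real file instead:
    #   with open("/proc/mdstat") as f: text = f.read()
    print(rebuild_progress(SAMPLE_MDSTAT))
```

Arrays that are not rebuilding simply do not appear in the result, so an empty dict means nothing is in recovery.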
Hi
How are the hard drive temperatures? Do you have hddtemp installed?
–
Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 11.0 x86 Kernel 2.6.25.18-0.2-default
up 15:48, 2 users, load average: 0.01, 0.19, 0.17
GPU GeForce 6600 TE/6200 TE - Driver Version: 177.80
I have had similar errors on both SATA and other drives. Most of the errors go back to a failing power supply, which I changed out once I figured out the problem. After more than a few months and upgrades to SUSE 11 and other Linux distros I still receive the same SMART errors, but they never get worse. I'm not really sure whether the hard drive keeps track of them itself, but my tendency would be to ignore it unless it gets worse, rather than make radical changes. Just an additional thought: since the fault occurred I have reformatted the disks and still see the same errors, but nothing is creating problems for me.
Dave
VE3IXI
I took down the offending disk (/dev/sdb) today, deleted the single
partition and recreated and formatted it as ext3 with no errors. I’ve added
it back to the RAID and it’s now all together again but I still get the
smartd messages.
It looks like there’s no cure save physically replacing the disk - unless
anyone’s got any better ideas for what to do with a disk once it’s removed
from the RAID.
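Before deciding either way, it may be worth checking whether the counts in the smartd messages are actually growing, as suggested earlier in the thread. The following is only a rough sketch: the first two log lines are the ones from the original post (rejoined onto one line each), the third is invented to show a rising count, and `sector_counts` and `is_growing` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Matches smartd syslog entries like the ones quoted in the first post.
# The optional "[...]" part allows for a device-type tag after the device
# name, which some smartd versions print (an assumption, not seen in this thread).
SMARTD_RE = re.compile(
    r"smartd\[\d+\]: Device: (\S+)(?: \[[^\]]*\])?, (\d+) "
    r"(Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)


def sector_counts(log_lines):
    """Return {(device, kind): [counts in log order]}."""
    history = defaultdict(list)
    for line in log_lines:
        m = SMARTD_RE.search(line)
        if m:
            device, count, kind = m.group(1), int(m.group(2)), m.group(3)
            history[(device, kind)].append(count)
    return dict(history)


def is_growing(counts):
    """True if the latest count is higher than the first one seen."""
    return len(counts) > 1 and counts[-1] > counts[0]


if __name__ == "__main__":
    lines = [
        "Nov 5 12:52:31 zanshin smartd[4814]: Device: /dev/sdb, 9 Currently unreadable (pending) sectors",
        "Nov 5 12:52:31 zanshin smartd[4814]: Device: /dev/sdb, 8 Offline uncorrectable sectors",
        # Hypothetical later entry, invented to show a rising count.
        "Nov 6 12:52:31 zanshin smartd[4814]: Device: /dev/sdb, 12 Currently unreadable (pending) sectors",
    ]
    for key, counts in sector_counts(lines).items():
        print(key, counts, "growing" if is_growing(counts) else "stable")
```

A count that stays flat for months fits the "ignore it unless it gets worse" experience reported above; a count that keeps climbing is a stronger argument for a warranty claim.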
I don’t have hddtemp installed but smartd reports
SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 113
Hi
I’m guessing that is also degrees F not C?
–
Cheers Malcolm °¿° (Linux Counter #276890)
I’ve no idea - tho’ message does say Celsius.
However there’s no smoke, fire or steam, the case feels cool (a nice
aluminium Lian Li job with multiple fans). If this were a temperature in °C
I’m sure I’d have noticed the heat.
Basically I’ve been ignoring this message after it first appeared and I
found the case etc. quite cool so I assumed it was spurious.
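One likely explanation, as far as I understand smartd's defaults, is that its "changed from X to Y" messages track the normalized attribute value rather than the raw one, so 112/113 need not be degrees at all. A small sketch that separates the two columns in `smartctl -A` style output is below; the sample table is invented for illustration and is not taken from the drive in this thread, and `parse_attributes` is a hypothetical helper name.

```python
# Hypothetical excerpt in the layout printed by `smartctl -A`; the numbers
# are invented for illustration.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   113   098   000    Old_age   Always       -       35
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""


def parse_attributes(text):
    """Return {id: {"name", "value", "raw"}} from smartctl -A style rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),      # normalized value
                "raw": " ".join(fields[9:]),  # raw value, kept as text
            }
    return attrs


if __name__ == "__main__":
    temp = parse_attributes(SAMPLE)[194]
    print(f"{temp['name']}: normalized={temp['value']} raw={temp['raw']}")
```

On a real system the text would come from running `smartctl -A /dev/sdb` as root; how the normalized value maps to the raw temperature is vendor-specific.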
Hi
Well I would install hddtemp then and just verify…
–
Cheers Malcolm °¿° (Linux Counter #276890)
FWIW, my experience with SMART is that it sometimes fires false-positives, but it almost never misses an upcoming HDD failure. I’ve learned this the hard way over the years.
My take-away from this is to not necessarily panic over SMART errors, but make sure you are prepared, particularly if it’s anything important or mission-critical.
Just my 2c…
Cheers,
KV
All the important stuff gets backed up overnight to external USB HDD and
I’ve got a spare HDD sat waiting to go in when one dies.
hddtemp installed and reports 35°C across the board - now I’ll have to
dig back into smartd.conf to stop it reporting temperature changes (tho’ I
do remember googling out an article which limited reporting to +/- 5° which
may be a better option)
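For reference, the smartd.conf man page describes a `-I ID` directive (ignore an attribute when tracking changes, e.g. `-I 194`) and a `-W DIFF[,INFO[,CRIT]]` directive (report temperature changes of at least DIFF degrees); whether both are available depends on the smartmontools version, so check the man page on your system. The "only report changes of 5° or more" idea can be sketched as below. This models just the DIFF part, and `report_changes` is a hypothetical name, not smartd code.

```python
def report_changes(readings, min_diff=5):
    """Yield readings that differ from the last reported one by >= min_diff."""
    last_reported = None
    for temp in readings:
        # Always report the first reading, then only sufficiently large moves.
        if last_reported is None or abs(temp - last_reported) >= min_diff:
            last_reported = temp
            yield temp


if __name__ == "__main__":
    # Invented readings in degrees C, hovering around the 35 reported by hddtemp.
    samples = [35, 36, 35, 37, 40, 41, 34]
    print(list(report_changes(samples)))
```

Comparing against the last reported value, rather than the previous reading, keeps a slow drift from slipping through unreported.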