Today one of my disks in raid5 failed. I am unsure what to do next and fear I am sitting on a time bomb… mdadm removed the failing disk from md0 and md0 is working again. I will copy my data before using or fixing it! (first I need a big disk :sarcastic:)
I would like to know how to check the failing disk. Is the disk really bad, or is it the file system?
dmesg
SCSI error : <2 0 0 0> return code = 0x8000002
sdc: Current: sense key: Aborted Command
Additional sense: Scsi parity error
end_request: I/O error, dev sdc, sector 305964919
raid5: Disk failure on sdc1, disabling device. Operation continuing on 3 devices
ata4: command 0x25 timeout, stat 0x50 host_stat 0x64
ata3: command 0x25 timeout, stat 0xff host_stat 0x65
ata3: status=0xff { Busy }
mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.02
Creation Time : Sun Jul 22 18:50:51 2007
Raid Level : raid5
Array Size : 732563712 (698.63 GiB 750.15 GB)
Used Dev Size : 244187904 (232.88 GiB 250.05 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Oct 4 16:33:31 2008
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 0100ad93:3ac4b5f7:8745e70a:d94a2bba
Events : 0.342136
Number   Major   Minor   RaidDevice State
   0       8        1        0      active sync   /dev/sda1
   1       8       17        1      active sync   /dev/sdb1
   2       0        0        2      removed
   3       8       49        3      active sync   /dev/sdd1
   4       8       33        -      faulty spare   /dev/sdc1
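To spot a degraded array without reading the whole mdadm --detail output, a small parser over /proc/mdstat can help. This is only a sketch: the function name degraded_arrays is made up for illustration, and it assumes the kernel prints a member-status field such as [UU_U], where "_" marks a missing member.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose member-status
# field (for example [UU_U]) contains "_", meaning a member is missing.
# Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }              # remember the current array
        match($0, /\[[U_]+\]/) {                 # status field such as [UU_U]
            if (index(substr($0, RSTART, RLENGTH), "_") > 0)
                print name
        }
    '
}
```

It would be called as: degraded_arrays < /proc/mdstat. With 3 of 4 devices active, as in the output above, it should list md0.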
>
> Help needed!
>
> Today one of my disks in raid5 failed. I am unsure what to do next and
> fear I am sitting on a time bomb… mdadm removed the failing disk from
> md0 and md0 is working again. I will copy my data before using or fixing
> it! (first I need a big disk :sarcastic:)
>
> I like to know how to check the failing disc. Is it really bad or is it
> the file system.
>
If it's a Seagate drive, they have a Linux command-line diagnostic for most
of their drives, or if it's a Maxtor with an actual Seagate drive, the generic
tests in the diagnostic work. The download is at the very bottom of this page:
http://www.seagate.com/www/en-us/support/downloads/seatools/
I ran it on openSUSE 11.0.
Don’t know about the RAID 5 part. Seagate says they do not support Linux
if you call tech support. They did not even know about the diagnostic. But
they replaced two drives when I read them the messages from the diagnostic.
They were under warranty and were sold as Maxtor 320 GB drives, which are
actually Seagate Barracuda drives.
If the Maxtor model number is STM or ST, it's a Seagate drive, from what I was
told, and the diagnostic will work.
smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: ATA WDC WD2500KS-00M Version: 02.0
SATA disks accessed via libata are not currently supported by
smartmontools. When libata is given an ATA pass-thru ioctl() then an
additional '-d libata' device type will be added to smartmontools.
Not sure what to do with this message. My raid5 is on SuSE 10.0, which means I do not have access to newer versions of smartctl.
hddtemp /dev/sdc1
WARNING: Drive /dev/sdc1 doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sdc1: Esi- ,erdas atdnra dni: no sensor
Euhmm to answer the question, no I was not running hddtemp.
@upscope
Good idea to run the diagnostic tool from WD. Strangely, the tool aborts with an error. I will give it another go or two and post the results.
Hi
Grab the Packman version of "hddtemp-0.3_beta15-10.pm.1.<arch>.rpm"
for 10; that should work for those disks.
I would also try upgrading your smartmontools to 5.38 as this works
fine on my SLED10 system (based on 10.1) and reports the information
for my two WD 36GB Raptors and two WD 250 SE’s.
So sorry for the slow responses. It is a painful process getting my data to safety. I guess it will take me the whole day.
I did manage to run the WD diagnostic tool, but it gave no errors in the quick test. The advanced test might destroy data, so this has to wait until backups are done. (mainly because I don’t know which drive is sdc1 :shame:)
@malcolmlewis
I do run hddtemp version 0.3-beta15 but it does not see a sensor.
I did update to smartctl 5.39
smartctl -a /dev/sdc
smartctl 5.39 2008-08-16 16:49 [i686-suse-linux] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
Probable ATA device behind a SAT layer
Try an additional '-d ata' or '-d sat' argument.
But still no info…
I use SATA drives. The mainboard had only 2 x SATA, so I had to add a simple 2 x SATA PCI board. I guess sdc is on the SiI SATA board:
lspci
02:0a.0 Unknown mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)
But this SiI board is not the problem: none of the 4 disks gives a readout with smartctl.
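On the question of which physical drive is sdc: one common approach is to read the serial number and compare it with the label on the drive. A minimal sketch, assuming smartctl can talk to the disk at all (further down in this thread that needed '-d ata' and a newer kernel); the function name serial_of is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: pull the serial number out of `smartctl -i` output
# (read from stdin) so it can be compared with the label on the drive.
serial_of() {
    awk -F': *' '/^Serial Number/ { print $2; exit }'
}
```

Something like: smartctl -i -d ata /dev/sdc | serial_of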
Ok, my data is safe. The extended test (100% on the disabled drive) did not return any errors. So that's it, the bomb might still be ticking.
The diagnostic tool is now writing zeros to the whole disk. Afterward I will add it to the raid array again.
It leaves me with some questions. I would really like to get the monitoring tools ‘hddtemp’ and ‘smartmontools’ to work.
If it can’t be done on openSUSE 10.0, I want to move to 11.0, rebuild my raid5 in it and copy my data onto it. (Can I switch to 11.0 and mount md0 without losing the data on it?)
…how old are the drives?
Well, I had to look it up and guess what… I bought them on 4 Sept 2006. The guarantee is 2 years. >:(
The WD tools say the drive is fine. Who is right in this case?
About usage:
The system is for photos and home movies, so it is powered on for perhaps 16h every week, with one user. It is a tower with 4 fans for cooling, dedicated as a Samba network drive. It is not a workstation.
At the moment the system is rebuilding the array with the suspicious drive.
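For reference, putting a faulty member back and watching the rebuild can be sketched as below. The function names readd_member and rebuild_progress are made up for illustration, the mdadm calls need root, and the "recovery = NN.N%" line format in /proc/mdstat is assumed to match this kernel.

```shell
#!/bin/sh
# Hypothetical helper: drop a member that md has marked faulty and add it
# back so the array resyncs onto it. Must run as root; not called here.
# Example arguments: /dev/md0 /dev/sdc1
readd_member() {
    mdadm "$1" --remove "$2" && mdadm "$1" --add "$2"
}

# Hypothetical helper: print the recovery percentage found in
# /proc/mdstat-style text on stdin (prints nothing when no rebuild runs).
rebuild_progress() {
    sed -n 's/.*recovery *= *\([0-9.]*%\).*/\1/p'
}
```

During a rebuild, rebuild_progress < /proc/mdstat should print a single percentage.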
Hi
Well, they should support hddtemp then. I have WDs too: the Raptors are
from 2004 and the SEs are from 2005. Is SMART enabled in the BIOS?
I would tend to agree with the manufacturer's tools… I would guess it’s
a temperature issue, as the SEs on mine run hot. I used to have three,
but removed one and now just have a mirror, with the other one as a
backup drive.
Are the fans blowing, sucking or both? Sometimes too many fans can
impede airflow/cooling…
–
Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 11.0 x86 Kernel 2.6.25.16-0.1-default
up 4:40, 4 users, load average: 0.18, 0.22, 0.12
GPU GeForce 6600 TE/6200 TE - Driver Version: 173.14.12
WDC WD2500KS-00MJB0 | 194 | Western Digital Caviar SE16 250GB 16MB
So it should work but all I get is:
hddtemp /dev/sdc1
WARNING: Drive /dev/sdc1 doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sdc1: Esi- ,erdas atdnra dni: no sensor
It would help a lot to optimize cooling if I can use the sensors.
Ok, bad luck for me I guess… It’s a kernel issue, only solvable with a SuSE update.
I booted with a newer kernel that I needed 2 years ago to grow my raid5 without data loss.
uname -a
Linux server 2.6.17.14-default #1 Wed Jul 25 19:47:49 CEST 2007 i686 athlon i386 GNU/Linux
With it the tools work, but lots of other stuff, such as USB, does NOT work!
Hopefully you are able to tell something more about the condition of my drive with this info. There is: UDMA_CRC_Error_Count = 1 (the disk is only 900h old)
smartctl -a -d ata /dev/sdc
smartctl 5.39 2008-08-16 16:49 [i686-suse-linux] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE16 Serial ATA family
Device Model: WDC WD2500KS-00MJB0
Serial Number: WD-WCANKH294925
Firmware Version: 02.01C03
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Oct 6 19:50:12 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (8280) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 96) minutes.
Conveyance self-test routine
recommended polling time: ( 6) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   193   179   021    Pre-fail  Always       -       5308
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       201
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       901
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       196
190 Airflow_Temperature_Cel 0x0022   056   030   045    Old_age   Always   In_the_past 44
194 Temperature_Celsius     0x0022   106   080   000    Old_age   Always       -       44
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline Completed without error 00% 895 -
# 2 Conveyance offline Completed without error 00% 892 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
No, I do not fear the ‘old’ temp reading of 62 degrees. It is the root disk of the system, which can easily be swapped with another. Only 5G is used and I do have a backup. It’s a second-hand drive, so who knows what happened to it…
UDMA_CRC_Error_Count → On the internet I found 2 possible causes:
- bad cable connection
- insufficient power supply
So perhaps I should not worry, though I do believe it did break my raid array.
Recall the system log:
SCSI error : <2 0 0 0> return code = 0x8000002
sdc: Current: sense key: Aborted Command
Additional sense: Scsi parity error
end_request: I/O error, dev sdc, sector 305964919
raid5: Disk failure on sdc1, disabling device. Operation continuing on 3 devices
ata4: command 0x25 timeout, stat 0x50 host_stat 0x64
ata3: command 0x25 timeout, stat 0xff host_stat 0x65
ata3: status=0xff { Busy }
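To see whether the CRC count keeps climbing (which would point at the cable or power rather than the platters), the raw value can be pulled out of the attribute table and logged now and then. A minimal sketch; the function name smart_raw is made up for illustration, and it assumes the attribute name sits in the second column and the raw value in the last, as in the table above.

```shell
#!/bin/sh
# Hypothetical helper: print the RAW_VALUE (last column) of one attribute,
# named in $1, from `smartctl -A` output read on stdin.
smart_raw() {
    awk -v name="$1" '$2 == name { print $NF }'
}
```

Something like: smartctl -A -d ata /dev/sdc | smart_raw UDMA_CRC_Error_Count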
I would like to close the error hunt and see what the future will bring. To generate an automated warning for high temperatures I need to update SuSE or perhaps load a special kernel module. The original 2.6.13 kernel cannot use hddtemp directly on SATA disks.
How do you automate a temp warning? (I remind you that my box is not a workstation, so when it is on there is no screen and only a system speaker…)
Hi
Have a look at the man page for hddtemp: you can run it as a daemon and
monitor it remotely. So many ways… log it, run a cron job/script that could
beep, etc.
I think at this stage you should decide how you want to get hddtemp
running, then post back with a new query.
–
Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 11.0 x86 Kernel 2.6.25.16-0.1-default
up 5:42, 1 user, load average: 0.06, 0.14, 0.25
GPU GeForce 6600 TE/6200 TE - Driver Version: 173.14.12
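The cron job/script idea could look roughly like this. It is only a sketch: the function name check_temp and the 50 degree limit are made up for illustration, and it assumes hddtemp can read the sensor and print a bare number.

```shell
#!/bin/sh
# Hypothetical helper: compare a temperature reading ($1, whole degrees
# Celsius) with a limit ($2). Returns 0 when within the limit, 1 when above
# it (after printing a warning), 2 when the reading is not a plain number.
check_temp() {
    case "$1" in
        ''|*[!0-9]*) echo "no usable temperature reading: '$1'" >&2; return 2 ;;
    esac
    if [ "$1" -gt "$2" ]; then
        echo "WARNING: disk temperature ${1} C exceeds limit ${2} C"
        return 1
    fi
    return 0
}
```

From a cron job it might be used as: check_temp "$(hddtemp -n /dev/sdc)" 50 || printf '\a' > /dev/console. Whether that bell actually reaches the system speaker on a headless box depends on the console setup, so treat that part as an assumption; mailing the warning is another option.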