raid5: Disk failure on sdc1

Help needed!

Today one of the disks in my raid5 failed. I am unsure what to do next and fear I am sitting on a time bomb… mdadm removed the failing disk from md0 and md0 is working again. I will copy my data off before using or fixing it! (First I need a big disk :sarcastic:)

I would like to know how to check the failing disk. Is the disk really bad, or is it the file system?

dmesg

SCSI error : <2 0 0 0> return code = 0x8000002
sdc: Current: sense key: Aborted Command
    Additional sense: Scsi parity error
end_request: I/O error, dev sdc, sector 305964919
raid5: Disk failure on sdc1, disabling device. Operation continuing on 3 devices
ata4: command 0x25 timeout, stat 0x50 host_stat 0x64
ata3: command 0x25 timeout, stat 0xff host_stat 0x65
ata3: status=0xff { Busy }

mdadm --detail /dev/md0

/dev/md0:
        Version : 00.90.02
  Creation Time : Sun Jul 22 18:50:51 2007
     Raid Level : raid5
     Array Size : 732563712 (698.63 GiB 750.15 GB)
  Used Dev Size : 244187904 (232.88 GiB 250.05 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Oct  4 16:33:31 2008
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 0100ad93:3ac4b5f7:8745e70a:d94a2bba
         Events : 0.342136

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       0        0        2      removed
       3       8       49        3      active sync   /dev/sdd1

       4       8       33        -      faulty spare   /dev/sdc1
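
A minimal sketch of picking the failed member out of `mdadm --detail` output like the listing above. It only parses text that has already been captured; the helper name `faulty_members` is hypothetical, and the column layout is assumed to match the output shown here.

```python
import re

def faulty_members(detail_text):
    """Return device paths whose state column contains 'faulty'."""
    faulty = []
    for line in detail_text.splitlines():
        # Device rows look like:
        #   "   4   8   33   -   faulty spare   /dev/sdc1"
        m = re.match(
            r"\s*\d+\s+\d+\s+\d+\s+(\d+|-)\s+(.*?)\s+(/dev/\S+)\s*$", line
        )
        if m and "faulty" in m.group(2):
            faulty.append(m.group(3))
    return faulty

# Device table copied from the mdadm --detail output in this thread.
sample = """\
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       0        0        2      removed
       3       8       49        3      active sync   /dev/sdd1

       4       8       33        -      faulty spare   /dev/sdc1
"""
print(faulty_members(sample))
```

On a live system the text would come from running `mdadm --detail /dev/md0` as root and passing its output to the function.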

Hi
Run a smartctl test on the device. Are you monitoring the hdd temps
(hddtemp from packman)?


Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 11.0 x86 Kernel 2.6.25.16-0.1-default
up 4:34, 2 users, load average: 0.10, 0.43, 0.58
GPU GeForce 6600 TE/6200 TE - Driver Version: 173.14.12

Fire11 wrote:

> I like to know how to check the failing disc. Is it really bad or is it
> the file system.
>
If it’s a Seagate drive, they have a Linux command-line diagnostic for most
of their drives, and if it’s a Maxtor that is actually a Seagate drive, the generic
tests in the diagnostic work.
http://www.seagate.com/www/en-us/support/downloads/seatools/ The download is at the very
bottom of the page. I ran it on openSUSE 11.0.

I don’t know about the RAID 5 part. Seagate says they do not support Linux
if you call tech support. They did not even know about the diagnostic. But
they replaced two drives when I read them the messages from the diagnostic.
The drives were under warranty and were sold as Maxtor 320 GB drives, but are
actually Seagate Barracuda drives.
If the Maxtor model number begins with STM or ST, it’s a Seagate drive from what I was
told, and the diagnostic will work.


Russ
Linux register user 441463
openSUSE11.0

@malcolmlewis

smartctl -a /dev/sdc

smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ATA      WDC WD2500KS-00M Version: 02.0

SATA disks accessed via libata are not currently supported by
smartmontools. When libata is given an ATA pass-thru ioctl() then an
additional '-d libata' device type will be added to smartmontools.

Not sure what to do with this message. My raid5 is on SuSE 10.0, which means I do not have access to newer versions of smartctl.

hddtemp /dev/sdc1

WARNING: Drive /dev/sdc1 doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sdc1: Esi- ,erdas atdnra dni:  no sensor

Euhmm to answer the question, no I was not running hddtemp.

@upscope
Good idea to run the diagnostic tool from WD. Strangely, the tool aborts with an error. I will give it another go or two and post the results.

Hi
Grab the packman version of "hddtemp-0.3_beta15-10.pm.1.<arch>.rpm"
for 10, that should work for those disks.

I would also try upgrading your smartmontools to 5.38 as this works
fine on my SLED10 system (based on 10.1) and reports the information
for my two WD 36GB Raptors and two WD 250 SE’s.

You can search here for the rpm http://rpm.pbone.net/


Cheers Malcolm °¿° (Linux Counter #276890)

So sorry for the slow responses. It is a painful process getting my data to safety. I guess it will take me the whole day.

I did manage to run the WD diagnostic tool, but it gave no errors in the quick test. The advanced test might destroy data, so this has to wait until backups are done (mainly because I don’t know which drive is sdc1 :shame:).
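
One possible way to find out which physical drive is `sdc`, sketched under the assumption that udev populates `/dev/disk/by-id` on the system (an older release such as 10.0 may not provide it). The symlink names there usually embed the model and serial number, which can be compared with the label on the drive. `ids_for_device` is a hypothetical helper name.

```python
import os

def ids_for_device(dev_name, by_id_dir="/dev/disk/by-id"):
    """Return by-id symlink names that resolve to the given device, e.g. 'sdc'."""
    matches = []
    if not os.path.isdir(by_id_dir):
        return matches  # directory missing: nothing to look up on this system
    for name in sorted(os.listdir(by_id_dir)):
        # Each entry is a symlink pointing at the kernel device node.
        target = os.path.realpath(os.path.join(by_id_dir, name))
        if os.path.basename(target) == dev_name:
            matches.append(name)
    return matches

if __name__ == "__main__":
    for name in ids_for_device("sdc"):
        print(name)
```

If the directory is not available, the serial number reported by `smartctl -a` (once it works on the drive) serves the same purpose.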

@malcolmlewis
I do run hddtemp version 0.3-beta15 but it does not see a sensor.
I did update to smartctl 5.39

smartctl -a /dev/sdc

smartctl 5.39 2008-08-16 16:49 [i686-suse-linux] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net


Probable ATA device behind a SAT layer
Try an additional '-d ata' or '-d sat' argument.

But still no info…
I use SATA drives. The mainboard had only 2 SATA ports, so I had to add a simple 2-port SATA PCI card. I guess sdc is on the SiI SATA card.

lspci

02:0a.0 Unknown mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)

But this SiI card is not the problem: none of the 4 disks gives a read-out with smartctl.

Ok, my data is safe. The extended test (100% on the disabled drive) did not return any errors. So that’s it, the bomb might still be ticking.

The diagnostic tool is now writing zeros to the whole disk. Afterward I will add it to the raid array again.

It leaves me with some questions. I would really like to get the monitoring tools ‘hddtemp’ and ‘smartmontools’ to work.
If it can’t be done on openSUSE 10.0, I want to move to 11.0, rebuild my raid5 in it and copy my data onto it. (Can I switch to 11.0 and mount md0 without losing the data on it?)

Hi
Glad it’s all starting to sort itself out. So your version of hddtemp
isn’t working? How old are the drives?


Cheers Malcolm °¿° (Linux Counter #276890)

@malcolmlewis

…how old are the drives?
Well, I had to look it up and guess what… I bought them on 4 Sept 2006. The guarantee is 2 years. >:(

WD tools say the drive is fine, who is right in this case?

About usage:
The system is for photos and home movies, so it is powered on perhaps 16 h per week with one user. It is a tower with 4 fans for cooling, dedicated as a Samba network drive. It is not a workstation.

At the moment the system is rebuilding the array with the suspicious drive.

Personalities : [raid5]
md0 : active raid5 sdc1[4] sda1[0] sdd1[3] sdb1[1]
      732563712 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
      [====>................]  recovery = 21.3% (52047744/244187904) finish=60.9min speed=52573K/sec
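
A small sketch for keeping an eye on a rebuild like the one above without watching the console. It parses text in the `/proc/mdstat` format; `recovery_status` is a hypothetical helper name, and the regular expression assumes the `recovery = …% … finish=…min` layout shown in this thread.

```python
import re

def recovery_status(mdstat_text):
    """Return {array: (percent_done, minutes_left)} for arrays being rebuilt."""
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        head = re.match(r"(md\d+)\s*:", line)
        if head:
            current = head.group(1)  # remember which array the next lines describe
            continue
        prog = re.search(
            r"(?:recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min", line
        )
        if prog and current:
            status[current] = (float(prog.group(1)), float(prog.group(2)))
    return status

# Sample taken from the rebuild shown in this thread.
sample = """\
Personalities : [raid5]
md0 : active raid5 sdc1[4] sda1[0] sdd1[3] sdb1[1]
      732563712 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
      [====>................]  recovery = 21.3% (52047744/244187904) finish=60.9min speed=52573K/sec
"""
print(recovery_status(sample))
```

On the server itself the input would be the contents of `/proc/mdstat`, read with `open("/proc/mdstat").read()`.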

Hi
Well, they should support hddtemp then. I have WDs: the Raptors are from 2004
and the SEs are from 2005. Is SMART enabled in the BIOS?

I would tend to agree with the manufacturer’s tools… I would guess it’s
a temperature issue, as the SEs on mine run hot. I used to have three,
but removed one and now just have a mirror, with the other one as a backup
drive.

Are the fans blowing, sucking or both? Sometimes too many fans can
impede airflow/cooling…


Cheers Malcolm °¿° (Linux Counter #276890)

Well yes, S.M.A.R.T. is enabled in the BIOS.

hddtemp -b

WDC WD2500KS-00MJB0                                             |   194 | Western Digital Caviar SE16 250GB 16MB

So it should work but all I get is:

hddtemp /dev/sdc1
WARNING: Drive /dev/sdc1 doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sdc1: Esi- ,erdas atdnra dni:  no sensor

It would help a lot in optimizing cooling if I could use the sensors.

Both commands do run on my root disk. It is an ‘old’ 120 GB Hitachi drive…

smartctl -a /dev/hda

smartctl 5.39 2008-08-16 16:49 [i686-suse-linux] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     IBM/Hitachi Deskstar 120GXP family
Device Model:     IC35L120AVVA07-0
Serial Number:    VNC602A6L0ERTA
Firmware Version: VA6OA52A
User Capacity:    123,522,416,640 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  ATA/ATAPI-5 T13 1321D revision 1
Local Time is:    Mon Oct  6 19:11:01 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (3399) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  57) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   060    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   095   095   024    Pre-fail  Always       -       356 (Average 356)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1429
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       7676
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1422
192 Power-Off_Retract_Count 0x0032   099   099   050    Old_age   Always       -       1663
193 Load_Cycle_Count        0x0012   099   099   050    Old_age   Always       -       1663
194 Temperature_Celsius     0x0002   177   177   000    Old_age   Always       -       31 (Lifetime Min/Max 9/62)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       7
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       37

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Device does not support Selective Self Tests/Logging

hddtemp /dev/hda

/dev/hda: IC35L120AVVA07-0: 31°C

They won’t work on my SATA drives.

Ok, bad luck for me I guess… It’s a kernel issue… Only solvable with a SUSE update.

I booted with a newer kernel that I needed 2 years ago to grow my raid5 without data loss.

uname -a

Linux server 2.6.17.14-default #1 Wed Jul 25 19:47:49 CEST 2007 i686 athlon i386 GNU/Linux

With it the tools work, but lots of other stuff like USB does NOT work!
Hopefully you are able to tell me something more about the condition of my drive with this info. There is UDMA_CRC_Error_Count = 1 (the disk is only 900 h old).

smartctl -a -d ata /dev/sdc

smartctl 5.39 2008-08-16 16:49 [i686-suse-linux] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE16 Serial ATA family
Device Model:     WDC WD2500KS-00MJB0
Serial Number:    WD-WCANKH294925
Firmware Version: 02.01C03
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Oct  6 19:50:12 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (8280) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  96) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   193   179   021    Pre-fail  Always       -       5308
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       201
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       901
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       196
190 Airflow_Temperature_Cel 0x0022   056   030   045    Old_age   Always   In_the_past 44
194 Temperature_Celsius     0x0022   106   080   000    Old_age   Always       -       44
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       895         -
# 2  Conveyance offline  Completed without error       00%       892         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

hddtemp /dev/sdc

/dev/sdc: WDC WD2500KS-00MJB0: 44°C
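
A rough sketch for scanning a SMART attribute table like the one above for marginal entries, instead of reading it by eye. It flags rows whose WHEN_FAILED column is not `-`, or whose WORST value is at or below a non-zero THRESH. In the output above, `Airflow_Temperature_Cel` is marked `In_the_past` (WORST 030 against THRESH 045). The helper name `flagged_attributes` is hypothetical, and the ten-column layout is assumed from the smartctl 5.39 output in this thread.

```python
def flagged_attributes(smart_text):
    """Return (name, reason) pairs for attributes that look marginal."""
    flagged = []
    for line in smart_text.splitlines():
        parts = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        try:
            worst, thresh = int(parts[4]), int(parts[5])
        except ValueError:
            continue  # not an attribute row after all
        name, when_failed = parts[1], parts[8]
        if when_failed != "-":
            flagged.append((name, "failed: " + when_failed))
        elif thresh > 0 and worst <= thresh:
            flagged.append((name, "worst value at or below threshold"))
    return flagged

# A few rows copied from the smartctl output for /dev/sdc in this thread.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   030   045    Old_age   Always   In_the_past 44
194 Temperature_Celsius     0x0022   106   080   000    Old_age   Always       -       44
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
"""
print(flagged_attributes(sample))
```

Raw counters such as `UDMA_CRC_Error_Count` are not flagged by this check, since their threshold is 000; they still need to be judged by hand.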

Hi
Here is the list from my 4 drives for comparison
http://nopaste.com/p/aB0pvwj5H

I wouldn’t be too concerned with the 1 error… but on one of your
previous posts temperature did get up to 62?? That I would be
concerned about.


194 Temperature_Celsius     0x0002   177   177   000    Old_age   Always       -       31 (Lifetime Min/Max 9/62)

Can you get hddtemp to work with your original kernel?


Cheers Malcolm °¿° (Linux Counter #276890)

@malcolmlewis

Many thanks for your log!

No, I do not fear the ‘old’ temp reading of 62 degrees. It is from the root disk of the system, which can easily be swapped for another. Only 5 GB is used and I do have a backup. It’s a second-hand drive, so who knows what happened to it…

UDMA_CRC_Error_Count → On the internet I found 2 possible causes:

  • bad cable connection
  • insufficient power supply

So perhaps I should not worry, though I do believe it did break my raid array.

Recall the system log:

SCSI error : <2 0 0 0> return code = 0x8000002
sdc: Current: sense key: Aborted Command
    Additional sense: Scsi parity error
end_request: I/O error, dev sdc, sector 305964919
raid5: Disk failure on sdc1, disabling device. Operation continuing on 3 devices
ata4: command 0x25 timeout, stat 0x50 host_stat 0x64
ata3: command 0x25 timeout, stat 0xff host_stat 0x65
ata3: status=0xff { Busy }

I would like to close the error hunt and see what the future brings. To generate an automated warning for high temperatures I need to update SUSE or perhaps load a special kernel module. The original 2.6.13 kernel cannot use hddtemp directly on SATA disks.

How do you automate a temp warning? (I remind you that my box is not a workstation, so when it is on there is no screen and only a system speaker…)

Hi
Have a look at the man page for hddtemp: you can run it as a daemon and
monitor it remotely. So many ways… log it, run a cron job/script that could
beep, etc.
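
The cron job idea could be sketched roughly as follows. This is only an illustration under assumptions: it relies on `hddtemp -n` printing the bare temperature, the 50 °C limit is an arbitrary example value rather than a recommendation from this thread, and the names `read_temp` and `check` are hypothetical.

```python
import subprocess
import sys

LIMIT_C = 50  # assumed alarm threshold, not a value taken from the thread

def read_temp(device):
    """Ask hddtemp for the bare temperature (-n prints only the number)."""
    out = subprocess.check_output(["hddtemp", "-n", device])
    return int(out.decode().strip())

def check(devices, limit=LIMIT_C, reader=read_temp):
    """Return [(device, temp)] for every drive hotter than the limit."""
    hot = []
    for dev in devices:
        try:
            temp = reader(dev)
        except (OSError, ValueError, subprocess.CalledProcessError):
            continue  # no usable sensor reading for this drive
        if temp > limit:
            hot.append((dev, temp))
    return hot

if __name__ == "__main__":
    for dev, temp in check(sys.argv[1:]):
        # "\a" is the bell character; whether it reaches the system speaker
        # depends on where the job's output ends up (assumption).
        sys.stderr.write("\a%s is at %d C\n" % (dev, temp))
```

Run from cron with the drives as arguments (for example `/dev/sda /dev/sdb /dev/sdc /dev/sdd`), anything written to stderr would normally be mailed to the job owner, which gives a warning even on a box without a screen.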

I think at this stage if you decide how you want to get hddtemp running
then post back with a new query :slight_smile:


Cheers Malcolm °¿° (Linux Counter #276890)

Agreed, thanks for your time.

:shake: