Software RAID rebuilds continuously

Hi,

I have an openSUSE 11.0 file server with software RAID. The RAID configuration is two mirrored discs plus a hot spare. Below is the layout showing which partitions make up md0, md1 and md2 (sdb was the spare):

/boot   200 MB   sda1, sdb1, sdc1   md0
/        20 GB   sda2, sdb2, sdc2   md1
/data   900 GB   sda3, sdb3, sdc3   md2
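
For reference, in mdadm terms one of these arrays corresponds to roughly the following (a sketch only; the superblock and bitmap options are inferred from the mdstat output below, not the original creation command):


# example for md1 (the 20 GB root mirror): two active members plus sdb2 as the hot spare,
# with a version-1.0 superblock and an internal write-intent bitmap (both inferred, adjust as needed)
mdadm --create /dev/md1 --level=1 --raid-devices=2 --spare-devices=1 \
      --metadata=1.0 --bitmap=internal /dev/sda2 /dev/sdc2 /dev/sdb2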

Yesterday morning one of the active discs (sda) started giving read errors and was failed by RAID. The spare disc was activated automatically and re-syncing started. This is all great so far, but the re-syncing keeps restarting. What happens is that md1 re-syncs to 100% (first /proc/mdstat capture below), which takes a few minutes, but md1 stays degraded. Next, md2 starts to re-sync briefly (second capture) but never gets very far, because the re-syncing stops and restarts again on md1 (third capture). This cycles continuously and has been going on for over 26 hours.

admin@sherbet:~> cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md2 : active raid1 sda3[0] sdb3[2] sdc3[1]
953160408 blocks super 1.0 [2/1] [_U]
resync=DELAYED
bitmap: 1/455 pages [4KB], 1024KB chunk

md0 : active raid1 sda1[0] sdb1[2] sdc1[1]
216832 blocks super 1.0 [2/2] [UU]
bitmap: 0/7 pages [0KB], 16KB chunk

md1 : active raid1 sda2[0] sdb2[2] sdc2[1]
20972784 blocks super 1.0 [2/1] [_U]
[===================>.]  recovery = 98.1% (20587904/20972784) finish=0.1min speed=39262K/sec
bitmap: 1/161 pages [4KB], 64KB chunk

unused devices: <none>

admin@sherbet:~> cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md2 : active raid1 sda3[0] sdb3[2] sdc3[1]
953160408 blocks super 1.0 [2/1] [_U]
[>....................]  recovery = 0.0% (768/953160408) finish=39714.9min speed=384K/sec
bitmap: 1/455 pages [4KB], 1024KB chunk

md0 : active raid1 sda1[0] sdb1[2] sdc1[1]
216832 blocks super 1.0 [2/2] [UU]
bitmap: 0/7 pages [0KB], 16KB chunk

md1 : active raid1 sda2[0] sdb2[2] sdc2[1]
20972784 blocks super 1.0 [2/1] [_U]
bitmap: 1/161 pages [4KB], 64KB chunk

unused devices: <none>

admin@sherbet:~> cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md2 : active raid1 sda3[0] sdb3[2] sdc3[1]
953160408 blocks super 1.0 [2/1] [_U]
resync=DELAYED
bitmap: 1/455 pages [4KB], 1024KB chunk

md0 : active raid1 sda1[0] sdb1[2] sdc1[1]
216832 blocks super 1.0 [2/2] [UU]
bitmap: 0/7 pages [0KB], 16KB chunk

md1 : active raid1 sda2[0] sdb2[2] sdc2[1]
20972784 blocks super 1.0 [2/1] [_U]
[>....................]  recovery = 1.3% (283328/20972784) finish=8.5min speed=40475K/sec
bitmap: 0/161 pages [0KB], 64KB chunk

unused devices: <none>

My question is, is this normal? Am I being impatient or is there something wrong?

Thanks in advance for any advice.

Hello wickedmonkey. I see this is your first post, so welcome here.

Please, the next time you post computer text, do so between CODE tags (Posting in Code Tags - A Guide) to make it more readable.

Also, please note that openSUSE 11.0 is way out of support. That does not mean that people here will not try to help you, but almost none of them will have a running 11.0, and thus they will not be able to redo on their systems what you have. And when a possible solution might involve newer version(s) of some package(s), you might not be able to find them for 11.0.

> please note that openSUSE 11.0 is way out of support.

but SUSE Linux Enterprise Server/D version 11.0 is still supported… I wonder if that is what you are running? Let's see with

in a terminal


cat /etc/SuSE-release

If you have the enterprise version you are welcome to seek advice here, but BE ADVISED that these are the openSUSE forums and many of the answers might be from folks who have never run SLES/D (or maybe never even heard of it before), and you are likely much better off if you seek assistance from the Attachmate/SUSE forums (where the SUSE Linux Enterprise gurus hang out), at: http://tinyurl.com/422mrnu


DD
openSUSE®, the “German Automobiles” of operating systems

Thanks for the info, guys.

The release is

openSUSE 11.0

On 10/26/2011 03:26 PM, wickedmonkey wrote:
> openSUSE 11.0

too bad, not many around have run that so far past its end of life to know where to start in helping you… I sure can't… heck, that distro died on July 26th, 2010 and has received zero security updates since…

so to your original question:

“My question is, is this normal? Am I being impatient or is there
something wrong?”

to run a system for 16 months with no patches is both ‘wrong’ and not normal practice… I suggest you update your software as you sort out your hardware problem…


DD
openSUSE®, the “German Automobiles” of operating systems

Thanks DD.

Whilst I appreciate your response, it's not actually very helpful. Release 11 of openSUSE has been secure, stable, fast and robust, and has been running perfectly for 3 to 4 years non-stop. It hasn't broken because it is old or because it is no longer supported with software releases. It has broken due to either disc or controller failure. I was really hoping for feedback from someone who has experience with Linux software RAID – mdadm, etc.

Hi
The issue is that things move on, and a user could give advice that works fine in the supported releases but may (however unlikely) be fatal on an earlier unsupported release.

Why not use the proper mdadm commands to query the array, for example:


mdadm --misc --detail /dev/md0

Open up some console sessions and run the monitor to see what's happening, e.g.:


mdadm --monitor /dev/md1

wickedmonkey wrote:
> Release 11 of OpenSuse has been secure, stable, fast and robust and
> running perfectly for 3 to 4 years non-stop. It hasn’t broken because it
> is old or because it is no longer supported with software releases. It
> has broken due to either disc or controller failure I was really hoping
> for feedback from someone who had experience with Linux software RAID –
> mdadm, etc.

You’re correct about why it broke, but don’t be under any illusion that
it is still secure! Hopefully, it’s in an environment where it doesn’t
need to be secure.

I do use mdadm but thankfully, I’ve never seen the same symptoms, so I
haven’t been able to help. I would suggest that if you don’t find any
help here, that you try the linux-ide list. But note that you’ll
probably get the suggestion to upgrade there as well, since what you’re
seeing might well be a bug in mdadm that’s already been fixed.

If you can take the system offline, you could always boot off a live DVD
or somesuch and see if the mdadm on that can fix the problem.
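
For example, something along these lines from the live system (just a sketch, and it assumes the live environment sees the same /dev/sdX device names):


# see what each member's superblock says about the array
mdadm --examine /dev/sdb2 /dev/sdc2

# try assembling md1 from the two good members, leaving the faulty disc out
mdadm --assemble --run /dev/md1 /dev/sdb2 /dev/sdc2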

I guess you’re already googling to find any mention of something similar.

Thank you, Mr Lewis.

The mdadm details are as follows:


/dev/md1:
        Version : 01.00.03
  Creation Time : Thu Jul 17 13:42:15 2008
     Raid Level : raid1
     Array Size : 20972784 (20.00 GiB 21.48 GB)
  Used Dev Size : 41945568 (40.00 GiB 42.95 GB)
   Raid Devices : 2
  Total Devices : 3
Preferred Minor : 1
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Oct 26 17:04:09 2011
          State : active, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 1

 Rebuild Status : 5% complete

           Name : 1
           UUID : a4a2bb83:debbae5a:ec434bff:3e9eeeff
         Events : 25780

    Number   Major   Minor   RaidDevice State
       2       8       18        0      spare rebuilding   /dev/sdb2
       1       8       34        1      active sync   /dev/sdc2

       0       8        2        -      faulty spare   /dev/sda2


/dev/md0:
        Version : 01.00.03
  Creation Time : Thu Jul 17 13:42:14 2008
     Raid Level : raid1
     Array Size : 216832 (211.79 MiB 222.04 MB)
  Used Dev Size : 216832 (211.79 MiB 222.04 MB)
   Raid Devices : 2
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Oct 25 09:40:39 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0

           Name : 0
           UUID : 4effba6b:dda0ab06:7f58eea2:500fc74d
         Events : 76

    Number   Major   Minor   RaidDevice State
       2       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1

       0       8        1        -      faulty spare   /dev/sda1


/dev/md2:
        Version : 01.00.03
  Creation Time : Thu Jul 17 13:42:16 2008
     Raid Level : raid1
     Array Size : 953160408 (909.00 GiB 976.04 GB)
  Used Dev Size : 1906320816 (1818.01 GiB 1952.07 GB)
   Raid Devices : 2
  Total Devices : 3
Preferred Minor : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Oct 26 17:08:22 2011
          State : active, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 1

           Name : 2
           UUID : f40bc0b5:82db02c0:ae6427d8:d3828155
         Events : 15752

    Number   Major   Minor   RaidDevice State
       2       8       19        0      spare rebuilding   /dev/sdb3
       1       8       35        1      active sync   /dev/sdc3

       0       8        3        -      faulty spare   /dev/sda3

The monitor command will take a while before anything useful appears, so I'll leave it running on /dev/md1 for a while and post the results.

Thank you, djh.

It is secured against the outside world and has very limited port connectivity internally. I've seen enough servers trashed by updates (sometimes security updates) that I'm wary of blindly applying updates where they're not needed – especially if something's working well.

I tested the RAID before I put it live, including degrading it and rebuilding it both with mdadm and automatically, and it worked fine, so I'm a little miffed…

Hi
I would take the faulty device offline as well; maybe that is what's causing it to rebuild?
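
Maybe something along these lines (a sketch only; double-check the device names against your mdadm --detail output first):


# remove the failed sda partitions from the arrays so md stops touching them
mdadm /dev/md1 --remove /dev/sda2
mdadm /dev/md2 --remove /dev/sda3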

On 10/26/2011 06:26 PM, wickedmonkey wrote:
> I’m wary about blindly
> applying updates where they’re not needed – especially if something’s
> working well.

don’t get me wrong… I cling to “If it ain’t broke don’t fix it” like white on rice…

the thing is, imHo (which others here probably may not agree with), openSUSE is a poor choice for any commercial or private user who wishes to run “perfectly for 3 to 4 years non-stop”, because the short life of openSUSE dictates an upgrade or new install about every 15 to 18 months…

for example, the most recent supported version (11.4) dies on September 15th, 2012 (cite <http://en.opensuse.org/Lifetime>) after being first released on March 10th, 2011…


DD
openSUSE®, the “German Automobiles” of operating systems

Results from the monitor command for /dev/md1:


Oct 26 21:22:10: Rebuild40 on /dev/md1 unknown device
Oct 26 21:23:10: Rebuild60 on /dev/md1 unknown device
Oct 26 21:25:10: Rebuild80 on /dev/md1 unknown device
Oct 26 21:25:53: RebuildFinished on /dev/md1 unknown device
Oct 26 21:26:07: RebuildStarted on /dev/md1 unknown device
Oct 26 21:28:07: Rebuild20 on /dev/md1 unknown device
Oct 26 21:30:07: Rebuild40 on /dev/md1 unknown device
Oct 26 21:31:07: Rebuild60 on /dev/md1 unknown device
Oct 26 21:33:07: Rebuild80 on /dev/md1 unknown device
Oct 26 21:33:51: RebuildFinished on /dev/md1 unknown device
Oct 26 21:34:08: RebuildStarted on /dev/md1 unknown device
Oct 26 21:36:08: Rebuild20 on /dev/md1 unknown device
Oct 26 21:38:08: Rebuild40 on /dev/md1 unknown device
Oct 26 21:39:08: Rebuild60 on /dev/md1 unknown device
Oct 26 21:41:08: Rebuild80 on /dev/md1 unknown device
Oct 26 21:41:43: RebuildFinished on /dev/md1 unknown device
Oct 26 21:41:48: RebuildStarted on /dev/md1 unknown device
Oct 26 21:43:48: Rebuild20 on /dev/md1 unknown device
Oct 26 21:45:48: Rebuild40 on /dev/md1 unknown device
Oct 26 21:46:48: Rebuild60 on /dev/md1 unknown device
Oct 26 21:48:48: Rebuild80 on /dev/md1 unknown device
Oct 26 21:49:38: RebuildFinished on /dev/md1 unknown device
Oct 26 21:49:44: RebuildStarted on /dev/md1 unknown device
Oct 26 21:51:44: Rebuild20 on /dev/md1 unknown device
Oct 26 21:53:44: Rebuild40 on /dev/md1 unknown device
Oct 26 21:54:44: Rebuild60 on /dev/md1 unknown device
Oct 26 21:56:44: Rebuild80 on /dev/md1 unknown device
Oct 26 21:57:31: RebuildFinished on /dev/md1 unknown device
Oct 26 21:57:36: RebuildStarted on /dev/md1 unknown device
Oct 26 21:59:36: Rebuild20 on /dev/md1 unknown device
Oct 26 22:01:36: Rebuild40 on /dev/md1 unknown device
Oct 26 22:02:36: Rebuild60 on /dev/md1 unknown device
Oct 26 22:04:36: Rebuild80 on /dev/md1 unknown device
Oct 26 22:05:16: RebuildFinished on /dev/md1 unknown device
Oct 26 22:05:22: RebuildStarted on /dev/md1 unknown device
Oct 26 22:07:22: Rebuild20 on /dev/md1 unknown device
Oct 26 22:08:22: Rebuild40 on /dev/md1 unknown device
Oct 26 22:10:22: Rebuild60 on /dev/md1 unknown device
Oct 26 22:11:22: Rebuild80 on /dev/md1 unknown device
Oct 26 22:13:01: RebuildFinished on /dev/md1 unknown device
Oct 26 22:13:07: RebuildStarted on /dev/md1 unknown device
Oct 26 22:15:07: Rebuild20 on /dev/md1 unknown device
Oct 26 22:17:07: Rebuild40 on /dev/md1 unknown device
Oct 26 22:18:07: Rebuild60 on /dev/md1 unknown device
Oct 26 22:19:07: Rebuild80 on /dev/md1 unknown device
Oct 26 22:20:40: RebuildFinished on /dev/md1 unknown device
Oct 26 22:20:47: RebuildStarted on /dev/md1 unknown device
Oct 26 22:22:47: Rebuild20 on /dev/md1 unknown device
Oct 26 22:24:47: Rebuild40 on /dev/md1 unknown device
Oct 26 22:25:47: Rebuild60 on /dev/md1 unknown device
Oct 26 22:27:47: Rebuild80 on /dev/md1 unknown device
Oct 26 22:28:21: RebuildFinished on /dev/md1 unknown device
Oct 26 22:28:28: RebuildStarted on /dev/md1 unknown device
Oct 26 22:30:28: Rebuild20 on /dev/md1 unknown device
Oct 26 22:32:28: Rebuild40 on /dev/md1 unknown device
Oct 26 22:33:28: Rebuild60 on /dev/md1 unknown device
Oct 26 22:35:28: Rebuild80 on /dev/md1 unknown device
Oct 26 22:36:07: RebuildFinished on /dev/md1 unknown device
Oct 26 22:36:15: RebuildStarted on /dev/md1 unknown device
Oct 26 22:38:16: Rebuild20 on /dev/md1 unknown device
Oct 26 22:40:16: Rebuild40 on /dev/md1 unknown device
Oct 26 22:41:16: Rebuild60 on /dev/md1 unknown device
Oct 26 22:42:16: Rebuild80 on /dev/md1 unknown device
Oct 26 22:43:53: RebuildFinished on /dev/md1 unknown device
Oct 26 22:44:03: RebuildStarted on /dev/md1 unknown device
Oct 26 22:46:03: Rebuild20 on /dev/md1 unknown device
Oct 26 22:48:03: Rebuild40 on /dev/md1 unknown device
Oct 26 22:49:03: Rebuild60 on /dev/md1 unknown device
Oct 26 22:50:03: Rebuild80 on /dev/md1 unknown device
Oct 26 22:51:37: RebuildFinished on /dev/md1 unknown device
Oct 26 22:51:41: RebuildStarted on /dev/md1 unknown device
Oct 26 22:53:41: Rebuild20 on /dev/md1 unknown device
Oct 26 22:55:41: Rebuild40 on /dev/md1 unknown device
Oct 26 22:56:41: Rebuild60 on /dev/md1 unknown device
Oct 26 22:58:41: Rebuild80 on /dev/md1 unknown device
Oct 26 22:59:17: RebuildFinished on /dev/md1 unknown device
Oct 26 22:59:25: RebuildStarted on /dev/md1 unknown device
Oct 26 23:01:26: Rebuild20 on /dev/md1 unknown device
Oct 26 23:03:26: Rebuild40 on /dev/md1 unknown device
Oct 26 23:04:26: Rebuild60 on /dev/md1 unknown device
Oct 26 23:06:26: Rebuild80 on /dev/md1 unknown device
Oct 26 23:07:09: RebuildFinished on /dev/md1 unknown device
Oct 26 23:07:13: RebuildStarted on /dev/md1 unknown device
Oct 26 23:09:13: Rebuild20 on /dev/md1 unknown device
Oct 26 23:11:13: Rebuild40 on /dev/md1 unknown device
Oct 26 23:12:13: Rebuild60 on /dev/md1 unknown device
Oct 26 23:14:13: Rebuild80 on /dev/md1 unknown device
Oct 26 23:14:59: RebuildFinished on /dev/md1 unknown device
Oct 26 23:15:04: RebuildStarted on /dev/md1 unknown device
Oct 26 23:17:04: Rebuild20 on /dev/md1 unknown device
Oct 26 23:19:04: Rebuild40 on /dev/md1 unknown device
Oct 26 23:20:04: Rebuild60 on /dev/md1 unknown device
Oct 26 23:22:04: Rebuild80 on /dev/md1 unknown device
Oct 26 23:22:50: RebuildFinished on /dev/md1 unknown device
Oct 26 23:22:54: RebuildStarted on /dev/md1 unknown device
Oct 26 23:24:54: Rebuild20 on /dev/md1 unknown device
Oct 26 23:26:54: Rebuild40 on /dev/md1 unknown device
Oct 26 23:27:54: Rebuild60 on /dev/md1 unknown device
Oct 26 23:29:54: Rebuild80 on /dev/md1 unknown device
Oct 26 23:30:41: RebuildFinished on /dev/md1 unknown device
Oct 26 23:30:46: RebuildStarted on /dev/md1 unknown device
Oct 26 23:32:46: Rebuild20 on /dev/md1 unknown device
Oct 26 23:34:46: Rebuild40 on /dev/md1 unknown device
Oct 26 23:35:46: Rebuild60 on /dev/md1 unknown device
Oct 26 23:37:46: Rebuild80 on /dev/md1 unknown device
Oct 26 23:38:32: RebuildFinished on /dev/md1 unknown device
Oct 26 23:38:40: RebuildStarted on /dev/md1 unknown device

…etc, etc, etc…

I’ve heard that re-syncing can take a long time but this doesn’t look right to me and it’s been going for over 50 hours now.

Looking at the logs, I'm seeing read errors on the other active disc as well, so, as all three discs are different brands, I'm thinking that the controller is on its way out.

wickedmonkey wrote:
> Looking at the logs, I’m seeing read errors on the other active disc as
> well so, as all three discs are different brands, I’m thinking that the
> controller is on its way out.

If it is, there should be libata error messages in /var/log/messages.

Though again, there’ve been lots of changes/bugfixes/improvements over
time, so maybe your system doesn’t log them in the same way.

And if you are seeing error messages, then that’s a very different
situation to mdadm mysteriously failing to rebuild an array.

Perhaps you should post your hardware details and the log messages that
you do see.
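
For example, something like this to pull out the likely candidates (just one way of doing it; adjust the pattern as needed):


grep -iE 'libata|I/O error|medium error|sd[abc]' /var/log/messages | tail -n 100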

Sorry for the delay: t’Interweb issues.

No libata messages in /var/log/messages. The only thing I can see that may be relevant is the following:


Oct 27 14:51:36 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
Oct 27 14:51:36 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 27 14:51:36 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current]
Oct 27 14:51:36 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
Oct 27 14:51:36 sherbet kernel: end_request: I/O error, dev sdc, sector 42381006
Oct 27 14:51:36 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current]
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
Oct 27 14:51:38 sherbet kernel: end_request: I/O error, dev sdc, sector 42383438
Oct 27 14:51:38 sherbet kernel: RAID1 conf printout:
Oct 27 14:51:38 sherbet kernel:  --- wd:1 rd:2
Oct 27 14:51:38 sherbet kernel:  disk 0, wo:1, o:1, dev:sdb2
Oct 27 14:51:38 sherbet kernel:  disk 1, wo:0, o:1, dev:sdc2
Oct 27 14:51:38 sherbet kernel: RAID1 conf printout:
Oct 27 14:51:38 sherbet kernel:  --- wd:1 rd:2
Oct 27 14:51:38 sherbet kernel:  disk 1, wo:0, o:1, dev:sdc2
Oct 27 14:51:38 sherbet kernel: RAID1 conf printout:
Oct 27 14:51:38 sherbet kernel:  --- wd:1 rd:2
Oct 27 14:51:38 sherbet kernel:  disk 0, wo:1, o:1, dev:sdb2
Oct 27 14:51:38 sherbet kernel:  disk 1, wo:0, o:1, dev:sdc2
Oct 27 14:51:38 sherbet kernel: md: delaying recovery of md1 until md2 has finished (they share one or more physical units)
Oct 27 14:51:38 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current] 
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
Oct 27 14:51:38 sherbet kernel: end_request: I/O error, dev sdc, sector 42381222
Oct 27 14:51:38 sherbet kernel: raid1: sdc: unrecoverable I/O read error for block 1664
Oct 27 14:51:39 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
Oct 27 14:51:39 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 27 14:51:39 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current] 
Oct 27 14:51:39 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
Oct 27 14:51:39 sherbet kernel: end_request: I/O error, dev sdc, sector 42383574
Oct 27 14:51:39 sherbet kernel: raid1: sdc: unrecoverable I/O read error for block 4096
Oct 27 14:51:40 sherbet kernel: md: md2: recovery done.
Oct 27 14:51:40 sherbet kernel: md: recovery of RAID array md1
Oct 27 14:51:40 sherbet kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct 27 14:51:40 sherbet kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 27 14:51:40 sherbet kernel: md: using 128k window, over a total of 20972784 blocks.

This repeats ad nauseam…

Hardware is a 3ware controller (3w-9xxx is the driver name; I'm not sure of the exact model). Three 1TB drives from three different manufacturers (to avoid a common drive design fault affecting all of them).

Hi
To me that looks like you have two faulty drives, sda and sdc; use smartctl to check ALL the drives' error stats. If sdb looks good, buy another two of those for your array…


smartctl -a /dev/sda
smartctl -a /dev/sdb
smartctl -a /dev/sdc
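
If those discs sit behind the 3ware controller rather than on plain SATA ports, smartctl may need to be pointed at the controller device with the 3ware device type instead, something like the following (the port numbers are a guess; match them to the controller's port-to-disc mapping):


smartctl -a -d 3ware,0 /dev/twa0
smartctl -a -d 3ware,1 /dev/twa0
smartctl -a -d 3ware,2 /dev/twa0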


Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 11.4 (x86_64) Kernel 2.6.37.6-0.7-desktop
up 17:51, 3 users, load average: 0.00, 0.03, 0.06
GPU GeForce 8600 GTS Silent - Driver Version: 285.05.09

Thanks DD. Yes, you could be right, maybe I should have gone for SLES (we have other machines running this) – I can't remember the reason for choosing openSUSE at the time, but I will say that if you get a stable version of openSUSE, it really cuts the mustard.

Thanks Malcolm, I'll continue on Monday. I just don't think it's likely that two drives from two manufacturers would fail at the same time; I think controller failure is more likely. The plan now is to move the data from backup to a CIFS share, and then I can take my sick server down for offline troubleshooting. I'll post whatever I find.
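
(For the copy itself it will probably just be something simple like the following; the share name and paths here are placeholders:)


# mount the share that will hold the data while the server is down, then copy the backup over
mount -t cifs //nas/scratch /mnt/scratch -o username=admin
rsync -av /backup/data/ /mnt/scratch/data/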

Good weekend all.

Hi guys, thanks for all your input.

It was the controller (3Ware 9500-4LP SATA RAID controller) that had failed. In the process it seems to have corrupted the data on both the active drives. I’ve replaced the 3Ware card and am about to reformat, rebuild and restore the data.

Thanks again,

David.