
Thread: Software RAID rebuilds continuously

  1. #11

    Default Re: Software RAID rebuilds continuously

    Hi
I would take the faulty device offline as well; maybe that's what's causing it to rebuild?
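Something along these lines should do it (just a sketch; substitute your actual failed member partition for the hypothetical /dev/sdXn):
Code:
mdadm --manage /dev/md1 --fail /dev/sdXn
mdadm --manage /dev/md1 --remove /dev/sdXn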
    Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
    SUSE SLE, openSUSE Leap/Tumbleweed (x86_64) | GNOME DE

  2. #12

    Default Re: Software RAID rebuilds continuously

    On 10/26/2011 06:26 PM, wickedmonkey wrote:
    > I'm wary about blindly
    > applying updates where they're not needed -- especially if something's
    > working well.


don't get me wrong...I cling to "If it ain't broke don't fix it" like
white on rice..

the thing is, IMHO (which others here may not agree with), openSUSE is a
poor choice for any commercial or private user who wishes to run
"perfectly for 3 to 4 years non-stop", because its short lifetime
dictates an upgrade or a new install about every 15 to 18 months..

for example, the most recent supported version (11.4), first released on
March 10th 2011, reaches end of life on September 15th 2012 (see
<http://en.opensuse.org/Lifetime>)..

    --
    DD
    openSUSE®, the "German Automobiles" of operating systems

  3. #13

    Default Re: Software RAID rebuilds continuously

    Results from the monitor command for /dev/md1:

    Code:
Oct 26 21:22:10: Rebuild40 on /dev/md1 unknown device
Oct 26 21:23:10: Rebuild60 on /dev/md1 unknown device
Oct 26 21:25:10: Rebuild80 on /dev/md1 unknown device
Oct 26 21:25:53: RebuildFinished on /dev/md1 unknown device
Oct 26 21:26:07: RebuildStarted on /dev/md1 unknown device
Oct 26 21:28:07: Rebuild20 on /dev/md1 unknown device
Oct 26 21:30:07: Rebuild40 on /dev/md1 unknown device
Oct 26 21:31:07: Rebuild60 on /dev/md1 unknown device
Oct 26 21:33:07: Rebuild80 on /dev/md1 unknown device
Oct 26 21:33:51: RebuildFinished on /dev/md1 unknown device
Oct 26 21:34:08: RebuildStarted on /dev/md1 unknown device
Oct 26 21:36:08: Rebuild20 on /dev/md1 unknown device
Oct 26 21:38:08: Rebuild40 on /dev/md1 unknown device
Oct 26 21:39:08: Rebuild60 on /dev/md1 unknown device
Oct 26 21:41:08: Rebuild80 on /dev/md1 unknown device
Oct 26 21:41:43: RebuildFinished on /dev/md1 unknown device
Oct 26 21:41:48: RebuildStarted on /dev/md1 unknown device
...etc, etc, etc... (the same roughly eight-minute rebuild cycle repeats without a break, right through 23:38 and beyond)

I've heard that re-syncing can take a long time, but this doesn't look right to me, and it's been going for over 50 hours now.

Looking at the logs, I'm seeing read errors on the other active disc as well, so, since all three discs are different brands, I'm thinking the controller is on its way out.

  4. #14

    Default Re: Software RAID rebuilds continuously

    wickedmonkey wrote:
> Looking at the logs, I'm seeing read errors on the other active disc as
> well, so, since all three discs are different brands, I'm thinking the
> controller is on its way out.


    If it is, there should be libata error messages in /var/log/messages.

    [ Though again, there've been lots of changes/bugfixes/improvements over
    time, so maybe your system doesn't log them in the same way. ]

    And if you are seeing error messages, then that's a very different
    situation to mdadm mysteriously failing to rebuild an array.

    Perhaps you should post your hardware details and the log messages that
    you do see.
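Something like this would capture the useful bits (just a sketch; adjust the log path and grep patterns to whatever your system actually uses):
Code:
cat /proc/mdstat
mdadm --detail /dev/md1
grep -iE 'ata|scsi|raid' /var/log/messages | tail -n 100
lspci | grep -iE 'raid|sata|ide'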

  5. #15

    Default Re: Software RAID rebuilds continuously

    Sorry for the delay: t'Interweb issues.

    No libata messages in /var/log/messages. The only thing I can see that may be relevant is the following:

    Code:
Oct 27 14:51:36 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
Oct 27 14:51:36 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 27 14:51:36 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current]
Oct 27 14:51:36 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
Oct 27 14:51:36 sherbet kernel: end_request: I/O error, dev sdc, sector 42381006
Oct 27 14:51:36 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current]
Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
Oct 27 14:51:38 sherbet kernel: end_request: I/O error, dev sdc, sector 42383438
    Oct 27 14:51:38 sherbet kernel: RAID1 conf printout:
    Oct 27 14:51:38 sherbet kernel:  --- wd:1 rd:2
    Oct 27 14:51:38 sherbet kernel:  disk 0, wo:1, o:1, dev:sdb2
    Oct 27 14:51:38 sherbet kernel:  disk 1, wo:0, o:1, dev:sdc2
    Oct 27 14:51:38 sherbet kernel: RAID1 conf printout:
    Oct 27 14:51:38 sherbet kernel:  --- wd:1 rd:2
    Oct 27 14:51:38 sherbet kernel:  disk 1, wo:0, o:1, dev:sdc2
    Oct 27 14:51:38 sherbet kernel: RAID1 conf printout:
    Oct 27 14:51:38 sherbet kernel:  --- wd:1 rd:2
    Oct 27 14:51:38 sherbet kernel:  disk 0, wo:1, o:1, dev:sdb2
    Oct 27 14:51:38 sherbet kernel:  disk 1, wo:0, o:1, dev:sdc2
    Oct 27 14:51:38 sherbet kernel: md: delaying recovery of md1 until md2 has finished (they share one or more physical units)
    Oct 27 14:51:38 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
    Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
    Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current] 
    Oct 27 14:51:38 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
    Oct 27 14:51:38 sherbet kernel: end_request: I/O error, dev sdc, sector 42381222
    Oct 27 14:51:38 sherbet kernel: raid1: sdc: unrecoverable I/O read error for block 1664
    Oct 27 14:51:39 sherbet kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=1.
    Oct 27 14:51:39 sherbet kernel: sd 4:0:2:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
    Oct 27 14:51:39 sherbet kernel: sd 4:0:2:0: [sdc] Sense Key : Medium Error [current] 
    Oct 27 14:51:39 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
    Oct 27 14:51:39 sherbet kernel: end_request: I/O error, dev sdc, sector 42383574
    Oct 27 14:51:39 sherbet kernel: raid1: sdc: unrecoverable I/O read error for block 4096
    Oct 27 14:51:40 sherbet kernel: md: md2: recovery done.
    Oct 27 14:51:40 sherbet kernel: md: recovery of RAID array md1
    Oct 27 14:51:40 sherbet kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
    Oct 27 14:51:40 sherbet kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
    Oct 27 14:51:40 sherbet kernel: md: using 128k window, over a total of 20972784 blocks.
This repeats ad nauseam...

Hardware is a 3ware controller (3w-9xxx is the kernel driver name; I'm not sure of the exact model). Three 1 TB drives from three different manufacturers (to guard against a shared design fault).

  6. #16

    Default Re: Software RAID rebuilds continuously

    Quote Originally Posted by wickedmonkey
    Sorry for the delay: t'Interweb issues.

    No libata messages in /var/log/messages. The only thing I can see that
    may be relevant is the following:

    Code:
    Oct 27 14:51:36 sherbet kernel: sd 4:0:2:0: [sdc] Add. Sense: Unrecovered read error
    Oct 27 14:51:36 sherbet kernel: end_request: I/O error, dev sdc, sector 42381006
This repeats ad nauseam...

Hardware is a 3ware controller (3w-9xxx is the kernel driver name; I'm
not sure of the exact model). Three 1 TB drives from three different
manufacturers (to guard against a shared design fault).

    Hi
To me that looks like you have two faulty drives, sda and sdc; use
smartctl to check ALL the drives' error stats. If sdb looks good, buy
another two of those for your array....
    Code:
    smartctl -a /dev/sda
    smartctl -a /dev/sdb
    smartctl -a /dev/sdc
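Note: if those drives sit behind the 3ware controller (your 3w-9xxx kernel messages suggest they do), plain /dev/sdX may not expose the SMART data; smartmontools has a 3ware passthrough for that, something like this (assuming ports 0-2 on the first controller):
Code:
smartctl -a -d 3ware,0 /dev/twa0
smartctl -a -d 3ware,1 /dev/twa0
smartctl -a -d 3ware,2 /dev/twa0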
    --
    Cheers Malcolm °¿° (Linux Counter #276890)
    openSUSE 11.4 (x86_64) Kernel 2.6.37.6-0.7-desktop
    up 17:51, 3 users, load average: 0.00, 0.03, 0.06
    GPU GeForce 8600 GTS Silent - Driver Version: 285.05.09


  7. #17

    Default Re: Software RAID rebuilds continuously

Thanks DD. Yes, you could be right; maybe I should have gone for SLES (we have other machines running it) -- I can't remember the reason for choosing openSUSE at the time, but I will say that if you get a stable version of openSUSE, it really cuts the mustard.

Thanks Malcolm, I'll continue on Monday. I just don't think it's likely that two drives from two different manufacturers would fail at the same time; I think controller failure is more likely. The plan now is to move the data from backup to a CIFS share, and then I can take my sick server down for offline troubleshooting. I'll post whatever I find.
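(Roughly this, with made-up names for the NAS, share, mount point, and credentials:)
Code:
mount -t cifs //nas/backup /mnt/backup -o username=backupuser
rsync -aH /srv/restore-from-backup/ /mnt/backup/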

    Good weekend all.

  8. #18

    Default Re: Software RAID rebuilds continuously

    Hi guys, thanks for all your input.

It was the controller (a 3ware 9500-4LP SATA RAID controller) that had failed, and in the process it seems to have corrupted the data on both of the active drives. I've replaced the 3ware card and am about to reformat, rebuild, and restore the data.
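For anyone following along, the rebuild will be roughly this (a sketch only; device names are from my box and the mount point is made up):
Code:
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb2 /dev/sdc2
mkfs.ext3 /dev/md1
mount /dev/md1 /srv/data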

    Thanks again,

    David.

