
Thread: md raid6 array can't be forced to re-assemble

  1. #1

    Default md raid6 array can't be forced to re-assemble

    I've got an 8.4TiB raid6 array of 30 fibre channel drives which was working fairly well up until a few weeks ago. At that point it failed, and since I only had remote access, I shut down the server and drive arrays with the help of a friend.

    I noticed that there were 3 failed drives in an array with a redundancy of 2 (clearly a problem). I tried mdadm --assemble --force on the drives, which only says:
    Code:
     mdadm: /dev/md/md0 assembled from 27 drives - not enough to start the array.
    How helpful.
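
    (For reference, the invocation was roughly the following - the device list here is reconstructed from the --examine output below, so treat it as approximate:)
    Code:
    mdadm --assemble --force /dev/md/md0 /dev/sd[c-z] /dev/sda[a-f]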

    Examining the event counts reveals:
    Code:
    ark:/var/log # mdadm --examine /dev/sd[c-z] /dev/sda[a-f] | egrep 'Event|/dev/sd'
    /dev/sdc:
             Events : 42661
    /dev/sdd:
             Events : 42661
    /dev/sde:
             Events : 42661
    /dev/sdf:
             Events : 42661
    /dev/sdg:
             Events : 42661
    /dev/sdh:
             Events : 42661
    /dev/sdi:
             Events : 42661
    /dev/sdj:
             Events : 42661
    /dev/sdk:
             Events : 42661
    /dev/sdl:
             Events : 42661
    /dev/sdm:
             Events : 42661
    /dev/sdn:
             Events : 42661
    /dev/sdo:
             Events : 42661
    /dev/sdp:
             Events : 42661
    /dev/sdq:
             Events : 42661
    /dev/sdr:
             Events : 19
    /dev/sds:
             Events : 41915
    /dev/sdt:
             Events : 42661
    /dev/sdu:
             Events : 42661
    /dev/sdv:
             Events : 42661
    /dev/sdw:
             Events : 42661
    /dev/sdx:
             Events : 42661
    /dev/sdy:
             Events : 42661
    /dev/sdz:
             Events : 42661
    /dev/sdaa:
             Events : 42661
    /dev/sdab:
             Events : 42661
    /dev/sdac:
             Events : 42661
    /dev/sdad:
             Events : 19
    /dev/sdae:
             Events : 42661
    /dev/sdaf:
             Events : 42661
    One drive is down by roughly 750 events (41915 vs. 42661) and two are all the way down at 19. Their update time, per mdadm --examine, is two weeks earlier than the rest, though well after I created the array. I'm not sure why this is the case, as the array has never shown any failed drives or any errors assembling or starting. I believe no writes were made to the array after the final failed drive (the one roughly 750 events behind) was last updated, so I would think the data is probably all still there.
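
    (The update times came out of the same sort of check, e.g.:)
    Code:
    mdadm --examine /dev/sd[c-z] /dev/sda[a-f] | egrep '/dev/sd|Update Time|Events'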

    smartctl also reveals nothing unusual - all drives appear healthy.
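
    (That check was just a loop over all the members, along these lines:)
    Code:
    for d in /dev/sd[c-z] /dev/sda[a-f]; do echo "== $d"; smartctl -H $d; done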

    What might I be missing here? I feel like it's got to be something simple, and I would hope the array can be reassembled (obviously).
    What should/can I do about the low-event-count drives? Those counts seem wrong no matter what.

  2. #2
    Join Date
    Sep 2012
    Posts
    7,469

    Default Re: md raid6 array can't be forced to re-assemble

    You may try some creative combination of --create, --assume-clean and "missing". But you may not get a second chance ...
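
    Something along these lines - but note this is only a sketch: every value here (metadata version, chunk size, layout, device order) is a placeholder and must be taken from mdadm --examine on the surviving members, the dead slots get "missing", and --create rewrites the superblocks, so treat it as a last resort:
    Code:
    # DANGEROUS - only after recording the existing superblocks; all values below are placeholders
    # device order is assumed to follow the device names here - the real slot order comes from --examine
    mdadm --create /dev/md0 --assume-clean --level=6 --raid-devices=30 \
          --metadata=1.2 --chunk=512 --layout=left-symmetric \
          /dev/sd{c..q} missing /dev/sd{s..z} /dev/sda{a..c} missing /dev/sda{e,f}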

  3. #3

    Default Re: md raid6 array can't be forced to re-assemble

    Well, I got it to reassemble. Very interesting process. Still wondering what caused this - and what I can do to prevent it.

    I didn't initially want to attempt mdadm --create, since 27 drives had no issues (and one more didn't seem *that* far off). Instead, I did some manual superblock editing.

    http://raid.wiki.kernel.org/index.ph...rblock_formats details the format and location of the superblock. I dd'ed the superblock off both a "good" drive and the best "failed" drive, and edited the update time and event count of the failed drive to match the good one. After applying an overlay to the failed drive, as per http://raid.wiki.kernel.org/index.ph..._software_RAID, I dd'ed the modified superblock back and tried to assemble. That didn't work: mdadm --examine revealed that the superblock checksum was wrong - and conveniently provided the correct value. After fixing the checksum, I was able to reassemble the array. It is currently listed as "clean, degraded" and is read-only for the time being. fsck -nf doesn't report any problems, though I will be checking some files by hand to ensure their integrity.
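
    Roughly, the superblock surgery went like this (a sketch rather than a recipe: it assumes v1.2 metadata with the superblock 4 KiB from the start of the device, the drive names and the 2G COW size are illustrative, and the overlay is the dm snapshot trick from the wiki page above):
    Code:
    # copy the superblock region (v1.2: 4 KiB offset) from a good drive and the "best" failed one
    dd if=/dev/sdc of=sdc-sb.bin bs=4096 skip=1 count=1
    dd if=/dev/sds of=sds-sb.bin bs=4096 skip=1 count=1

    # overlay the failed drive so the real disk isn't written to (sparse COW file + dm snapshot)
    truncate -s 2G sds-cow.img
    losetup /dev/loop0 sds-cow.img
    dmsetup create sds-ov --table "0 $(blockdev --getsz /dev/sds) snapshot /dev/sds /dev/loop0 P 8"

    # hex-edit sds-sb.bin (event count, update time, then the checksum mdadm --examine complains about),
    # write it back to the overlay, and assemble with /dev/mapper/sds-ov in place of /dev/sds
    dd if=sds-sb.bin of=/dev/mapper/sds-ov bs=4096 seek=1 count=1 conv=notrunc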

    A few questions:
    1.) Is it safe to replace the two completely failed drives (the ones with event counts of 19) and rebuild? Will this change any data on the non-failed drives? Will it update the superblocks (which currently, erroneously, list 3 failed drives)?
    2.) Is there anything that fsck -nf wouldn't have found? Again, I hadn't written to the array after the "failure" of the third drive, though its update time was behind by three days. I have the ability to copy any and all data to new hard drives, or even to directly image every drive in the array if necessary, but at this size that would take a while.
    3.) Why weren't two of the drives being updated? I most definitely had no spares configured, and the array was not reported as degraded before this happened.

    On a related note, is this typical for mdadm? I initially chose it due to its maturity and integration with the kernel, but I'm open to suggestions for other, more reliable software RAID setups.

  4. #4
    Join Date
    Nov 2009
    Location
    West Virginia Sector 13
    Posts
    16,310

    Default Re: md raid6 array can't be forced to re-assemble

    You can only lose 2 disks out of the array and still expect it to function. Loss of the third killed the array.

    But that raises the question what kind of disks are you using that seem to fall over so often??

  5. #5
    Join Date
    Feb 2009
    Location
    Spain
    Posts
    25,547

    Default Re: md raid6 array can't be forced to re-assemble

    On 2014-01-17 01:56, gogalthorp wrote:
    >
    > You can only lose 2 disks out of the array and still expect it to
    > function. Loss of the third killed the array.
    >
    > But that raises the question what kind of disks are you using that seem
    > to fall over so often??


    I think I read about some type of disk or some hardware configuration that removed disks from an array for seemingly trivial reasons.

    Ah, found it, perhaps. Hard disks can be labeled "raid edition".

    [URL]http://lists.opensuse.org/opensuse/2013-08/msg00732.html[/URL]

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 12.3 x86_64 "Dartmouth" at Telcontar)

  6. #6

    Default Re: md raid6 array can't be forced to re-assemble

    Quote Originally Posted by gogalthorp View Post
    You can only lose 2 disks out of the array and still expect it to function. Loss of the third killed the array.

    But that raises the question what kind of disks are you using that seem to fall over so often??
    I understand that the third loss killed it. What I'm wondering is what happened to the first two (only 19 events? I had written over a TB by that point, so that can't be right) when they were never reported as failed and the array was reported as healthy. As for the disks, they're older fibre channel drives previously used in a giant EMC rack setup, but as far as I can tell they didn't actually fail, since smartctl -a doesn't report anything out of the ordinary.

    Quote Originally Posted by robin_listas View Post
    I think I read about some type of disk or some hardware configuration that removed disks from an array for seemingly trivial reasons.

    Ah, found it, perhaps. Hard disks can be labeled "raid edition".

    [URL]http://lists.opensuse.org/opensuse/2013-08/msg00732.html[/URL]
    As above, these were previously used in an EMC data rack and therefore I'd assume this is precisely what they're intended for. While mdadm isn't quite FLARE, I would assume parity-based RAID is fairly similar in overall hardware demands between implementations.

    Is it safe to reassemble as read-write and rebuild, though? I tested several files (all of my FLACs and a movie rip) and nothing came up bad. However, I believe the superblocks are still slightly inconsistent (the non-failed drives report 3 failed drives, while the one whose superblock I modified reports only two), and I'd like to make sure no data corruption occurs, since the data is currently intact, just read-only.
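
    (What I have in mind for the rebuild, roughly - the replacement device names are placeholders:)
    Code:
    mdadm --readwrite /dev/md0                            # take the array out of read-only
    mdadm --manage /dev/md0 --remove /dev/sdr /dev/sdad   # drop the two stale members, if they are still listed
    mdadm --manage /dev/md0 --add /dev/sdX /dev/sdY       # add the replacements; recovery should start on its own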

  7. #7
    Join Date
    Nov 2009
    Location
    West Virginia Sector 13
    Posts
    16,310

    Default Re: md raid6 array can't be forced to re-assemble


  8. #8
    Join Date
    Sep 2012
    Posts
    7,469

    Default Re: md raid6 array can't be forced to re-assemble

    Quote Originally Posted by gjsmo View Post
    1.) ... rebuild? Will this change any data on the non-failed drives?
    It should not.

    Quote Originally Posted by gjsmo View Post
    Will it update the superblocks (which currently, erroneously, list 3 failed drives)?
    You mean the superblocks on the replacement drives? So you intend to "replace" the failed drives with the same drives, without even investigating what happened? Good luck.

    Quote Originally Posted by gjsmo View Post
    2.) Is there anything that fsck -nf wouldn't have found?
    Sure. The actual user data corruption.

    Quote Originally Posted by gjsmo View Post
    3.) Why weren't two of the drives being updated?
    You have the system and you have the logs. How can we know? If you go for a single RAID array of 30 drives (and without a spare at that), you had better make sure you check its state often - see the sketch below.

    Quote Originally Posted by gjsmo View Post
    only 19 events?
    The event count changes when the RAID state changes. This implies something bad has happened to the array (you did not permanently add or remove drives, right?) 19 times already - drives temporarily dropping off and being rebuilt, or similar. That's not "only" 19; that's 19 problems you missed.
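
    A bare minimum would be something like this (the mail address is a placeholder, and most distributions already ship a boot service for the monitor):
    Code:
    # /etc/mdadm.conf - get mail when md reports Fail / DegradedArray events
    MAILADDR admin@example.com

    # run the monitor as a daemon
    mdadm --monitor --scan --daemonise

    # or simply look at the state regularly
    cat /proc/mdstat
    mdadm --detail /dev/md0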

  9. #9

    Default Re: md raid6 array can't be forced to re-assemble

    Quote Originally Posted by arvidjaar View Post
    You mean "superblock on replacement drives"? So you intend to "replace" failed drives with the same drives, without even investigating what happened? Good luck
    Sorry, I wasn't specific. I was wondering whether it would update the non-failed drives, or whether they would still think there were 3 failed drives. It appears that mdadm merely writes the array state portion of the superblock for operator information, rather than reading the array state from it.

    Quote Originally Posted by arvidjaar View Post
    Sure. The actual user data corruption.
    Is there any particular test I can do to check data integrity, then? I have checked a single 10GB contiguous file, which has no errors, and my FLAC collection, which is several thousand 30-60MB files, each with its own internal checksum. I would assume that since data is striped across the array, corruption on a single drive would touch most files, so the fact that these check out suggests a low likelihood of array-wide corruption - but I'm not really the expert.
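
    (For reference, the brute-force check I can run looks like this - /array stands in for the actual mount point:)
    Code:
    # FLAC stores an MD5 of the decoded audio, so 'flac -t' catches corruption in those files
    find /array/music -name '*.flac' -print0 | xargs -0 flac -st
    # for everything else: record checksums now and re-verify after the rebuild
    find /array -type f -print0 | xargs -0 md5sum > /root/array.md5
    md5sum -c --quiet /root/array.md5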

    Quote Originally Posted by arvidjaar View Post
    You have the system and you have the logs. How can we know? If you go for a single RAID array of 30 drives (and without a spare at that), you had better make sure you check its state often.

    The event count changes when the RAID state changes. This implies something bad has happened to the array (you did not permanently add or remove drives, right?) 19 times already - drives temporarily dropping off and being rebuilt, or similar. That's not "only" 19; that's 19 problems you missed.
    Sorry, again I wasn't as specific as I could be. What I mean is: what would cause two drives to be left behind like that? As I said, I had no errors before the third failure - the day before, the array reported no errors or failures. I don't see how I can be expected to check the state if it isn't being reported correctly.
    As far as event counts go, I was under the impression that any write caused them to increment. Otherwise my array is in truly bad shape, as the non-failed drives report counts in the 42000s.


    To be perfectly clear, I realize I'm a clueless noob. This is unfortunately the best solution I can afford right now for large data storage (the drives and racks were free), and the help is very much appreciated. The array is rebuilding (with new(er) drives); it seems to be going fine, but I'll have to see if it's still OK after a few days.
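
    (Progress, for anyone curious, shows up in the usual places:)
    Code:
    cat /proc/mdstat                                  # recovery percentage, speed and finish estimate
    mdadm --detail /dev/md0 | egrep 'State|Rebuild'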

  10. #10
    Join Date
    Feb 2009
    Location
    Spain
    Posts
    25,547

    Default Re: md raid6 array can't be forced to re-assemble

    On 2014-01-17 16:26, gjsmo wrote:
    > To be perfectly clear, I realize I'm a clueless noob. This is
    > unfortunately the best solution I can afford (the drives and racks were
    > free) right now for large data storage, and the help is very much
    > appreciated. The array is rebuilding (with new(er) drives), it seems to
    > be going fine, but I'll have to see if it's still ok after a few days.


    If I need to store large amounts of data, I use separate drives with different mount points instead of combining several disks into one array. The danger is lower.

    Unfortunately, I don't know of a way to automatically distribute files from several directories so that they appear under the same directory - a distributed filesystem of sorts.

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 12.3 x86_64 "Dartmouth" at Telcontar)
