md raid6 array can't be forced to re-assemble

I’ve got an 8.4TiB RAID6 array with 30 fibre channel drives which was working fairly well up until a few weeks ago. At that point it failed, and as I only had remote access, I shut down the server and drive arrays with the help of a friend.

I noticed that there are 3 failed drives in an array with a redundancy of 2 (clearly a problem). I tried mdadm --assemble --force on the drives, which only says:

 mdadm: /dev/md/md0 assembled from 27 drives - not enough to start the array.

How helpful.
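For reference, the full invocation was something like this (reconstructed from memory - the device list matches the --examine output below):

  mdadm --assemble --force /dev/md/md0 /dev/sd[c-z] /dev/sda[a-f]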

Examining the event counts reveals:

ark:/var/log # mdadm --examine /dev/sd[c-z] /dev/sda[a-f] | egrep 'Event|/dev/sd'
/dev/sdc:
         Events : 42661
/dev/sdd:
         Events : 42661
/dev/sde:
         Events : 42661
/dev/sdf:
         Events : 42661
/dev/sdg:
         Events : 42661
/dev/sdh:
         Events : 42661
/dev/sdi:
         Events : 42661
/dev/sdj:
         Events : 42661
/dev/sdk:
         Events : 42661
/dev/sdl:
         Events : 42661
/dev/sdm:
         Events : 42661
/dev/sdn:
         Events : 42661
/dev/sdo:
         Events : 42661
/dev/sdp:
         Events : 42661
/dev/sdq:
         Events : 42661
/dev/sdr:
         Events : 19
/dev/sds:
         Events : 41915
/dev/sdt:
         Events : 42661
/dev/sdu:
         Events : 42661
/dev/sdv:
         Events : 42661
/dev/sdw:
         Events : 42661
/dev/sdx:
         Events : 42661
/dev/sdy:
         Events : 42661
/dev/sdz:
         Events : 42661
/dev/sdaa:
         Events : 42661
/dev/sdab:
         Events : 42661
/dev/sdac:
         Events : 42661
/dev/sdad:
         Events : 19
/dev/sdae:
         Events : 42661
/dev/sdaf:
         Events : 42661

One drive is down by about 750 events (41915 vs. 42661) and two are all the way down at 19. Their update times as per mdadm --examine are two weeks behind the rest, though well after I created the array. I’m not sure why this is the case, as the array never showed any failed drives or any errors assembling or starting. I believe the last drive to fail (the one about 750 events behind) was last updated before any further writes were done to the array, so I would think the data is all still there.

smartctl also reveals nothing unusual - all drives appear as healthy.

What might I be missing here? I feel like it’s got to be something simple, and I would hope the array can be reassembled (obviously).
What should/can I do about the low-event-count drives? That seems wrong no matter what.

You may try some creative combination of --create, --assume-clean and “missing”. But you may not have a second chance…

Well, I got it to reassemble. Very interesting process. Still wondering what caused this - and what I can do to prevent it.

I didn’t initially want to attempt mdadm --create, since 27 drives had no issues (and one didn’t seem that far off). Instead, I did some manual superblock editing.

http://raid.wiki.kernel.org/index.php/RAID_superblock_formats details the format and location of the superblock. I dd’ed the superblock off both a “good” drive and the best “failed” drive, and edited the update time and event count of the failed drive to match the good one. After applying an overlay to the failed drive as per http://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID, I dd’ed the modified superblock back and tried to assemble. It didn’t work: mdadm --examine revealed that the superblock checksum was wrong - and conveniently provided the expected checksum. After correcting this, I was able to reassemble the array.

It is currently listed as “clean, degraded” and is read-only for the time being. fsck -nf doesn’t report any problems, though I will be checking some files by hand to ensure their integrity.
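For anyone finding this later, the procedure was roughly as follows. This is a sketch from memory: the offsets assume a v1.2 superblock (4096 bytes from the start of each member device - the first wiki page lists the locations for other versions), and the overlay naming is illustrative:

  # 1. pull the superblock off a good member and off the failed one
  dd if=/dev/sdc of=sb-good.bin bs=4096 skip=1 count=1
  dd if=/dev/sds of=sb-failed.bin bs=4096 skip=1 count=1

  # 2. hex-edit the event count and update time in sb-failed.bin to match
  #    sb-good.bin (field offsets are on the RAID_superblock_formats page)

  # 3. put a copy-on-write overlay over the failed drive so the real disk
  #    is never touched (per the Recovering_a_failed_software_RAID page)
  dd if=/dev/zero of=overlay.img bs=1M count=0 seek=4096    # sparse 4GiB COW file
  losetup /dev/loop0 overlay.img
  SIZE=$(blockdev --getsz /dev/sds)                         # size in 512-byte sectors
  dmsetup create sds-overlay --table "0 $SIZE snapshot /dev/sds /dev/loop0 P 8"

  # 4. write the edited superblock to the overlay, not the raw disk
  dd if=sb-failed.bin of=/dev/mapper/sds-overlay bs=4096 seek=1 count=1 conv=notrunc

  # 5. assemble with the overlay standing in for the failed drive; when
  #    --examine complains about a bad checksum it also prints the expected
  #    value - patch that in and repeat
  mdadm --assemble --force /dev/md/md0 /dev/mapper/sds-overlay <other 27 members>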

A few questions:
1.) Is it safe to replace the two completely failed drives (with event counts of 19) and rebuild? Will this change any data on the non-failed drives? Will it update the superblocks (which currently (erroneously) list 3 failed drives)?
2.) Is there anything which fsck -nf wouldn’t have found? Again, I hadn’t written to the array after the “failure” of the third drive, though its update time was behind by three days. I have the ability to copy any and all data to new hard drives, or even directly image every drive in the array if necessary, but imaging an array of this size would take a while.
3.) Why weren’t two drives being updated? I most definitely had no marked spares, and the array was not reported to be degraded before this happened.

On a related note, is this typical of mdadm? I initially chose it due to its maturity and integration with the kernel, but I’m open to suggestions for other, more reliable software RAID setups.

You can only lose 2 disks out of the array and still expect it to function. Loss of the third killed the array.

But that raises the question: what kind of disks are you using that seem to fall over so often?

On 2014-01-17 01:56, gogalthorp wrote:
>
> You can only lose 2 disks out of the array and still expect it to
> function. Loss of the third killed the array.
>
> But that raises the question: what kind of disks are you using that
> seem to fall over so often?

I think I read about some types of disks, or some hardware config, that
dropped disks out of an array for seemingly trivial reasons.

Ah, found it, perhaps. Hard disks can be labeled “raid edition”.

http://lists.opensuse.org/opensuse/2013-08/msg00732.html


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

I understand that the third loss killed it. Rather, I’m wondering what happened to the first two (with only 19 events? I had written over a TB by that time, so that can’t be right) when they weren’t reported as failed and the array was reported as healthy. As for the disks, they’re older fibre channel disks previously used in a giant EMC rack setup, but as far as I can tell they didn’t actually fail, since smartctl -a doesn’t report anything out of the ordinary.

As above, these were previously used in an EMC data rack, so I’d assume this is precisely what they’re intended for. While mdadm isn’t quite FLARE, I would assume parity-based RAID makes fairly similar hardware demands across implementations.

Is it safe to reassemble as read-write and rebuild, though? I tested several files (all of my FLACs and a movie rip) and nothing came up bad. However, I believe the superblocks are still slightly inconsistent (the non-failed drives report 3 drives failed; the one whose superblock I modified reports only two), and I’d like to ensure no data corruption occurs (since the data is currently intact, just read-only).

Best I can do is give some references

http://www.techrepublic.com/blog/the-enterprise-cloud/how-to-protect-yourself-from-raid-related-unrecoverable-read-errors-ures/

Here is a nice discussion
http://forums.storagereview.com/index.php/topic/34094-is-raid-56-dead-due-to-large-drive-capacities/

Will this change any data on the non-failed drives?

It should not.

Will it update the superblocks (which currently (erroneously) list 3 failed drives)?

You mean “superblocks on the replacement drives”? So you intend to “replace” the failed drives with the same drives, without even investigating what happened? Good luck :)

2.) Is there anything which fsck -nf wouldn’t have found?

Sure. The actual user data corruption.

3.) Why weren’t two drives being updated?

You have the system and you have the logs - how can we know? If you go for a single RAID array of 30 drives (and that’s without a spare), you’d better make sure you check its state often.

with only 19 events?

The event count changes when the RAID state changes. This would imply something bad has happened to the array (you did not permanently add or remove drives, right?) 19 times already - like drives temporarily dropping off and being rebuilt, or similar. That’s not “only” 19 times; that’s 19 times you missed a problem.

Sorry, I wasn’t specific. I was wondering if it would update the non-failed drives, or still think that there were 3 failed drives. It appears that mdadm merely writes the array state portion of the superblock for operator information, rather than reading the array state from it.

Sure. The actual user data corruption.

Is there any particular test I can do to check data integrity, then? I have checked a single 10GB contiguous file, which has no errors, and my FLAC collection, which has several thousand 30-60MB files, each with its own checksum. I would assume that since data is striped across the array, corruption of a single drive would corrupt files across the whole array, so the clean checks indicate a low likelihood of array-wide corruption - but I’m not really the expert.
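(For the FLAC check I just leaned on the format’s built-in MD5 signature - the paths here are examples:)

  # flac -t decodes each file and compares it against the MD5 stored in its
  # STREAMINFO block, so silent corruption shows up as an error
  find /mnt/array/music -name '*.flac' -print0 | xargs -0 flac -t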

You have the system and you have the logs - how can we know? If you go for a single RAID array of 30 drives (and that’s without a spare), you’d better make sure you check its state often.

The event count changes when the RAID state changes. This would imply something bad has happened to the array (you did not permanently add or remove drives, right?) 19 times already - like drives temporarily dropping off and being rebuilt, or similar. That’s not “only” 19 times; that’s 19 times you missed a problem.

Sorry, again not quite as specific as I could be. I should have said: what would cause two drives to be left behind like that? As I said, I had no errors prior to the third failure - the day before the failure, the array reported no errors or failures. I don’t see how I can be expected to check the state if it’s not being reported correctly.
As far as event counts go, I was under the impression that any write caused one to increment. Otherwise my array is in truly horrible shape, as the non-failed drives report counts in the 42000s.

To be perfectly clear, I realize I’m a clueless noob. This is unfortunately the best solution I can afford for large data storage right now (the drives and racks were free), and the help is very much appreciated. The array is rebuilding (with new(er) drives); it seems to be going fine, but I’ll have to see if it’s still OK after a few days.

On 2014-01-17 16:26, gjsmo wrote:
> To be perfectly clear, I realize I’m a clueless noob. This is
> unfortunately the best solution I can afford for large data storage
> right now (the drives and racks were free), and the help is very much
> appreciated. The array is rebuilding (with new(er) drives); it seems to
> be going fine, but I’ll have to see if it’s still OK after a few days.

If I need to store large amounts of data, I use separate drives with
different mount points, instead of combining several disks in one array.
The danger is smaller.

Unfortunately, I don’t know of a way to automatically distribute files
from several directories to appear under the same directory. A
distributed filesystem.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

Is that not what LVM is for? You can stitch multiple LVM containers together to make one huge file system
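Something along these lines (device names are just examples):

  # pool two disks into one volume group, then carve one big logical volume
  pvcreate /dev/sdb /dev/sdc
  vgcreate storage /dev/sdb /dev/sdc
  lvcreate -l 100%FREE -n joined storage
  mkfs.ext4 /dev/storage/joined
  mount /dev/storage/joined /storage/joined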

On 2014-01-18 03:56, gogalthorp wrote:

>> Unfortunately, I don’t know of a way to automatically distribute files
>> from several directories to appear under the same directory. A
>> distributed filesystem.

> Is that not what LVM is for? You can stitch multiple LVM containers
> together to make one huge file system

No, not that way. If one disk dies, you lose the entire setup. My way
you lose “only” the files that were stored on that disk.

Let me see if I can explain my idea.

You have these disks:


/storage/1/
/storage/2/
/storage/3/
/storage/4/
/storage/5/

and there is also


/storage/joined/

When you write a file (its size must be known in advance, so it can be
fully allocated), it is automatically stored on only one of those disks
(1…5), chosen by the system. But you only need to reference
“/storage/joined/”.

It is an idea, it does not exist, AFAIK.
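(The closest manual approximation I know of is symlinks - store each file on one disk and link it into the joined view, but nothing does this automatically:)

  # the real copy lives on whichever disk has room...
  cp big-movie.mkv /storage/3/
  # ...and the joined directory just points at it
  ln -s /storage/3/big-movie.mkv /storage/joined/big-movie.mkv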

I think that msdos had something of the sort, with the command “join”. I
don’t remember it well, long ago…
[…]
No, I’m mistaken.

http://en.wikipedia.org/wiki/List_of_DOS_commands#JOIN

Maybe “append”.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

Will it update the superblocks (which currently (erroneously) list 3 failed drives)?

After you tell it to rebuild the two failed drives? Of course it will update the superblocks during (or after) this process. How else could it keep track of the array state?

It appears that mdadm merely write to the array state portion of the superblock for operator information, rather than reading array state from it.

I’m not sure what you mean, but please note that after you manually tampered with the superblocks, anything may be possible.

Is there any particular test I can do to check data integrity then?

Your data should have self-integrity features, or you will need external checksums to verify it.

I have checked a single 10GB contiguous file, which has no errors, and my FLAC collection, which has several thousand 30-60MB files all with individual checksums. I would assume that since data is striped across the array that corruption of a single drive would cause corruption of all files, and therefore this would indicate a low likelihood of array-wide corruption, but I’m not really the expert.

As soon as the third drive failed, any write to the array was blocked. That means the drive that failed last should match the other 27 drives. Also, no rebuild took place. So you can be reasonably confident your data is as good as it was at the moment that drive failed. Attempting to “force clean” the two drives that failed earlier would be a disaster. You need to rebuild them.
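Roughly like this, using the device names from your listing - double-check everything against --examine before running it:

  # wipe the stale RAID metadata on the two replacement (or reused) drives
  mdadm --zero-superblock /dev/sdr /dev/sdad
  # then add them to the running array; md rebuilds them from parity
  mdadm /dev/md/md0 --add /dev/sdr /dev/sdad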

Sorry, again not quite as specific as I could be. I should have said: what would cause two drives to be left behind like that?

Again - how should we know? HDD firmware quirks, driver bugs, a power glitch… anything is possible. Without logs, nobody can say.

As I said, I had no errors prior to the third failure

How do you know? In what form do you expect those errors to appear?

I don’t see how I can be expected to check the state if it’s not being reported correctly.

You want to say that when your two drives failed, /proc/mdstat still claimed that all array members were OK? Did you enable the mdadmd service? Did you configure it to notify you when something bad happens to the array?
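Enabling it is along these lines (the mail address is an example; openSUSE wraps the monitor in the mdadmd service):

  # /etc/mdadm.conf - where alerts go
  MAILADDR you@example.com

  # run the monitor as a daemon (or let the mdadmd service start it)
  mdadm --monitor --scan --daemonise --delay=1800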

As far as event counts go, I was under the impression that any write caused one to increment. Otherwise my array is in truly horrible shape, as the non-failed drives report counts in the 42000s.

I stand corrected. It is not every write, but it is at least every time the array is brought online or taken offline, i.e. at least twice on every system boot.

Understood. I’ll be implementing an automatic checksum catalog (stored off the array, of course) then, unless anyone knows of a premade solution.
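Probably nothing fancier than this from cron (paths are placeholders):

  # build/refresh the catalog, stored off the array
  find /mnt/array -type f -print0 | xargs -0 sha256sum > /root/array.sha256

  # verify later; --quiet prints only mismatches
  sha256sum --quiet -c /root/array.sha256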

As soon as the third drive failed, any write to the array was blocked. That means the drive that failed last should match the other 27 drives. Also, no rebuild took place. So you can be reasonably confident your data is as good as it was at the moment that drive failed. Attempting to “force clean” the two drives that failed earlier would be a disaster. You need to rebuild them.

As it is, all three drives have been replaced. The one that was apparently “ok” got dd’ed onto a new drive after being repaired, and the other two were rebuilt from scratch onto new drives after that.

Again - how should we know? HDD firmware quirks, driver bugs, a power glitch… anything is possible. Without logs, nobody can say.

How do you know? In what form do you expect those errors to appear?

You want to say that when your two drives failed, /proc/mdstat still claimed that all array members were OK? Did you enable the mdadmd service? Did you configure it to notify you when something bad happens to the array?

Well, I’d expect mdadm to report an error in either dmesg or /proc/mdstat. And yes, as of the last working date prior to the third failure, /proc/mdstat showed a non-degraded, clean array - that is, more than two weeks after two drives apparently failed. As for notifications, I haven’t configured them, as I check manually every day along with various other admin tasks. Not necessarily the best idea, I realize, but I don’t see how that’s different from a notification, unless mdstat is updated separately from mdadm’s internal state. As for the logs, I’ll look over them again, but I was under the impression that anything logged would also appear in mdstat.

Seems like I’ve averted disaster this time, but it looks like I’ll need to start being more proactive about this, especially when using used hard drives.

If this is indeed true, it would be a major bug in the MD implementation. Do you have any evidence left? If yes, you need to report it by all means.

On 2014-01-19 13:16, arvidjaar wrote:
>
> gjsmo;2617338 Wrote:
>> And yes, as of the last working date prior to the third failure,
>> /proc/mdstat showed a non-degraded, clean array - that is, more than
>> two weeks after two drives apparently failed.
>
> If this is indeed true, it would be a major bug in the MD
> implementation. Do you have any evidence left? If yes, you need to
> report it by all means.

There is a daemon that watches and emails changes.


Cheers / Saludos,

Carlos E. R.
(from 12.3 x86_64 “Dartmouth” at Telcontar)

I doubt I have enough evidence - though lack of evidence is not evidence of absence. I don’t have a copy of /proc/mdstat from the appropriate days, which is what I was using to monitor the array. I’ve got the superblocks and the archived /var/log/messages from the appropriate days. Only the superblocks show evidence of the first failure (on December 12th); the log shows nothing until the date of the array failure.

I’m going to chalk this up to wonky drives for now. I have had a few drives fail as soon as they were plugged in during the array-building process, and one soon after the array was built. I certainly don’t prefer it, but these drives were what I could afford (read: free), and at the moment I don’t have a better solution than to weed out the bad ones. As of now, 27 drives have been running for 3 months, so it seems the majority are good.

On the subject of spares, how many are appropriate for an array of this size? Would it be OK for the moment (until I add more drives) to use a loop device on my (mostly unused) 1TB boot drive (given the 300GB array drives)?
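I’m imagining something like this (file path and loop number made up):

  # back a 300GB loop device with a sparse file on the boot drive
  truncate -s 300G /spare/md-spare.img
  losetup /dev/loop0 /spare/md-spare.img
  # on a healthy array, --add lands the device in the spare pool
  mdadm /dev/md/md0 --add /dev/loop0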

May I ask a question (or two) and suggest something?

#1. Is this business storage?
#2. Is this server used only for storage, aka a NAS device, or does it run something else as well?

If it is business storage, you need to make some decisions here; otherwise,
take a look at unRaid (http://lime-technology.com).
There are other similar solutions of course, but unRaid has its advantages in comparison, and given that there are some changes coming to light soon, it might be a good solution.

In summary: unRaid is a NAS-type server OS. It has no desktop GUI and is operated via a web interface only.
It is installed and run from a USB stick - very lightweight and mostly friendly to older hardware (as a bare-bones install it can easily run on half a gig of RAM). The only thing I am not sure of is whether it works with your storage devices, but you can easily ask.

It allows you to do just what you describe:
build a protected RAID-like array using a bunch of disks, but present it as a single volume to the user.
All files are distributed over all the disks, and if you lose 2 disks, only the data on those 2 disks is lost.
The only issue at this time is that it allows only a single parity disk (a RAID4-type config), so if you lose 2 drives you lose the data on them - but a single drive failure can be recovered from.
Check it out.

Not quite what I’m looking for - I need a general-purpose OS able to run various server applications (HTTP, FTP and Minecraft servers currently, with media streaming coming soon). It’s not a business application; this is at home. I’m happy with openSUSE (currently at 13.1).

Sorry, I thought you were running a NAS-type file server.
I currently have an unRaid file server for my XBMC HTPC,
but plan to move to an openSUSE/Samba setup
using a btrfs-based RAID6. So far I have been having issues setting it up, but I’m also trying to set up a VM server on the same box, and my hardware doesn’t fully support KVM/Xen, hence the additional headache :)