I’ve got an 8.4 TiB RAID6 array with 30 Fibre Channel drives which was working fairly well up until a few weeks ago. At that time it failed, and as I only had remote access, I shut down the server and drive arrays with the help of a friend.
I noticed that there are 3 failed drives in an array that can only tolerate 2 failures (clearly a problem). I tried running mdadm --assemble --force on the drives, which only says
mdadm: /dev/md/md0 assembled from 27 drives - not enough to start the array.
How helpful.
Examining the event counts reveals:
ark:/var/log # mdadm --examine /dev/sd[c-z] /dev/sda[a-f] | egrep 'Event|/dev/sd'
/dev/sdc:
Events : 42661
/dev/sdd:
Events : 42661
/dev/sde:
Events : 42661
/dev/sdf:
Events : 42661
/dev/sdg:
Events : 42661
/dev/sdh:
Events : 42661
/dev/sdi:
Events : 42661
/dev/sdj:
Events : 42661
/dev/sdk:
Events : 42661
/dev/sdl:
Events : 42661
/dev/sdm:
Events : 42661
/dev/sdn:
Events : 42661
/dev/sdo:
Events : 42661
/dev/sdp:
Events : 42661
/dev/sdq:
Events : 42661
/dev/sdr:
Events : 19
/dev/sds:
Events : 41915
/dev/sdt:
Events : 42661
/dev/sdu:
Events : 42661
/dev/sdv:
Events : 42661
/dev/sdw:
Events : 42661
/dev/sdx:
Events : 42661
/dev/sdy:
Events : 42661
/dev/sdz:
Events : 42661
/dev/sdaa:
Events : 42661
/dev/sdab:
Events : 42661
/dev/sdac:
Events : 42661
/dev/sdad:
Events : 19
/dev/sdae:
Events : 42661
/dev/sdaf:
Events : 42661
One drive (/dev/sds) is behind by about 750 events (41915 versus 42661), and two (/dev/sdr and /dev/sdad) are all the way down at 19. Their update time as per mdadm --examine is two weeks earlier than the rest, though considerably after I created the array. I'm not sure why this is the case, as the array has never shown any failed drives or errors when assembling or starting. I believe the last drive to fail (the one about 750 events behind) was last updated before any writes were done to the array, so I would think the data on it is probably still valid.
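To make the stragglers easier to spot than in the long listing above, here is a rough sketch of how the same output could be grouped by event count. The function name summarize_events is just something I made up for illustration, and it assumes the input looks like the egrep output above (a "/dev/sdX:" line followed by an "Events : N" line).

```shell
# Hypothetical helper: reads the filtered "mdadm --examine" output on stdin
# and prints one line per distinct event count, lowest first:
#   <events> <number of devices> <device list>
summarize_events() {
    awk '
        # A line starting with /dev/ names the device the next Events line belongs to.
        /^\/dev\// { dev = $1; sub(/:$/, "", dev); next }
        # The event count is the last field on the Events line.
        /Events/   { n = $NF; count[n]++; devs[n] = devs[n] " " dev }
        END        { for (n in count) print n, count[n] devs[n] }
    ' | sort -n
}
```

It would be used by piping the same command into it, for example: mdadm --examine /dev/sd[c-z] /dev/sda[a-f] | egrep 'Event|/dev/sd' | summarize_events. This only reads the superblocks and changes nothing on the drives.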
smartctl also reveals nothing unusual - all drives appear as healthy.
What might I be missing here? I feel like it’s got to be something simple. I would hope that it can be reassembled (obviously).
What should/can I do about the low event drives? That seems wrong no matter what.