RAID5 Read Failure on 3 of 4 Drives

As the title implies, things seem really dire at the moment. I’ve got openSUSE 13.1 running on a set of RAID 1 drives (boot, root, home, and swap partitions) and another set of 4 drives in RAID 5 used for data only. SUSE was acting very strangely last night, and finally I just decided to reboot the computer… but it never came back!

This was the boot screen I was greeted with: http://paste.opensuse.org/71978326

As you can see, it tries to start the array in a degraded state, but then times out. It then tries to go into Emergency mode, but never successfully does, which seems strange, as all of the system partitions check out okay; you would think the OS would continue to boot. I booted into Rescue mode from a USB key and ran some smartctl self-tests on the drives.
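For reference, this is roughly what I ran against each of the four data drives (sdc shown here, then repeated for sdd, sde and sdf); the exact options may have differed slightly:


smartctl -t short /dev/sdc      # queue a short self-test
smartctl -l selftest /dev/sdc   # view the self-test log once it finishes
smartctl -H /dev/sdc            # overall health verdict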

Here’s the test results: http://paste.opensuse.org/98968913

I find it really hard to believe that 3 drives would just come up with read errors without some sort of warning; I check /proc/mdstat weekly to make sure there are no issues. So now I’m not sure what to do… is there a way I can repair around the read errors (fsck somehow?) to at least get this running again? Ultimately it looks like I will need to replace the drives.

As always, help is greatly appreciated!

Hi
You should be using mdadm, and running fsck against the assembled RAID array (e.g. /dev/md0), not against the individual member partitions.
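Something along these lines once the array is assembled (md0 is just an example name; substitute your data array, and keep it unmounted while fsck runs):


fsck -n /dev/md0    # read-only check first, nothing is modified
fsck /dev/md0       # actual repair pass if the dry run looks sane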

I’m having problems with mdadm activating the RAID… booting from the USB key into Rescue mode, I see the following entry in /proc/mdstat:


md127 : inactive sde1[2](S) sdc1[0](S) sdd1[1](S) sdf1[4](S)
             7814053344 blocks super 1.0

I’ll see if I can force it to become active.

Here’s where I’m at now. I had to stop md127 as noted above (mdadm --stop /dev/md127), then ran the following:


Rescue#~ # mdadm --assemble --force /dev/md3 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
mdadm: forcing event count in /dev/sdf1(3) from 35056 up to 35070
mdadm: clearing FAULTY flag for device 3 in /dev/md3 for /dev/sdf1
mdadm: Marking array /dev/md3 as 'clean'
mdadm: /dev/md3 has been started with 3 drives (out of 4).

Now looking at the mdstat:


md3 : active raid5 sdc1[0] sdf1[4] sdd1[1]
      5860539648  blocks super 1.0 level 5, 128k chunk, algorithm 2 [4/3] [UU_U]
      bitmap: 13/15 pages [52KB], 65536 chunk

I’m guessing all I need to do now is run fsck on this RAID. Afterwards I can try to re-add the failed sde drive to the array and see if it rebuilds. Any thoughts on this?
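The rough plan in commands, in case anyone spots a problem with it (device names as above):


fsck /dev/md3                   # check/repair the filesystem on the degraded array
mdadm --add /dev/md3 /dev/sde1  # put the dropped disk back in
cat /proc/mdstat                # watch the resync progress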

fsck repaired the RAID filesystem. I can mount it and successfully read the files, which is fantastic news! However, when I add /dev/sde1 back to the array, the array tries to rebuild but then crashes back to an inactive state (it then shows only 2 drives in the array). I’m surprised it behaves this way instead of just marking the drive as failed.

So it seems as though I have at least 1 drive dead, and 1 more on its way out. I’m not sure whether I should try to reboot and start the OS again now?
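In the meantime, this is roughly how I’m trying to work out which drives are actually throwing the errors (sde shown, same again for sdf; output will obviously vary per drive):


dmesg | grep -i -e ata -e md3                              # kernel errors from the failed rebuild
smartctl -H /dev/sde                                       # overall health verdict
smartctl -A /dev/sde | grep -i -e reallocated -e pending   # bad-sector counters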

If I may add, best to make a backup NOW before doing anything.
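For example, mount the array read-only and copy everything off to a separate disk first (the paths here are only placeholders):


mount -o ro /dev/md3 /mnt/data
rsync -aHv /mnt/data/ /path/to/backup/disk/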
