System won't boot with degraded Fake RAID 5

Hello,
I’m no expert with Linux so your patience is appreciated…

I have OpenSUSE 12.2 (headless, text mode only) installed on a Fake RAID 5 (4 drives). Last night one of the drives failed. From my past experiences with RAID arrays I assumed that if 1 drive fails the array keeps working and so does the OS. Not the case here, though. First the OS started to behave strangely (terminal freezes, virtual machines not responding and unable to DESTROY them). I rebooted the box but it no longer boots up. At this stage I assume my data is intact and that if I rebuild the array I’ll get the system to work again, but is there a RAID 5 configuration on OpenSUSE that works while degraded? I was reading about software RAID and I’d even prefer that and use RAID 6, but after the installation the OS wouldn’t boot from it. OpenSUSE 12.3 would not install at all with any RAID - it would not boot from software RAID after install and would not install on my Fake RAID (apparently 2.7 TB is not enough for it, or at least that’s the error I was getting).

In terms of booting up now, it spits out a bunch of errors and hangs on “recovering journal” forever.

Any help appreciated.

esiu69 wrote:
> Hello,
> I’m no expert with Linux so your patience is appreciated…
>
> I have OpenSUSE 12.2 (headless, text mode only) installed on a Fake
> RAID 5 (4 drives). Last night one of the drives failed. From my past
> experiences with RAID arrays I assumed that if 1 drive fails the array
> keeps working and so does the OS.

Yes, that’s a fair assumption.

> Not the case here though. First the
> OS started to behave strangely (terminal freezes, virtual machines not
> responding and unable to DESTROY them). I rebooted the box but it no
> longer boots up.

That makes it sound like there is some other problem. Possibly another
disk has failed or perhaps the problem is with the cables, controller or
motherboard instead. Or maybe it isn’t really a RAID5.

> At this stage I assume my data is intact and if I
> rebuild the array I’ll get the system to work again, but is there a RAID
> 5 configuration on OpenSUSE that works while degraded? I was reading
> about the software RAID and I’d even prefer that and use RAID 6, but
> after the installation the OS wouldn’t boot from it. OpenSUSE 12.3
> would not install at all with any RAID - would not boot on software raid
> after install and would not install on my Fake RAID (apparently 2.7 TB
> is not enough for it, or at least that’s the error I was getting).

I know nothing at all about using Fake RAID and what is possible or not
with it. I’m not sure whether you are reporting a previous attempt to
install 12.3 or you are saying that you’re trying to install it now. If
the latter, STOP! Sort out your problems first.

> In terms of booting up now, it spits out a bunch of errors and hangs on
> “recovering journal” forever.

Please post the errors. Take a photograph of the screen and post that if
necessary. Do check that it’s legible!

Please also post details of your hardware configuration.

> Any help appreciated.

I would suggest booting your machine from a DVD or CD and then we can
help you use the command-line tools to investigate what is happening.
Others will probably have a better idea of which disc to use - personally
I would use Knoppix, but I expect openSUSE also has something suitable.
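
For example, once you’re booted from the live system, something like the following would be a starting point (run as root; these are just generic commands, adjust the device names to what you actually have):

cat /proc/mdstat              # shows any RAID arrays the kernel has assembled
mdadm --examine /dev/sda      # prints any RAID metadata found on that disk
dmesg | less                  # kernel messages, including disk and RAID errors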

Cheers, Dave

Hi Dave,
Thanks for looking at my query.

It is RAID 5 set up in BIOS on 4x1TB Seagate drives.

I don’t think another disk failed. The one that failed is no longer detected by BIOS but the system boots up a little bit. If 2 RAID5 drives were dead, there would be nothing coming up as 1/3 of the actual data would be gone. And the RAID status at boot-up would be FAILED not DEGRADED.

That was a previous attempt at installing 12.3. I went to 12.2 because 12.3 didn’t install in any RAID configuration.

My hardware is:
Processor Intel Core i7-3770K (4x 3.5 GHz)
Ram 16GB (2x8GB) Corsair CML16GX3M2A1600C10
Motherboard Asus P8H77-I
No dedicated graphics
ACE 500W PSU
Hard drives (before the fault):
1x3TB Seagate ST3000DM001 (port 0)
1x1TB Seagate ST31000524AS (port 1) (spare for RAID)
4x1TB Seagate ST31000528AS (ports 2-5) in RAID5 (Intel)

As an update, I replaced the dead hard drive from port 2 with the spare from port 1 (added the new hard drive to the array in the BIOS) and the system went bananas. I will take a pic of the screen when I get home. It says something about a recursive fault, kernel panic and CPU deadlock. I think it may be caused by all my drives changing names (what was hdd became hdc, hde->hdd, hdf->hde), but I’m not sure about that.

I have a Knoppix Live USB drive and I’ll be able to retrieve my data (if any survived). I’m pretty much only after my config files and my virtual machine storage & config; I have backups of everything else. With those I can rebuild the box in a few hours. It didn’t go to “production” as I wasn’t done setting it up yet.

What I’m really after is a reliable setup for the future. In the Windows world, RAID5 set up in the BIOS always did for me what I wanted from it, e.g. it protected me from a drive failure. Windows even displayed a little cloud tip in the notification area saying that the volume was degraded, and sent me an email. That’s not the case with OpenSUSE though. One drive fails and the box is dead. I replace the drive and the box is still dead. That’s below my expectations regarding RAID5…

I see that a Fake RAID5 on OpenSUSE is pretty much a no-go. Therefore, if anyone can advise on how to set up a software RAID 6 on 5 drives meeting the requirements below, that would be great:

  • the system boots from RAID 6
  • if any 2 drives fail, the OS still boots up and the data remains unaffected (a notification email would be nice)
  • unplugging the drive on port 0 doesn’t cause issues

I haven’t really exhausted all the options. A few things I can think of are:

  1. Leave only the drives I want for RAID during the OS install. Would that make a difference during the first OS boot after install?
  2. Create a small software mirror on all 5 drives for the boot partition (rough sketch of what I mean below). That’s still RAID though, so I don’t think it’ll make a difference?
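
For option 2, from what I’ve read it would be something roughly like this (untested, just my understanding of it; the partition layout is made up):

mdadm --create /dev/md0 --level=1 --raid-devices=5 --metadata=1.0 /dev/sd[a-e]1   # small /boot mirror across all 5 drives
mdadm --create /dev/md1 --level=6 --raid-devices=5 /dev/sd[a-e]2                  # RAID6 over the remaining space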

> I will take a pic of the screen when I get home.

Yes, please. That will give us at least some starting point.

> Windows even displayed a little cloud tip in the notification area

Oh, even when you run it headless, text mode only?

> and sent me an email.

I do not know how your array was set up in Linux, but if it was using Linux MD there is “mdadm --monitor”. There may also be specialized software from the vendor - did you check?
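
As a rough sketch (the address is only a placeholder, and the monitor has to be started at boot to be useful):

MAILADDR admin@example.com            # in /etc/mdadm.conf - where alert mails go
mdadm --monitor --scan --daemonise    # watches all arrays, mails on Fail/DegradedArray events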

> One drive fails and the box is dead.

For the comparison to be fair you would need to test the other operating system under exactly the same conditions.

esiu69 wrote:
> What I’m really after is a reliable setup for the future. In the
> Windows world, RAID5 set up in BIOS always did for me what I wanted from
> it, e.g. it protected me from a drive failure. Windows even displayed a
> little cloud tip in the notification area that the volume is degraded
> and sent me an email. That’s not the case with OpenSUSE though. One
> drive fails and the box is dead. I replace the drive and the box is
> still dead. That’s below my expectations regarding RAID5…

As I said, there’s some other problem. Linux will happily cope with
degraded RAIDs using either hardware RAID controllers or software
(mdadm) RAID. However, I have no idea about fakeraid.

> I see that a Fake RAID5 on OpenSUSE is pretty much a no-go. Therefore,
> if anyone can advise how to set up a software RAID 6 on 5 drives meeting
> the below requirements it’d be great:
> - the system boots from RAID 6
> - if any 2 drives fail, the OS still boots up and the data remains
> unaffected (a notification email would be nice)
> - unplugging the drive on port 0 doesn’t cause issues
>
> I haven’t really exhausted all the options. Few things I can think of
> are:
> 1. Leave only the drives I want for RAID during the OS install. Would
> that make a difference during the first OS boot after install?
> 2. Create a small software mirror on all 5 drives for the boot
> partition. That’s still RAID though so I don’t think it’ll make a
> difference?

Personally, I wouldn’t use RAID6 - read up about its advantages and
disadvantages before you do so. I use software RAID 10. I also never try
to boot from RAID. I have a single boot+system disk and I have a backup
boot+system disk. Then all the data is held on a RAID. The reason I do
that is that I’ve found from experience that I can recover from problems
faster using that technique. But your requirements may be different.
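
Just to give an idea of the mdadm side, creating a software RAID 10 data array is basically a one-liner - something like this, where the device names are only placeholders for whatever disks you dedicate to data:

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]1   # four partitions into one RAID 10
mkfs.ext4 /dev/md0                                                  # then create a filesystem on it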

@arvidjaar

Don’t expect too much of the photo, I don’t think you’ll see that much. About 3/4 of the screen is a binary dump of something. Unless you can make sense of it…

Fair point about the headless :). The main reason why I moved to Linux is that I don’t want some pointless GUI to eat up my modest resources… But I have to make it work. And I’ll be testing by disconnecting disks until I get it right.

There are no Linux drivers for the board from Asus. Possibly there are from Intel for the H77 chip, or from another vendor with the same chip. Linux didn’t scream for them so I didn’t bother to look that deep. I’m pretty sure Linux copes with the array using MD as there are some mdadm commands on the screen.

I did test the same setup with WinXP x64 before. The array failed multiple times due to a bad Molex plug on the PSU before I figured out what the problem was. No data loss or system instability ever occurred. It’s true, though, that Windows has dedicated drivers for the H77 RAID.

@djh-novell

Fake RAID is basically the Intel chip-managed RAID that you set up in the BIOS. It’s supposed to be hardware but it’s not, hence the “fake”. The H77 chip (in my case) provides some RAID capability that is pretty much software RAID, but the chip provides the boot-up capability. I’m sure Linux can deal with it, but obviously I have done something wrong since it didn’t, so I’m asking for help in setting it up right this time. I honestly don’t think there’s anything else wrong with the hardware.

For RAID 10 you need an even number of hard drives, plus you only get 50% of your disk space back, plus some performance improvement. The performance of my server already reaches the maximum LAN bandwidth (1 Gbps) so no further performance is required. I could use the extra space though. I have 5 drives for storage plus one for backup. If a drive dies, I’m uncomfortable running the system without data protection until a replacement arrives. So I can run these 5 drives in 3 possible modes (capacity arithmetic below):

  1. Raid 10 (2+2) + spare - I get 2TB
  2. Raid 5 (4) + spare - I get 3TB
  3. Raid 6 (5) - I get 3TB plus I don’t have to worry about swapping the drives and rebuilding the array as soon as one drive dies (my BIOS doesn’t give me a “hot spare” option and I have to do it manually).
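
(With 1TB disks the arithmetic is: RAID 10 usable = 4/2 x 1TB = 2TB; RAID 5 = (4 - 1) x 1TB = 3TB; RAID 6 = (5 - 2) x 1TB = 3TB.)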

> Don’t expect too much of the photo, I don’t think you’ll see that much.

It may still give some hints. Upload it to susepaste.org.

OK, here it is (or they are).

http://emsoft.com.pl/1.jpg is what I was left with after the reboot.

http://emsoft.com.pl/2.jpg is what happened when I replaced the hard drive.

Funny thing: I can go back and forth between the two by connecting/disconnecting the spare drive.

I disconnected the backup drive (sda) because if that one goes, everything goes…

Knoppix 7.0 LiveCD doesn’t see the RAID array. It only sees the 4 drives and can’t determine the file system. It does see the BOOT and RAID flags though. The BIOS sees it fine in either Degraded or Rebuild state (depending on whether I plug in the spare drive or not)… I have a feeling that if I rebuilt the array in Windows all would go back to normal… For now I have no access to the data. Any ideas? :smiley:

No, it does not. What you see on the screen are error messages saying that mdadm was not found.

Your first picture actually looks pretty normal. I presume the system just stops at this point (did you wait long enough? Was there any disk activity?), which matches your description that the system hung during operation. The second picture is a kernel bug.

I do not know to what extent Knoppix performs auto-detection; try “mdadm --examine --scan --config=partitions --verbose” to see whether it finds anything. Also run “dmraid -r” to compare with what dmraid sees. Post the output here.

Yes, so I assumed an attempt was made to use it…

> Your first picture actually looks pretty normal. I presume the system just stops at this point (did you wait long enough? Was there any disk activity?), which matches your description that the system hung during operation.

Yes, I left it like this and went to work. No disk activity. After a while (1 hour+) things started to disappear from the screen - some letters or full lines of text, or sometimes a few letters from a line would stay behind. No disk activity. Eventually the screen was blank (once everything had disappeared). This prompted me to kick off memtest for tonight, just in case…

> I do not know to what extent Knoppix performs auto-detection; try “mdadm --examine --scan --config=partitions --verbose” to see whether it finds anything. Also run “dmraid -r” to compare with what dmraid sees. Post the output here.

Will do tomorrow, after my memtest finishes.

Thanks for your time.

root@Microknoppix:/home/knoppix# mdadm --examine --scan --config=partitions --verbose 
ARRAY /dev/md/0 level=raid6 metadata=1.0 num-devices=5 UUID=b516a27d:93047721:42d2f61d:695fcf18 name=192.168.10.99:0
    devices=/dev/sdc1,/dev/sdb1,/dev/sda1 
ARRAY metadata=imsm UUID=1376fef0:bf011589:ccad5658:c3e18e2c
    devices=/dev/sdd,/dev/sdc,/dev/sdb,/dev/sda 
ARRAY /dev/md/R5V1 container=1376fef0:bf011589:ccad5658:c3e18e2c member=0 UUID=f3de1ed4:f7f28c69:488d1945:2e3d546f

> Also run “dmraid -r” to compare with what dmraid sees. Post the output here.

root@Microknoppix:/home/knoppix# dmraid -r 
/dev/sdd: isw, "isw_cjfajdhaj", GROUP, ok, 1953525166 sectors, data@ 0 
/dev/sdc: isw, "isw_cjfajdhaj", GROUP, ok, 1953525166 sectors, data@ 0 
/dev/sdb: isw, "isw_cjfajdhaj", GROUP, ok, 1953525166 sectors, data@ 0 
/dev/sda: isw, "isw_cjfajdhaj", GROUP, ok, 1953525166 sectors, data@ 0

Any ideas? :smiley:

Funny thing, that first ARRAY line (the raid6 one). It’s a leftover from my RAID6 attempt with 12.3. I wonder if it has anything to do with the crash. I also wonder how it managed to survive in there somewhere…

Here’s what I did before: I created a software RAID6 during the OpenSUSE 12.3 install with 5 drives. Some files were copied etc., but the system failed to boot from the RAID. Not giving it much thought, I went into the BIOS and set up RAID5 with 4 drives. I named the volume “R5V1” - I guess the other 2 lines refer to that one.

I do not really like the “ok” returned by dmraid. It suggests that all 4 devices are marked as clean, which is wrong according to your description. Is this with the spare disk or the original disk?

At this point I would remove both the failed and the spare disk and try to activate the degraded array with three disks using mdadm:

mdadm --assemble --scan --config=partitions --uuid=1376fef0:bf011589:ccad5658:c3e18e2c
mdadm --assemble --scan --config=partitions --uuid=f3de1ed4:f7f28c69:488d1945:2e3d546f

This may give you access to your data. We can think about what to do next if it is successful.
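
If the second command assembles something, check its state and mount it read-only first, for example (the volume and partition names here are only a guess, use whatever actually appears under /dev/md):

cat /proc/mdstat                  # should now list R5V1 as a degraded array
mdadm --detail /dev/md/R5V1       # shows member disks and which slot is missing
mount -o ro /dev/md/R5V1p1 /mnt   # read-only mount; the partition number is a guess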

The original disk is no longer detected by the BIOS or the OS, so it landed in the bin. The spare is now connected and included in the array in the BIOS. The array status in the BIOS is Rebuild. Normally what happens now is that the array rebuilds automatically after Windows starts (via the Intel Matrix Storage Manager app, which starts automatically). I found instructions on how to rebuild it on Linux here: Techie.org and I was wondering whether to give that a go? If you think it’s a bad idea, I can try your commands instead, but you’re saying to disconnect the spare disk first, right?

Those instructions presume that dmraid sees a disk as broken. Your output shows all 4 disks as ok. Could you show the output of “dmraid -s”?

Right, I think it’s all gone now. I shut down the system and disconnected the spare drive. When I brought the system up, the BIOS no longer saw a RAID volume for some reason. I shut it down and reconnected the drive, but that made no difference. So I let the system start up and, forgetting that I had all 4 drives connected, I ran

root@Microknoppix:/home/knoppix# mdadm --assemble --scan --config=partitions --uuid=1376fef0:bf011589:ccad5658:c3e18e2c
mdadm: Container /dev/md/imsm0 has been assembled with 4 drives
root@Microknoppix:/home/knoppix# mdadm --assemble --scan --config=partitions --uuid=f3de1ed4:f7f28c69:488d1945:2e3d546f
mdadm: No arrays found in config file or automatically

Anyway, the results from the stuff I ran earlier are different now.

root@Microknoppix:/home/knoppix# mdadm --examine --scan --config=partitions --verbose
ARRAY metadata=imsm UUID=1376fef0:bf011589:ccad5658:c3e18e2c
   devices=/dev/sdd,/dev/sdc,/dev/sdb,/dev/sda
ARRAY /dev/md/R5V1 container=1376fef0:bf011589:ccad5658:c3e18e2c member=0 UUID=f3de1ed4:f7f28c69:488d1945:2e3d546f

root@Microknoppix:/home/knoppix# dmraid -r
ERROR: isw: Could not find disk /dev/sdd in the metadata
/dev/sdc: isw, "isw_cjfajdhaj", GROUP, ok, 1953525166 sectors, data@ 0
/dev/sdb: isw, "isw_cjfajdhaj", GROUP, ok, 1953525166 sectors, data@ 0
/dev/sda: isw, "isw_cjfajdhaj", GROUP, ok, 1953525166 sectors, data@ 0
root@Microknoppix:/home/knoppix# dmraid -s
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: wrong number of devices in RAID set "isw_cjfajdhaj_R5V1" [3/4] on /dev/sda
ERROR: isw: wrong number of devices in RAID set "isw_cjfajdhaj_R5V1" [3/4] on /dev/sdb
ERROR: isw: wrong number of devices in RAID set "isw_cjfajdhaj_R5V1" [3/4] on /dev/sdc
*** Group superset isw_cjfajdhaj
--> *Inconsistent* Subset
name   : isw_cjfajdhaj_R5V1
size   : 3907041280
stride : 256
type   : raid5_la
status : inconsistent
subsets: 0
devs   : 3
spares : 0

I’m not sure why it’s complaining about disk 5 as disk 2 was the one replaced (it’s 1 now).
When I ran dmraid -s without the spare drive the status was “failed”.

BTW, I did run dmraid -s before I shut the system down with 4 disks saying OK. The status was “no sync”. So thinking about it now, I may have been halfway through those instructions. He used “dmraid -R” to add another disk to the volume, while I did that in the BIOS (or actually the Intel RAID Option ROM). So maybe just kicking off the rebuild somehow would have been sufficient?

Anyway, that doesn’t matter now. Do you think there is any chance of retrieving my configs, or should I just scrap the whole thing and start over?

I guess at this point it really makes sense to stop experimenting and try some data recovery software. I have never used any, so I cannot really recommend one; hopefully others may have helpful suggestions.

Well, I think the data is still intact. I’m unable to mount the newly created container though, as I get a “can’t read superblock” error. I’ve been reading about people recreating the superblock on a degraded array, for example here:
RAID Superblock disappeared from one of my disks after crash. Fix?
[SOLVED] Problem with software raid! inactive array, won’t mount - http://www.linuxquestions.org/questions/linux-software-2/problem-with-software-raid-inactive-array-wont-mount-944762/
linux - How to get an inactive RAID device working again? - Super User

For some reason, what assembles from my 4 drives is called /dev/md/imsm0 or ./imsm. What I want is R5V1. I tried experimenting but all I got was this:

root@Microknoppix:/home/knoppix# mdadm --stop /dev/md/imsm0
mdadm: stopped /dev/md/imsm0
root@Microknoppix:/home/knoppix# mdadm --assemble --scan -fv
mdadm: looking for devices for further assembly
mdadm: no RAID superblock on /dev/sde1
mdadm: no RAID superblock on /dev/sde
mdadm: no RAID superblock on /dev/zram0
mdadm: no RAID superblock on /dev/cloop
mdadm: /dev/sdd is identified as a member of /dev/md/imsm, slot -1.
mdadm: /dev/sdc is identified as a member of /dev/md/imsm, slot -1.
mdadm: /dev/sdb is identified as a member of /dev/md/imsm, slot -1.
mdadm: /dev/sda is identified as a member of /dev/md/imsm, slot -1.
mdadm: Marking array /dev/md/imsm as 'clean'
mdadm: added /dev/sdc to /dev/md/imsm as -1
mdadm: added /dev/sdb to /dev/md/imsm as -1
mdadm: added /dev/sda to /dev/md/imsm as -1
mdadm: added /dev/sdd to /dev/md/imsm as -1
mdadm: Container /dev/md/imsm has been assembled with 4 drives
mdadm: looking for devices for /dev/md/R5V1
mdadm: Cannot assemble mbr metadata on /dev/sde1
mdadm: Cannot assemble mbr metadata on /dev/sde
mdadm: no recogniseable superblock on /dev/zram0
mdadm: /dev/sdd has wrong uuid.
mdadm: /dev/sdc has wrong uuid.
mdadm: /dev/sdb has wrong uuid.
mdadm: /dev/sda has wrong uuid.
mdadm: no recogniseable superblock on /dev/cloop
mdadm: looking for devices for /dev/md/R5V1
mdadm: Cannot assemble mbr metadata on /dev/sde1
mdadm: Cannot assemble mbr metadata on /dev/sde
mdadm: no recogniseable superblock on /dev/zram0
mdadm: /dev/sdd has wrong uuid.
mdadm: /dev/sdc has wrong uuid.
mdadm: /dev/sdb has wrong uuid.
mdadm: /dev/sda has wrong uuid.
mdadm: no recogniseable superblock on /dev/cloop

And again, my R5V1 didn’t get assembled while the other thing did.

I’m not desperate enough to try data recovery software. I’m willing to play with the system until the replacement hard drive arrives, which should be around Friday hopefully. After that it’s a full rebuild. Unless someone can recommend something to rebuild the superblock and get my data out straight away…

What I want is to assemble drives a, b, c and d into a degraded R5V1 with drive a marked for rebuild. Reading the linked threads makes me think it is possible but my lack of skills is in the way :).

If someone wants to give it a go I’m willing to give you the terminal access to the box :smiley:

That is correct - /dev/md/imsm0 is the top-level container.

> What I want is R5V1

This is (was?) located inside imsm0. Now the metadata for it cannot be found, so it cannot be assembled.

> Reading the linked threads makes me think it is possible

All those threads are about pure software Linux MD. That is easy (for some definition of “easy”). Here we are facing a big black box - your BIOS. We do not know what your BIOS does, and we cannot bypass it.

It may be possible to re-create the array inside the container, but any additional move will double the probability of data loss. And it should be done by someone with data recovery experience.
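
For completeness, re-creating an IMSM volume with mdadm looks roughly like the lines below. This is purely illustrative - the disk order, size and layout must match the original exactly, and nobody should run anything like this without full images of all member disks:

mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/sd[a-d]   # re-create the container
mdadm --create /dev/md/R5V1 --level=5 --raid-devices=4 /dev/md/imsm0         # re-create the RAID5 volume inside it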