Strange SuSE 11 boot problem: / is on raid1

I have two SATA disks, and I installed SuSE 11 with the root partition on RAID1.
The setup has three RAID1 arrays, /dev/md0 through /dev/md2, with /boot mounted on md0, / on md1 and my /data on md2. Each md device uses ReiserFS.
I have installed SuSE 9.x and 10.x on at least 15 production machines, so I installed SuSE 11 here the way I usually do, and it went fine. I could reboot fine; I configured GRUB to fall back to the second disk, then tested booting with the first disk unplugged and with the second disk unplugged, and everything worked. Then I rebooted the machine a couple of times with both disks, and everything worked.

After 3 days I rebooted the machine again. GRUB loaded the kernel, but the kernel could not assemble md1: there was a message like "md1 is unclean, starting background reconstruction", and then something about a bad bitmap file in md1 with error -5. Then it dropped me to the sh# prompt, and /proc/mdstat showed no active RAID, obviously.
I booted with a CD in the rescue mode, and looked at /proc/mdstat:
md0 and md2 were assembled and running on both devices, but md1 had no active devices. I mounted md0 and md2, checked the files, and everything was intact. Then I tried to assemble md1 with mdadm, but it would either give me “device busy” when I tried mdadm -a /dev/md1 /dev/sdX, or “I/O error” when I created /etc/mdadm.conf, listed the partitions in there, appended the output of mdadm --examine --scan >> /etc/mdadm.conf, and then ran mdadm --assemble --scan /dev/md1.
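For anyone who wants to retrace the steps, the attempts boil down to roughly this (sda1/sdb1 being the md1 members on my box; treat the device names as examples):

cat /proc/mdstat                             # md0 and md2 up, md1 inactive
mdadm -a /dev/md1 /dev/sda1                  # -> "device busy"
echo 'DEVICE partitions' > /etc/mdadm.conf   # no editor needed for the config file
mdadm --examine --scan >> /etc/mdadm.conf
mdadm --assemble --scan /dev/md1             # -> "I/O error"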

So then I reformatted the md1 partitions, reinstalled SuSE on md1, did the same testing and reboots, and everything went fine, but after 3 days I got the exact same problem.
This time, I recreated md1 in rescue mode without formatting the partitions (mdadm -C /dev/md1 --level=raid1 --raid-devices=2 /dev/sda1 /dev/sdb1), and it rebooted fine after that with all data on md1 intact.
I checked dmesg, smartctl -a /dev/sdX, and the IPMI logs for hardware errors; there was nothing bad that I could see.
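In case it helps anyone reading later: as far as I understand it, re-creating a RAID1 with the same level, the same number of members, the same member order and the same default metadata version only rewrites the md superblocks, so the filesystem underneath is left alone; that is why the trick kept my data. What I did in rescue mode was roughly:

mdadm --stop /dev/md1        # make sure nothing is holding the member devices
# mdadm warns that the members already contain an array; answering yes keeps the data
mdadm -C /dev/md1 --level=raid1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --detail /dev/md1      # both members should show up as active
mount -o ro /dev/md1 /mnt    # check the data read-only before rebooting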

Do you guys have any advice on this? At this point I suspect the 2.6.25 kernel, and I am about to revert to SuSE 10.3.

I have had several issues with 11, including this one: Seagate ST3146855LW and installation - openSUSE Forums

I also had it lose my BIOS RAID: an Adaptec 29320 U320 with two Seagate 300 GB SCSI drives in RAID1. After much frustration I went back to 10.3 and everything appears to be peachy again. BIOS RAID and the MD RAID you are using are a little different, but losing the RAID is the same result.

How 10.3 and 11 could be so different in their handling of SCSI and RAID floors me.

Hi,

I experienced exactly the same thing with a RAID5 root partition consisting of 4 x 500 GB SATA disks. RAID1 seems to be more stable (the boot partition), but RAID5 gives exactly the same error messages as yours:

md2: bitmap file is out of date (2682 < 2683) -- forcing full recovery
md2: bitmap file is out of date, doing full recovery
md2: bitmap initialisation failed: -5
md2: failed to create bitmap (-5)
md: pers->run() failed ...

invalid root filesystem -- exiting to /bin/sh

Maybe I should try your trick with the re-creation, but don't I lose all the data then?

Hi,

Exactly the same here. Two disks in 3 raid1 partitions:

  • /dev/md0 consists of /dev/sda1 and /dev/sdb1, mounted on /boot with an ext3 filesystem
  • /dev/md1 consists of /dev/sda2 and /dev/sdb2, used as swap
  • /dev/md2 consists of /dev/sda3 and /dev/sdb3, mounted on / with ext3

In rescue mode md0 and md1 run fine, but md2 cannot be started. Looking at the details of /dev/md2 with mdadm doesn't show any errors at all.
fsck on /dev/md2 refuses to start; fsck on md0 goes perfectly, and on md1 it doesn't run because the filesystem is swap.

It was running fine for about two weeks, and then, after a normal reboot with no errors beforehand, it gave the same errors as you two.

I played around with mdadm but I can't seem to find the problem either. It ends with an invalid root filesystem because md2 can't start. However, if you mount either of the partitions that md2 is built from, you can see that all the data is there and seems to be perfectly intact (so the problem must be mdadm and/or the kernel??). I think running fsck on /dev/sda3 (or sdb3) isn't a wise thing to do, so I haven't tried that. As a last (long-shot) possibility I tried to break the mirror and boot first with only sda and then with only sdb, but, of course, that didn't work.
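For anyone who wants to check their data the same way: assuming the usual 0.90 metadata (superblock at the end of the partition), a RAID1 member starts with a plain filesystem, so you can mount one half read-only and have a look (sda3 is one of my md2 members):

mkdir -p /mnt/check
mount -o ro,noload /dev/sda3 /mnt/check   # ro plus noload so the ext3 journal is not replayed onto one half of the mirror
ls /mnt/check                             # the data should all still be there
umount /mnt/check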

I tried to do alecm3's trick, but (stupid me…) I can't find an editor in the rescue mode, and just running mdadm --examine --scan > /etc/mdadm.conf will probably not do the trick.
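For completeness, building the config file itself doesn't need an editor; plain shell redirection from the rescue prompt should be enough (whether the assemble afterwards succeeds is another matter):

echo 'DEVICE partitions' > /etc/mdadm.conf
mdadm --examine --scan >> /etc/mdadm.conf
cat /etc/mdadm.conf        # the ARRAY lines should match your three arrays
mdadm --assemble --scan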

Has anybody tackled this problem yet? Tarring the data off to a different disk, rebuilding the same md setup with a fresh install, and then tarring all the data back onto the RAID doesn't sound very appealing either, because chances are this problem will recur every few days or weeks.

Thanks in advance,
Marco

Sorry for bringing up this old thread, but I just got hit by the same problem. I tried rescue mode and manually playing with mdadm, without results.
In the end, I booted the openSUSE 10.3 CD. When I looked at /proc/mdstat, it had already rebuilt one array (which by now was happily mounting) and had started rebuilding the other one. So try booting 10.3, at least until somebody dedicated enough nukes this nasty (probably kernel) bug.
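If you go this route, you can watch the rebuild from the 10.3 system with:

cat /proc/mdstat          # shows the resync/recovery progress for each array
mdadm --detail /dev/md1   # replace md1 with whichever array is rebuilding; State should end up clean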

Hi,
The funny thing is that when I posted my message, I had another system running the same RAID config. That one never had problems…
That system has a DFI mainboard, and the RAID array is still functioning perfectly, even after I updated it to 11.1 yesterday.

So… not many people are posting to this thread, so probably not many people run into this problem. The system I had the problem with has an Asus A7V600 mainboard. Are you maybe using the same mainboard (or the same or almost the same chipset)?

Bye,
Marco

I had the same problem(s) with 11.1: random reboots for a while, and then I lost the entire RAID array. I rebuilt and restored the array from backups, but the random reboots continued. I finally ran YaST Online Update. We've been up for over two months now!

It seems that a good number of people who try to use software RAID run into problems. My RAID5 array has just given up the ghost; nothing strange happened as far as I can remember, no power loss, no system crash. Now everything I try ends in the familiar

md0: failed to create bitmap (-5)
mdadm: failed to RUN_ARRAY /dev/md/0: Input/output error

All the disks appear healthy; at least they're visible to the BIOS and show up correctly if I boot with a live CD, but something is properly broken. Some help right now would be just magic.
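Based on the earlier posts, the next things I'm going to try are assembling the array while ignoring the stale bitmap (this needs an mdadm new enough to know --update=no-bitmap, which I'm not sure mine is) and, only as a last resort, the re-creation trick described above. The sdX1 names below are placeholders, not my actual members:

mdadm --examine /dev/sda1 /dev/sdb1 /dev/sdc1   # placeholder names: note level, chunk size and device order
mdadm --stop /dev/md0
mdadm --assemble --force --update=no-bitmap /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
# If that still fails, re-creating with exactly the same level, device order and chunk size
# is what worked for the RAID1 cases above, but with RAID5 a wrong order or chunk size
# destroys the data, so keep the --examine output somewhere safe before trying anything.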