High(er) availability options

bujdi · January 23, 2024, 2:27pm

Hello!

I am writing this question in the hope that someone can share pointers to documentation, a best-practices how-to, or any other resource that could help me achieve the following:

In the not so distant past one could install SuSE Linux, and later openSUSE, with software RAID (mdraid1) where everything resided on software RAID: the /boot partition was separate, the system was on another mdraid1 with LVM on top, and, as the saying goes, Bob’s your uncle. With this setup the system could boot unattended even with one failed disk. (Yes, there are footguns here, but with proper alerting a failure would not go undetected…)
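
As an aside on the "proper alerting" mentioned above, here is a minimal sketch of such a check. It assumes Python 3 and the usual /proc/mdstat layout, where each array's status line carries an `[expected/active]` member count such as `[2/1]`. The function name `degraded_arrays` and the sample text are made up for illustration; in practice `mdadm --monitor` is the usual tool for this job.

```python
import re

# Illustrative sample in the usual /proc/mdstat layout (not taken from a real system).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0]
      20954112 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report fewer active than expected members."""
    degraded = []
    current = None  # name of the array whose status line we are waiting for
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines contain "[expected/active]", e.g. "[2/1]".
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current is not None:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    # On a real system, read /proc/mdstat instead of the sample text.
    for name in degraded_arrays(SAMPLE_MDSTAT):
        print(f"WARNING: {name} is degraded")
```

Run from cron or a systemd timer, with the output sent to mail or a monitoring system, something like this would flag a failed member before the second disk goes.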

I’ve tried Leap Micro: it cannot boot a degraded btrfs raid1 root unattended. Fine, so I then tried it manually (at the mercy of IPMI working…). After many hours of fiddling with rootflags, rd.retry, and other kernel command line parameters in GRUB and in the dracut shell, I was not able to get it to work.
The thing is, btrfs can be mounted degraded successfully, but dracut waits indefinitely for the missing devices, even with “nofail” in fstab… Maybe I’ve missed something.
I went through the same ordeal (let’s call it testing) with btrfs on top of mdraid1. Still no automatic degraded mdraid boot, and I could not get it working manually either: the mdraid array was always inactive in the dracut shell; it could be activated and mounted by hand, but dracut still waited for the missing devices.
OK, this type of setup is probably out of scope for Leap Micro.

Then I tried the same thing with the latest MicroOS: in no configuration could I boot an mdraid1 or btrfs-raid1 root system…

Then I tried openSUSE Leap 15.5: it could not be booted degraded either, neither automatically nor by any manual method I could come up with myself or find in random internet write-ups… and I haven’t found any official documentation.

OK, I guess this type of high(er) availability (just boot the system remotely until the next day, when I can fetch a hard drive/SSD from the store or go on site, etc…) has fallen out of favor?

In your opinion what are my options?

Keep a complete separate system ready to go just in case?

Use hardware RAID?

Make offline image backups of the system? (I don’t think a consistent online backup of the system is possible without LVM. Even with it, everything is mounted and wired up by disk UUIDs in dracut, so how would I regenerate the initrd in a dd-restored copy of the system?)
Rebuild the whole system from scratch each time? (Even with automation this takes much more effort and time than the old method: shut down the system, remove the bad drive, pop in the new one, mdadm add, resync, and done…)

I hope I’ve just missed the forest for the trees and simply have to RTFM; in that case any links and pointers are highly appreciated.

arvidjaar · January 23, 2024, 2:40pm

No, you do not. That is how it is. btrfs developers point fingers at systemd and systemd developers point fingers at btrfs.

This is expected to work. You need to show the actual logs.

If you can, this is the best option for your requirements.

bujdi · January 23, 2024, 2:53pm

I will do a proper write-up with logs (by tomorrow, I hope). Thank you for the pointer that it is expected to work.