Leap micro 5.5 degraded boot - reporting success

bujdi · March 22, 2024, 2:47pm

I’m posting this to report my success, following up on my earlier post (#4) in the “High(er) availability options” topic. Unfortunately, by the time I had worked out the steps below, that topic had been locked.

My situation: I use Leap Micro as a container host on bare metal. I wanted somewhat better availability than a single-disk install, where a boot disk failure would require a full reinstall. So I installed onto mdraid1 and managed to start the system degraded (in hindsight it is easy, but at the start I could not do it based on the official documentation alone).

Steps:

1. Install Leap Micro onto mdraid1 arrays with the default btrfs root and var filesystems.
2. After the installation completes, install GRUB to the second disk (i.e. transactional-update run grub2-install -v /dev/sdb).
3. After a disk failure, bring up the mdraid arrays in degraded mode from the dracut emergency shell:
   mdadm --run /dev/md126
   mdadm --run /dev/md127
   mdadm --readwrite /dev/md126
   mdadm --readwrite /dev/md127
4. Type exit to boot the system.
5. Once the system has booted, either follow the usual mdraid recovery procedures, or let it remain degraded and regenerate the initrd: transactional-update run dracut -f --regenerate-all ; reboot → at this point the system can autonomously boot with the degraded (or repaired) arrays.
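
The emergency-shell step can also be wrapped in a small helper. This is only a sketch under assumptions: the function names start_degraded and run_cmd and the DRY_RUN variable are made up for illustration, the array names are the ones from my system (check /proc/mdstat for yours), and the commands go through mdadm, the tool that manages md arrays. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch: start md arrays in degraded mode, then switch them to read-write.
# DRY_RUN (hypothetical knob) defaults to 1, so nothing is changed unless
# it is explicitly set to 0.

run_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        echo "$*"
    else
        "$@"
    fi
}

start_degraded() {
    # First pass: start each partially assembled array with the members
    # that are present.
    for md in "$@"; do
        run_cmd mdadm --run "$md"
    done
    # Second pass: arrays started this way may come up read-only, so
    # switch them to read-write.
    for md in "$@"; do
        run_cmd mdadm --readwrite "$md"
    done
}
```

Calling start_degraded /dev/md126 /dev/md127 prints the four commands in the same order as in the steps; with DRY_RUN=0 exported it would execute them instead.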

Thank you everyone for LeapMicro - it is turning out to be a very lean and reliable base OS.

(So far I have not managed to boot btrfs raid1 degraded, which in theory would be preferable because of the self-healing possibility.)

Thanks again, have a nice day!

arvidjaar · March 22, 2024, 6:31pm

The mdadm package includes udev rules and systemd units to do this automatically. Show the full output of

lsinitrd /boot/initrd-$(uname -r)

The output may be long, upload to https://paste.opensuse.org/

bujdi · March 23, 2024, 7:59am

Test system: VirtualBox with 2 virtual SATA drives. Leap Micro 5.5 installed with the installer.

output of lsblk and lsinitrd in the paste: openSUSE Paste

After removing sdb the system boots into the dracut emergency console (this was the same since Leap Micro 5.2 AFAIR).

On the system (and in the output of lsinitrd) I can see mdadm-last-resort.(timer|service). From what I can tell in the emergency shell, these units are not loaded by systemd, so they never run.
I’ve attached the following from the emergency shell after removing sdb and failing to boot:
rdsosreport.txt
contents of the journal
output of systemctl

Thank you in advance!

arvidjaar · March 24, 2024, 8:03am

Those units are normally run via udev rules, and these udev rules are even installed, but dracut explicitly disables those rules in favour of its own logic. I installed Tumbleweed on RAID1 and it works correctly; dracut timeout processing kicks in:

Mar 24 10:39:05 localhost dracut-initqueue[409]: Warning: dracut-initqueue: starting timeout scripts
Mar 24 10:39:05 localhost dracut-initqueue[3548]: mdadm: started array /dev/md/root
Mar 24 10:39:05 localhost kernel: md/raid1:md126: active with 1 out of 2 mirrors
Mar 24 10:39:05 localhost kernel: md126: detected capacity change from 0 to 79603712
Mar 24 10:39:05 localhost dracut-initqueue[3563]: mdadm: started array /dev/md/swap
Mar 24 10:39:05 localhost kernel: md/raid1:md127: active with 1 out of 2 mirrors
Mar 24 10:39:05 localhost kernel: md127: detected capacity change from 0 to 4188160

After that I realized that you have Leap Micro. I have never used it, so hopefully someone else can chime in. But looking at the rdsosreport.txt you provided: it shows a single-device RAID1, not a degraded two-device one, and there are no timeout messages either. So apparently, when you regenerated the initrd after booting in degraded mode, you basically made dracut ignore the missing second disk.

If you want to continue troubleshooting it: repair the RAID1, make sure it has both halves active, show the actual status so that we can verify it, then remove one disk and provide the journalctl -b output (and rdsosreport.txt if generated) at this point.
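
One way to show the array status is to summarise /proc/mdstat. A minimal sketch, assuming the usual mdstat layout in which each array's status line carries a [total/active] device counter; the function name md_status is made up for illustration:

```shell
#!/bin/sh
# Sketch: print "name total active state" for every md array listed in an
# mdstat-format file (defaults to /proc/mdstat).

md_status() {
    awk '
        # Array header line, e.g. "md126 : active raid1 sda2[0] sdb2[1]"
        $2 == ":" && $1 ~ /^md/ { name = $1 }

        # Status line: look for the "[total/active]" counter, e.g. "[2/1]"
        name != "" {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\[[0-9]+\/[0-9]+\]$/) {
                    split(substr($i, 2, length($i) - 2), n, "/")
                    state = (n[2] + 0 < n[1] + 0) ? "degraded" : "ok"
                    print name, n[1], n[2], state
                    name = ""
                    break
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Each output line is the array name, the configured device count, the active device count, and ok or degraded. A healthy two-disk mirror should report 2 2 ok, while an array that counts only one device would report 1 1 ok. The authoritative view is still cat /proc/mdstat and mdadm --detail on each array.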

bujdi · March 24, 2024, 11:38am

Thank you for your effort!

Now I’ve made a clean install of Leap Micro 5.5 on top of mdraid1 (just as above).
I powered off the VM, removed sdb, powered it back on and let the normal boot process run. The system stopped at the emergency prompt. I did not log in, just reset the machine and chose the advanced/recovery mode option from the boot menu. After this the system automatically booted the degraded array. On subsequent reboots the normal boot menu entry also boots the degraded array.
Now, this works well enough for me to stop further investigation (maybe it is working as intended? This was not the case with previous Leap Micro versions, but the past is the past).
Thanks for the comments. I do not want to take up any more of anyone’s time on this matter; it works well enough this way to help immensely in my everyday workloads!

system · April 23, 2024, 11:38am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.