EFI boot from software raid1

I’m planning a Leap 42.3 installation. I’m not new to Linux or openSUSE, but it’s the first time I’m using UEFI firmware.
In my typical partition layout there are two identical disks:

raid1 → /boot
raid1 → encrypted swap
raid1 → LVM → root
raid1 → LVM → home
raid1 → LVM → var
…other data partitions…
With BIOS and MBR, the trick to having a system capable of booting with a failed disk is to have a separate raid1 device for /boot, with grub installed on the MBR of both disks. Recent versions of YaST were able to do this automatically.

Googling around, I understood that the /boot/efi partition, which contains the bootloader on a UEFI system, cannot be a raid1 device; it must be a plain VFAT partition, otherwise the UEFI firmware cannot find and start the bootloader.
So my question now is: how can I build a fully redundant system that is capable of booting with a failed disk, using GPT and pure UEFI? YaST apparently does not have any option to manage a redundant /boot/efi.
I’ve read several articles and posts on this topic without finding a definitive answer. Most of the solutions I found are based on manually copying the /boot/efi partition to an identical empty partition on the secondary disk (dd or cp -a).
These kinds of solutions seem very fragile, because you have to manually resynchronize the “secondary” EFI partition after a grub package update, a kernel upgrade and … (what else ???)

Is there an official openSUSE way to do this? Any experiences?

Thank you very much

You need to have the EFI boot partition outside of any LVM. If you don’t want to redo your partitions, you can still use MBR boot, i.e. legacy: boot the installer in legacy mode and do as you always have.

You must use EFI boot if the boot drive is larger than 2 TB, or if you multi-boot and the other OS boots via EFI, i.e. all OSes must use the same boot method.

Note that you have a strange LVM setup: generally you have one LVM with all other partitions as sub volumes. Not wrong, just strange.

It is almost (but not quite) automatic.

Okay, I don’t use RAID, and I have no way of testing this – well, I suppose that I could remove a disk.

I have two hard drives. I created an EFI partition on each drive.

I am only using the EFI partition on “/dev/sda”. But I occasionally copy the content of that to the EFI partition on “/dev/sdb” (using “rsync”).
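The occasional copy described above could look roughly like the following. This is only a sketch: the function name sync_esp and the mount point /boot/efi2 are made up for illustration, and it uses cp -a where rsync would work just as well. Note that, unlike rsync --delete, it never removes stale files from the second ESP.

```shell
#!/bin/sh
# Sketch: copy the contents of the active ESP onto a second, separately
# mounted ESP. Both paths are supplied by the caller; nothing is assumed
# about which disk they live on.
sync_esp() {
    src=$1    # e.g. /boot/efi  (the ESP that openSUSE maintains)
    dst=$2    # e.g. /boot/efi2 (hypothetical mount point of the spare ESP)
    [ -d "$src" ] && [ -d "$dst" ] || {
        echo "usage: sync_esp SRC_DIR DST_DIR" >&2
        return 1
    }
    # "SRC/." copies the directory contents (including dotfiles) into DST.
    cp -a "$src"/. "$dst"/
}

# Example call (as root, with both ESPs mounted):
#   sync_esp /boot/efi /boot/efi2
```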

Looking at the EFI partition on “/dev/sda”: in the directory “/EFI/boot” (relative to the EFI partition), I see two files. They are “bootx64.efi” and “fallback.efi”. They were actually put there by the openSUSE install (but not if Windows is already using that directory), and they are maintained by openSUSE updates. Those two files have a Jul 22 date, due to a recent update.

The file “bootx64.efi” is actually identical to “shim.efi” that is in the “/EFI/opensuse” directory. And the file “fallback.efi” comes from “/usr/lib64/efi”.

The “bootx64.efi” in that “/EFI/boot” directory is supposed to be called by the firmware as either a boot choice from the BIOS boot menu, or as a fallback if all else fails. And if it finds “fallback.efi” it is supposed to call that. In turn, “fallback.efi” is supposed to look at other directories in the EFI partition, and if it finds a file “boot.csv” it is supposed to use that to recreate boot entries.

I’ve actually seen that work with “/dev/sda”, but I haven’t tested on “/dev/sdb”.

If “/dev/sda” fails, then my assumption is that the firmware (BIOS) should try to boot “/dev/sdb”, which should find that “bootx64.efi” and “fallback.efi” and recover.

Hmm, it does occur to me that the fallback support may only be installed if you install “secure-boot” support.

No, the firmware will not try anything that is not defined in the EFI boot manager configuration (which is managed by efibootmgr). So to have ESP redundancy you also need to add a boot entry that points to (files on) this second ESP.
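As a rough illustration of such an extra boot entry, the helper below only prints an efibootmgr command so it can be reviewed before being run as root. The function name, the label and the example disk/partition are placeholders; the loader path is the one mentioned elsewhere in this thread and would be shim.efi instead if secure boot is in use.

```shell
#!/bin/sh
# Sketch: build (but do not execute) the efibootmgr call that registers a
# loader located on the ESP of a second disk.
esp_boot_entry_cmd() {
    disk=$1      # whole disk holding the second ESP, e.g. /dev/sdb
    part=$2      # partition number of that ESP, e.g. 1
    label=$3     # free-form name shown in the firmware boot menu
    loader=$4    # loader path relative to the ESP, with backslashes
    printf "efibootmgr --create --disk %s --part %s --label \"%s\" --loader '%s'\n" \
        "$disk" "$part" "$label" "$loader"
}

# Example: print the command for a second ESP on /dev/sdb1.
#   esp_boot_entry_cmd /dev/sdb 1 "opensuse-disk2" '\EFI\opensuse\grubx64.efi'
# After running the printed command, "efibootmgr -v" lists the resulting entries.
```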

Otherwise your approach is correct. The way to have redundant EFI boot is to have a second ESP with identical content. To the best of my knowledge nobody has implemented a complete solution so far (once upon a time there was a company that offered this for Windows and Linux on Itanium). I suppose it could be done, as a proof of concept, by adding a custom RPM with triggers on the grub2-efi package.

In principle it is possible to create a RAID1 with metadata at the end of the partition, but a) I am not sure whether grub2 will accept it and b) the ESP is writable by firmware, and firmware is not aware of Linux MD, so you risk getting the two mirror halves out of sync, which will at some point be disastrous for the filesystem.

Do you think that booting in legacy mode does not limit me in any way? (Linux is the only installed system, no Windows.) Are PCI devices with their own UEFI firmware not impacted? Overall performance? Can I add disks larger than 2 TB (non-boot) with GPT on them to the system? (I think yes, the problem with legacy BIOS should only be with the boot disks.)
Regarding my LVM configuration, I described it in too cryptic a way… I have one volume group on top of a raid1 device. Inside this volume group I have several logical volumes.

This is also my conclusion (after a couple of hours of googling): copying the ESP to the second disk is the only practical solution for now, but I’m surprised that no distro has managed this in an automated way.

MBR boot is fine. You can have other drives larger than 2 TB. The 2 TB limit is a basic limit of MBR and FAT partitioning. You can also use GPT partitioning and still boot in MBR (legacy) mode, but there is some limit on where the boot partitions can live. I’ve forgotten the limit; probably under 1 TB.

There are 3 ways to implement RAID: FAKE (BIOS assisted), software, and true hardware. True hardware RAID is transparent to the OS: the controller presents the array to the OS as a single drive. FAKE uses a cheap chip, usually on the motherboard, to assist at boot, but after that it is software RAID under OS control.

I’m referring to Linux software raid (the one you manage with mdadm). True hardware raid of course does not pose any problem, because both the UEFI firmware and the OS see only one EFI partition. I’ve never used fake raid; I think software raid is far superior and less error prone.

It seems like all major distros have ignored the problem until now, and this really surprises me.
I found this interesting thread on the Debian mailing list:

As you can see someone has started to work on a solution but, unfortunately, does not have enough time to complete his job. Another interesting thing is the manual approach he proposes: grub-install + efibootmgr

It would be nice to determine at least the right command line options for grub-install in order to manually manage a double (triple, quadruple …) ESP in a standard Leap 42.3 installation. Would someone like to help?
efibootmgr, on the other hand, is a mystery to me. I’ve never really understood why it is needed. On my machine, if I attach a disk (or USB device or DVD) with an ESP on it, it automatically appears in the list of bootable devices. What am I missing?

Yet, somehow, I am able to boot the openSUSE installer on a USB flash drive, even though that flash drive is not defined in the EFI boot manager configuration.

Yes, strictly speaking fallback \EFI\Boot\bootx64.efi is defined for removable drives, so this is normal and firmware is expected to offer boot “from USB” while actually attempting to load this file from removable device. For true HDD it is implementation defined. Also, I meant replicating ESP including \EFI\opensuse directory, and this directory is definitely not used by default.

I investigated further, with the aim of finding a reliable enough procedure to get a bootable machine in the case of a degraded RAID1 where the /boot/efi partition is lost along with the broken disk.
The best procedure I was able to come up with is:

during the installation:

  1. give both disks the same partition scheme, so that both disks have an EFI vfat partition at the beginning, WITHOUT raid
  2. one partition is mounted as the official /boot/efi, the other one is mounted as /boot/efi2 (the mount point is not important)
  3. disable secure boot (I don’t think this is strictly necessary; in my case secure boot was not needed, so disabling it only had the positive effect of reducing the complexity and the number of files inside the EFI partition)
  4. install the system

after the first boot:

  5. cp -av /boot/efi/* /boot/efi2/ (if you disable secure boot the only file to be copied is /boot/efi/EFI/opensuse/grubx64.efi)
  6. edit /etc/fstab and add the nofail option to the /boot/efi partition. This is necessary because dracut refuses to continue booting and drops you into an emergency shell if a partition without the nofail option is not available. (For the same reason you should also add the nofail option to all other non-raid partitions on the same disks.)
  7. run dracut --force. This is necessary because the modified fstab has to be imported into the initrd image
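Step 6 can be done by hand in an editor; as a sketch, the helper below does the same thing non-interactively. The function name add_nofail is made up, and it only prints the modified fstab to stdout, so the result can be checked before it replaces /etc/fstab. The whitespace of the changed line is collapsed to single spaces, which fstab accepts.

```shell
#!/bin/sh
# Sketch: print a copy of an fstab file in which the entry for one mount
# point has "nofail" appended to its options (field 4), unless it is
# already present. Comments and short lines are passed through untouched.
add_nofail() {
    fstab=$1    # path of the fstab file to read, e.g. /etc/fstab
    mnt=$2      # mount point to modify, e.g. /boot/efi
    awk -v mnt="$mnt" '
        /^[ \t]*#/ || NF < 4 { print; next }
        $2 == mnt && index("," $4 ",", ",nofail,") == 0 { $4 = $4 ",nofail" }
        { print }
    ' "$fstab"
}

# Example (as root), followed by step 7:
#   add_nofail /etc/fstab /boot/efi > /etc/fstab.new
#   diff /etc/fstab /etc/fstab.new     # review before replacing the file
#   mv /etc/fstab.new /etc/fstab
#   dracut --force
```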

Reboot the system and you are ready to simulate a failed disk. The system should be able to boot without any manual intervention, whichever disk is broken.

The optimal solution, in my opinion, should be to manage the double copy of the EFI partition automatically by the yast installer and by dracut and grub packages

Just to share, in the hope this can help.

In this case you need to repeat it after every grub2 update. It is actually possible to create a dummy RPM with a trigger script that does it.

Right, thank you very much.

Just to share: I think there is a regression in the latest dracut and/or systemd update that does not allow the system to boot if /boot/efi is missing (with the configuration I described above).
grub is able to load and boot the initrd image, but dracut then refuses to continue and complains that the /boot/efi partition is missing, even though at that point it is no longer needed, because the initrd boot phase is almost finished and the root partition is available.
I opened a bug

What happens if you add the “nofail” option to “/etc/fstab” on the entry for “/boot/efi”? Does it then allow boot to continue after a failure to mount that partition?

I have had the nofail option on /boot/efi since installation time. It worked fine with the freshly installed system, but after a full update the problem appeared. For this reason I suspect a dracut or systemd regression.

While waiting for the bug to be resolved I did a lot of research with the aim of finding at least a workaround, and I discovered something completely new!

In the SLES 12 Storage Administration Guide i found this:

For UEFI machines, you need to set up a dedicated /boot/efi partition. It needs to be VFAT-formatted, and may reside on the RAID 1 device to prevent booting problems in case the physical disk with /boot/efi fails.

In the openSUSE documentation, on the other hand, I did not find anything similar; it only mentions /boot redundancy on MBR formatted disks (the legacy BIOS scenario). But this discovery sparked my curiosity and I did a test installation with /boot/efi on a raid1. To my great surprise, everything worked fine!!! YaST selected raid metadata v1.0 (which means the metadata is at the end of the partitions, so that the two raid members look like two normal vfat partitions from a UEFI point of view) and configured 2 new boot entries in the UEFI NVRAM. I was able to boot the system correctly both in normal conditions and with a degraded raid.
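For reference, a hand-made equivalent of that layout would presumably look like the commands printed by the sketch below. This is an assumption about what an equivalent manual setup looks like, not a record of what YaST actually runs; the function name and device names are placeholders, and the commands are only printed because running them destroys existing data on the partitions.

```shell
#!/bin/sh
# Sketch: print the commands for a RAID1 whose md superblock sits at the
# END of each member (metadata 1.0), so each member still starts with a
# plain VFAT filesystem as far as the firmware can tell.
esp_mirror_cmds() {
    md=$1        # md device to create, e.g. /dev/md0
    part_a=$2    # first ESP partition, e.g. /dev/sda1
    part_b=$3    # second ESP partition, e.g. /dev/sdb1
    printf 'mdadm --create %s --level=1 --raid-devices=2 --metadata=1.0 %s %s\n' \
        "$md" "$part_a" "$part_b"
    printf 'mkfs.vfat %s\n' "$md"
}

# Example: show the commands for the layout described in this post.
#   esp_mirror_cmds /dev/md0 /dev/sda1 /dev/sdb1
# On an existing array, "mdadm --detail /dev/md0" reports the metadata version.
```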

So the truth is that openSUSE has completely handled the problem of EFI redundancy, in a fully automated way (perhaps the first distro to do this, as far as I know), but this feature is completely undocumented and hidden!

My only remaining question: is this something that is still considered experimental, or is it simply an incredible omission in the documentation? May I safely use this powerful hidden feature?

That’s new to me. I will try to test. Did you use Leap or Tumbleweed? Could you describe in more detail how you set up the disks (I presume you had to go into expert mode for this)?

May I safely use this powerful hidden feature?

Well, you already more or less answered this:

These partitions are writable at boot time (before anything is loaded); as firmware is not aware of RAID, writing to individual partitions will cause content mismatch. This has potential to destroy filesystem integrity by fetching the wrong content from the “wrong” mirror piece (in the best case you will not see content you wrote in EFI after booting Linux).
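One way to at least notice such a divergence after the fact is the md "check" scrub, which compares the mirror halves and counts mismatches. The sketch below wraps the two sysfs files involved; the function names are made up, and the optional second argument (an alternative sysfs root) exists only so the helpers can be tried without a real array. A non-zero count says the halves differ, not which half holds the correct data.

```shell
#!/bin/sh
# Sketch: helpers around the md sysfs interface for a given array.
# $1 = array name (e.g. md0), $2 = sysfs root (defaults to /sys).

# Ask the kernel to compare the mirror halves (needs root on a real system).
md_request_check() {
    f="${2:-/sys}/block/$1/md/sync_action"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo check > "$f"
}

# Print the mismatch counter found by the last check.
md_mismatch_count() {
    f="${2:-/sys}/block/$1/md/mismatch_cnt"
    [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    cat "$f"
}

# Example (as root, for the array holding /boot/efi):
#   md_request_check md0
#   cat /proc/mdstat          # wait until the check has finished
#   md_mismatch_count md0
```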

Writing is not something that happens often, but a) it is a common method of passing diagnostic information between EFI (Shell) and the OS, and b) any (boot-time) application may potentially use the ESP for storing persistent information.

So IMHO this is the worst solution - it appears to work most of the time, but it is known to be broken in corner cases and when it breaks it is absolutely mysterious.

I continue to claim that the right solution is to simply have two independent partitions and manually copy information between them when bootloader is updated.

I used Leap 42.3. This is the partition layout:

sda                        8:0    0 20.9G  0 disk  
├─sda1                     8:1    0    1G  0 part  
│ └─md0                    9:0    0    1G  0 raid1 /boot/efi
├─sda2                     8:2    0    1G  0 part  
│ └─md1                    9:1    0    1G  0 raid1 /boot
├─sda3                     8:3    0    1G  0 part  
│ └─md2                    9:2    0    1G  0 raid1 
│   └─cr_swap            254:4    0    1G  0 crypt [SWAP]
├─sda4                     8:4    0   15G  0 part  
│ └─md3                    9:3    0   15G  0 raid1 
│   ├─systemVG-rootLV    254:0    0   10G  0 lvm   /
│   ├─systemVG-homeLV    254:1    0    1G  0 lvm   /home
│   ├─systemVG-privateLV 254:2    0  640M  0 lvm   
│   │ └─cr_private       254:5    0  638M  0 crypt /private
│   └─systemVG-varLV     254:3    0    2G  0 lvm   /var
└─sda5                     8:5    0    1G  0 part  /space1
sdb                        8:16   0 20.9G  0 disk  
├─sdb1                     8:17   0    1G  0 part  
│ └─md0                    9:0    0    1G  0 raid1 /boot/efi
├─sdb2                     8:18   0    1G  0 part  
│ └─md1                    9:1    0    1G  0 raid1 /boot
├─sdb3                     8:19   0    1G  0 part  
│ └─md2                    9:2    0    1G  0 raid1 
│   └─cr_swap            254:4    0    1G  0 crypt [SWAP]
├─sdb4                     8:20   0   15G  0 part  
│ └─md3                    9:3    0   15G  0 raid1 
│   ├─systemVG-rootLV    254:0    0   10G  0 lvm   /
│   ├─systemVG-homeLV    254:1    0    1G  0 lvm   /home
│   ├─systemVG-privateLV 254:2    0  640M  0 lvm   
│   │ └─cr_private       254:5    0  638M  0 crypt /private
│   └─systemVG-varLV     254:3    0    2G  0 lvm   /var
└─sdb5                     8:21   0    1G  0 part  /space2

A separate /boot is probably not necessary.
I tested this layout on both KVM and VirtualBox, simulating a disk failure in both cases.
I also fully patched the system after installation (a new kernel and a new initrd image were installed without problems).
I’m planning a test on physical hardware.

I completely agree with you, and all the information I found points in that direction. For that reason I was implementing the manual procedure I described in my previous posts, but then I ran into a bug, and while waiting for it to be resolved I started experimenting. What surprises me is that this is the official SLES 12 way to obtain EFI partition redundancy!