Is openSUSE appropriate for redundant (ESP + rootfs) and unattended operation?

I don’t know if the title is clear enough, but I need to ship some hardware that will run in a remote area, unattended, so I want to set it up with a redundant ESP and a redundant rootfs, so it can continue to work and boot as usual even when one drive misbehaves or fails completely.

For the ESP I created my own custom mount and sync scripts and systemd units, because I have read that mdadm metadata-1.0 mirrors for ESPs don’t always work and I didn’t see any native/official ESP sync tool on openSUSE. So anyway, I think that part is covered.
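Roughly what my setup does, in case it is useful context (the unit names, mount points and rsync approach are my own, nothing official):

    # /etc/systemd/system/esp-sync.service
    [Unit]
    Description=Mirror the primary ESP to the secondary ESP

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/rsync -a --delete /boot/efi/ /boot/efi2/

    # /etc/systemd/system/esp-sync.timer
    [Unit]
    Description=Run the ESP mirror periodically

    [Timer]
    OnCalendar=daily
    Persistent=true

    [Install]
    WantedBy=timers.target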

But the actual hard part has turned out to be BTRFS. I have done this in the past with ZFS on Proxmox and it was a non-issue, but BTRFS does not seem to be built to boot from a degraded mirror (or degraded array in general) by default, even when there is enough redundancy left. rootflags=degraded needs to be added to the GRUB command line, degraded needs to be added to fstab, and even udev needs to be modified so it doesn’t wait indefinitely for the missing/faulty drive (I didn’t actually manage to achieve this last part).
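For reference, this is roughly the configuration in question (the UUID is a placeholder, and grub2-mkconfig has to be re-run after editing):

    # /etc/default/grub: append rootflags=degraded to the existing options
    GRUB_CMDLINE_LINUX_DEFAULT="... rootflags=degraded"

    # /etc/fstab
    UUID=<btrfs-raid1-uuid>  /  btrfs  defaults,degraded  0  0

    # regenerate the GRUB configuration
    grub2-mkconfig -o /boot/grub2/grub.cfg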

The point is that I’ve read comments on the internet about the dangers of continuously running with rootflags=degraded and degraded in fstab, things like disks being flagged as degraded when they shouldn’t be, or split-brain scenarios, but they don’t elaborate much further, or I don’t understand them. And since you can read almost anything on the internet, I was hoping for:

  1. Someone here with proper knowledge who could explain to me what the actual, specific risks and consequences of running BTRFS like that are: what the dangerous scenarios would be, how we would reach them, and what the consequences would be (slow system? failure to boot? data loss?..)
  2. A proper/official/reliable source explaining the actual reasons why BTRFS is not recommended for degraded, unattended operation.

And in the end, if BTRFS is actually not the tool for the job, I would like to know what the risks of running openSUSE Leap on ZFS are, since there is community support but no official support. So I would like to know how the combination works in a production environment, whether it’s a non-issue or whether there are compatibility or reliability problems.

Or, more generally, whether there is a better way (mdadm?) to configure Leap for this type of work. Thanks for your time.

Hi
Just a comment, would it not be better to run hardware RAID so the operating system just sees a block device?

Worth a read https://btrfs.readthedocs.io/en/latest/Volume-management.html

  1. A btrfs raid1 profile mounted read-write in degraded mode will silently create chunks with the single profile. After replacing the failed device these chunks are not automatically converted back to raid1 and so are not protected; a subsequent failure of the device holding these chunks means data loss (see the sketch after this list).
  2. btrfs does not check whether a device is up to date as long as it appears to match the metadata, so if the missing device suddenly reappears after the filesystem has been running for some time, btrfs will happily attempt to use it - but its content will no longer match, with unpredictable results (this is partially mitigated by btrfs run-time checks, but it means you will get spurious errors when accessing data in the best case).
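For point 1, a minimal sketch of the recovery steps after swapping the failed disk, assuming the filesystem is mounted at / and the new partition is /dev/sdb2 (the devid and device names are placeholders):

    # replace the missing device (devid 2 here) with the new partition
    btrfs replace start 2 /dev/sdb2 /
    # convert any chunks written with the single profile back to raid1;
    # the "soft" modifier only touches chunks not already in the target profile
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /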

Search btrfs mailing list archives.


Probably,

But this is industrial hardware with no PCI(e) expansion. Just one SATA port and one NVMe port.

EDIT: I will read your sources and come back to you. Thanks.

@H25E I assumed it was going to be something industrial… Nothing in the BIOS? I have some Dell systems here that have Intel RAID I can use with SATA/NVMe.

There are devices that do RAID for M.2 drives in 2.5" cases; I see a StarTech “Dual M.2 SATA Adapter with RAID - TAA”.

Implement a robust OTA strategy → A/B partitions

If you dislike BTRFS → Replace all BTRFS partitions with XFS or SquashFS. Remove all unsupported file systems.

Mount the active root partition (A or B) with the read-only flag. Remove all databases from the root partition => remove the rpmdb.
https://rpm.org/user_doc/db_recovery.html
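A hedged example of what the read-only entry for the active root could look like in /etc/fstab (the UUID and the filesystem type are placeholders):

    # root A mounted read-only; root B stays idle until the next A/B switch
    UUID=<root-a-uuid>  /  xfs  ro,noatime  0  0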

Use a Linux distribution with transactional updates.

Thanks for all your answers. I will try to reply to you all to the best of my ability:

@arvidjaar

Nice to know, but this could be fixed by re-balancing the BTRFS volume remotely after changing the disk, right?

Interesting. What happens in that case, when BTRFS gets data from the OK drive and the stale drive and they don’t match? How does BTRFS manage that case? I know that ZFS is designed to take the most up-to-date drive as the source of truth.

@malcolmlewis

To be fair, I didn’t think about MB RAID at all because I have always heard bad things about it and simply forgot about it. Anyway, you made me look, and there is no option there.

Very interesting concept. Didn’t even think about something like that.

However, quoting from the Q&A section of the product page on the manufacturer’s website:

Q: Hello, if one of the drives fails while in RAID 1 configuration, is there a way to find out about the failure and which drive failed through software?

A: There is no way to check via software regarding the health of a raid configuration with the 25S22M2NGFFR. An alternative, would be to look at the LEDs inside the 25S22M2NGFFR to determine if there are any drive errors. Atha, StarTech.com Support

And mixed reviews:

Using the RAID 1 configuration, I had a drive fail and the Adapter did not illuminate the LEDs as they should have been to indicate a drive failure. I had to remove the bad drive before the device could be recognized by any computer. I expected the adapter to continue to work after a single drive failure in the RAID 1 configuration. It defeats the purpose of having a RAID 1 array if the still working redundant disk cannot be accessed while the other disk has failed.

But I will look for better products that follow the same concept. So thanks for that.

@GrandDixence2
This seems to bring things to a higher level. But even then you will need a software raid for the persistent data, with the advantage that if it fails the system still boots. I will think about it, thanks for the idea.

Thank you everybody. I’d be glad to hear back from you if you have any more feedback or new ideas.

Yes. But it requires unallocated space on both disks for new chunks so it may fail even if there is enough free space otherwise.
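To check whether both devices still have unallocated space before running the convert (the mount point is a placeholder), something like:

    # the "Unallocated" value per device must be large enough for new raid1 chunks
    btrfs filesystem usage /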

I am not exactly sure what you are asking, but checksums are part of metadata; if btrfs reads stale metadata, its checksum from the same disk will most likely match, but the generation number will be wrong. Something like the well-known

transid verify failed on 592680566784 wanted 321278 found 320215

To the best of my knowledge, in this case btrfs simply fails the request and does not retry with the other copy.

But even then you will need a software raid for the persistent data, with the advantage that if it fails the system still boots.

A good backup of the persistent data is better than any hardware or software RAID. RAID or not, that is the question. Partition A on drive A, partition B on drive B, or a RAID solution?

Here is an entry point to the world of immutable Linux (A/B process, OSTree, and BTRFS snapshots):

@GrandDixence2 I think @H25E is looking at an uptime and reliability issue, not so much data backup for a remote device…

@arvidjaar

Sorry, I was a little slow here. I was asking how BTRFS would behave in a split-brain scenario, but I was imagining a scenario where BTRFS would read all the needed bytes from both disks, compare them, and detect that they don’t match. That’s probably not the case at all, since reads are distributed between the drives.

I’m not an expert by any means, but that sounds like wasted potential. If the array has enough redundancy and something goes wrong, why not look in another place for that same data?

@GrandDixence2 Thanks, I will keep it in mind.

@malcolmlewis
Yes, the most important bits are uptime and “progressive” failure, which would allow planned maintenance instead of reactive maintenance. Data backup also plays a role, because remote places sometimes have connectivity problems, which can mean that the local data copy is the only one, at least temporarily. But usually that works just fine.

From what I’m seeing, ZFS just seems much more mature for unattended operation and array management. It’s a shame because of its license incompatibility with Linux, but it is what it is. Has anyone tried OpenZFS on Leap in production? Or is it something mostly for consumer/desktop use?

What I’m most worried about is an update breaking ZFS/kernel compatibility if the ZFS community project lags behind after a new kernel release, leaving the system bricked and needing local intervention. Does anyone have any experience here?

It’s a pity, because it feels like ZFS would solve all my problems, but it seems to severely lack official support from Linux distros, and I don’t know whether relying on something community-backed for a job like this is a crazy idea or not.

Go to Ubuntu Server 24.04 LTS if you want ZFS support for production systems:
https://wiki.ubuntu.com/ZFS

What I’m most worried about is an update breaking ZFS/kernel compatibility if the ZFS community project lags behind after a new kernel release, leaving the system bricked and needing local intervention.

And take the x.y.0 LTS kernel from Ubuntu. => 5 years of silence. 5 years without any kernel update that breaks ZFS compatibility.

Do not do any (open-)ZFS experiments with community-driven Linux distributions like openSUSE, Debian and so on.

Interesting.

I thought that Ubuntu started to drop ZFS in 22.04, but it seems they only deprecated zsys and ZFS support was more or less maintained. However, it seems that ZFS on the rootfs is only supported by the desktop installer. But root ZFS compatibility on Ubuntu Desktop vs Ubuntu Server should probably be the same, right?

Also, Ubuntu seems to have native tools to sync multiple ESPs, which I would probably trust more than my custom-made code.

So it seems like the final decision is between using Ubuntu (Server, or Desktop with the undesired packages removed) or setting up a system with openSUSE + OpenZFS with limited updates (no kernel or ZFS updates, only security ones).

Would both approaches be feasible?
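By “limited updates” I mean something along these lines (a hedged sketch; the package name patterns are assumptions):

    # pin the packages that must not move between maintenance windows
    zypper addlock 'kernel-default*' 'zfs*'
    # review and apply only security-category patches
    zypper list-patches --category security
    zypper patch --category security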

or setting up a system with openSUSE + OpenZFS with limited updates (no kernel or ZFS updates, only security ones)

The Linux kernel is a monster of over 50 million lines of code (LOC). Rule of thumb: 1 security hole per 1000 lines of code. That leaves room for many, many security holes in the Linux kernel.
https://www.heise.de/en/news/Linux-Criticism-reasons-and-consequences-of-the-CVE-flood-in-the-kernel-9963850.html

All enterprise Linux distributions receive monthly Linux kernel security updates.

Check with (Open-)SCAP for missing security updates (checkSecurity.sh):
https://forums.opensuse.org/t/measures-to-harden-an-opensuse-install-and-to-run-an-opensuse-system-securely/180953/17

I cannot comment on the ZFS part, but kernel updates in stable distributions should be (upward) kABI compatible. To make use of that, kernel modules have to be packaged as a KMP, though.

Okay, got it. I’ll ditch the openSUSE + ZFS idea.

If I need to distro-hop and start the Ubuntu journey so I can use ZFS, I will. But I think I was missing an option: mdadm.

How would a ZFS mirror on Ubuntu compare to an mdadm mirror on openSUSE? Aside from not having checksums, obviously, which would be partially solvable by putting BTRFS on top of mdadm, which should allow detecting, but not fixing, checksum errors.

Would mdadm in general be feasible for something like this?
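What I have in mind is roughly this (a hedged sketch; the device names are placeholders):

    # mirror the two rootfs partitions across the SATA and NVMe drives
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/nvme0n1p2
    # single-device btrfs on top: checksums can detect silent corruption,
    # but there is no second btrfs copy to repair from automatically
    mkfs.btrfs /dev/md0
    # record the array so it gets assembled at boot
    mdadm --detail --scan >> /etc/mdadm.conf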

Thanks for your time

Well, the filesystems / zfs project on the openSUSE Build Service packages zfs as a KMP. You could at least give it a try.
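If you decide to try it, it would look something along these lines (the repository comes from that filesystems project on OBS; the exact package names are assumptions):

    # after adding the filesystems repository matching your Leap release:
    zypper refresh
    zypper install zfs zfs-kmp-default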

Sorry, I didn’t read your message when I posted mine.

How safe is the “should” in your message? How likely is it that I end up with a bricked remote box because of running OpenZFS on openSUSE Leap?

Is Leap a slow enough distribution that the kernel and ZFS wouldn’t go out of sync even if there is some lag from the community repo?

Is there any difference if I update my remote box with zypper patch instead of zypper up?

Failure to be kABI compatible is a bug. Bugs happen. I personally am not aware of such cases.
