I have Leap 15.6 running on an old Fujitsu Primergy TX150 S8, which I’ve rebuilt into a dedicated file server – basically a beefy NAS.
The server has a MegaRAID SAS/SATA adapter card – actually a RAID controller – and two backplanes/enclosures, each with four 3.5" slots, but the card recognises those as a single enclosure with 8 slots. The MegaRAID card is administered using a CLI tool called `MegaCli`. I’m not using its RAID features and just create a dedicated “virtual disk”/“drive group” configuration for each individual disk.
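For context, creating the per-disk “virtual disk” on the controller looks roughly like this; the exact syntax may vary between MegaCli versions, and the enclosure/slot/adapter numbers below are just placeholders:

```
# List physical drives with their enclosure/slot IDs
MegaCli -PDList -aAll

# Create a single-drive RAID0 virtual disk / drive group for one disk,
# e.g. enclosure 252, slot 3, adapter 0
MegaCli -CfgLdAdd -r0[252:3] -a0
```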
The OS is running from an SSD, but all the actual data lives on a bunch of HDDs. Each of those disks is used as a whole device. It’s LUKS-encrypted and the device-mapped LUKS volume is then added to a single btrfs filesystem, which is configured as a raid1.
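The per-disk encryption and the filesystem were created roughly along these lines (device and mapper names are placeholders – in reality the mapper names are UUID-based):

```
# For each data disk (used as a whole device, no partition table):
cryptsetup luksFormat /dev/sdb
cryptsetup open /dev/sdb luks-<UUID-of-disk1>

# One btrfs across all the unlocked LUKS volumes, as raid1:
mkfs.btrfs -d raid1 -m raid1 \
    /dev/mapper/luks-<UUID-of-disk1> \
    /dev/mapper/luks-<UUID-of-disk2> \
    /dev/mapper/luks-<UUID-of-disk3> \
    /dev/mapper/luks-<UUID-of-disk4>
```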
I have it set up so the LUKS volumes are unlocked automatically on boot before the data btrfs is mounted. The device file names of the unlocked LUKS volumes are all based on UUIDs, so they’re stable.
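The automatic unlocking is done via /etc/crypttab, something like this (UUIDs, key file path and the /srv/data mount point are placeholders):

```
# /etc/crypttab – one line per data disk
luks-<UUID-of-disk1>  UUID=<UUID-of-disk1>  /etc/luks-keys/data.key  luks
luks-<UUID-of-disk2>  UUID=<UUID-of-disk2>  /etc/luks-keys/data.key  luks

# /etc/fstab – the btrfs is mounted via one of its member devices
/dev/mapper/luks-<UUID-of-disk1>  /srv/data  btrfs  defaults  0  0
```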
There used to be 4 disks in the system and I recently added a fifth disk, set it up the same way with LUKS, added the device-mapped unlocked LUKS volume to the btrfs and started a balance operation. The balance took ~32hrs and at some point while it was running, one of the old disks disappeared from the system!
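Adding the fifth disk boiled down to something like this, after LUKS-formatting and opening it the same way as the others (mapper name and mount point are placeholders again):

```
btrfs device add /dev/mapper/luks-<UUID-of-disk5> /srv/data
btrfs balance start /srv/data
```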
What I mean by that is:
- the device file `/dev/sdb` was gone,
- the disk didn’t show up in the list of physical devices `MegaCli` spits out – `MegaCli` outputs the same kind of message for the disk’s slot as for an empty slot (commands sketched below) – and finally
- the orange fault indicator LED was on, next to the disk’s slot in the enclosure.
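Roughly what I used to check the controller side (enclosure/slot/adapter IDs are placeholders, and the exact invocations may differ between MegaCli versions):

```
MegaCli -PDList -aAll                  # the disk no longer appears in this list
MegaCli -PDInfo -PhysDrv [252:3] -a0   # same kind of output as for an empty slot
```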
I thought the disk had completely died, for sure.
Nonetheless, the LUKS device didn’t disappear, the balance operation continued and – supposedly – “finished successfully”. So `btrfs filesystem show` now lists 5 devices belonging to the filesystem, although `btrfs device stats` shows an enormous number of write errors on that device, which makes sense.
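What I’m looking at, roughly (mount point is a placeholder):

```
btrfs filesystem show /srv/data   # still lists all 5 member devices, including the "ghost"
btrfs device stats /srv/data      # huge write error counters on the vanished device
```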
Against my expectations, I was actually able to revive and reintegrate the disk into the system, although it showed up with a new device file, `/dev/sdf`. But the disk is obviously not backing the unlocked LUKS volume, which is now a sort of “ghost”, so I’m a bit stuck, wondering what to do next.
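This is how the mismatch shows up, roughly (mapper name is a placeholder):

```
lsblk                                       # the revived disk is there as sdf, with nothing stacked on top of it
cryptsetup status luks-<UUID-of-that-disk>  # shows what the mapping thinks its backing device is (not /dev/sdf)
```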
- I don’t just want to reboot the system assuming that will fix it, because I might end up in a worse situation.
- I could unmount the btrfs filesystem, close all the LUKS volumes (or at least the “ghost” one), reopen them and then mount the FS again (sketched in commands after this list). That should work to get this “ghost” volume back into operation, but the FS would still be in an inconsistent state from when the disk was missing. Would a balance and/or scrub be enough to fix that?
- I could remove the LUKS volume for the missing/re-added disk from the filesystem with `btrfs device remove`, run another balance, meanwhile close the LUKS volume and reopen it so it actually connects to the disk, then re-add it to the btrfs and run another balance. That should work, I think, although it would take a long time.
- There’s `btrfs replace`, but I don’t know whether that would do the trick, since it needs both a source and a target device… and they’re the same in my case.
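In commands, the second option would look something like this (mount point and mapper name are placeholders, and whether the scrub and/or balance at the end is actually enough is exactly what I’m unsure about):

```
umount /srv/data
cryptsetup close luks-<UUID-of-that-disk>
cryptsetup open /dev/sdf luks-<UUID-of-that-disk>
mount /srv/data

# then, to repair the inconsistency (?):
btrfs scrub start -B /srv/data
# and/or another balance:
btrfs balance start /srv/data
```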
Does anyone have any tips or pointers on what to do next in this situation?