How to fix btrfs on LUKS volumes after a disk was temporarily lost

I have Leap 15.6 running on an old Fujitsu Primergy TX150 S8, which I’ve rebuilt into a dedicated file server; basically a beefy NAS.

The server has a MegaRAID SAS/SATA adapter card – actually a RAID controller – and two backplanes/enclosures, each with four 3.5" slots, but the card recognises those as a single enclosure with 8 slots. The MegaRAID card is administered using a CLI tool called MegaCli. I’m not using its RAID features and always just create a dedicated “virtual disk”/“drive group” configuration for each disk.

The OS is running from an SSD, but all the actual data lives on a bunch of HDDs. Each of those disks is used as a whole device: it’s LUKS-encrypted, and the device-mapped LUKS volume is then added to a single btrfs filesystem configured as raid1.

I have it set up so the LUKS volumes are unlocked automatically on boot before the data btrfs is mounted. The device file names of the unlocked LUKS volumes are all based on UUIDs, so they’re stable.
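
(For completeness, this kind of setup is done via /etc/crypttab plus an fstab entry; a minimal sketch, with the keyfile path being a placeholder rather than my actual configuration:)

# /etc/crypttab: one entry per data disk, the mapping named after the LUKS UUID
luks-b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7  UUID=b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7  /etc/cryptkeys/data.key  luks

# /etc/fstab: any one member UUID mounts the whole multi-device btrfs
UUID=6a40d64e-ff83-4f08-a5b2-1236dd4add01  /mnt2/Main  btrfs  defaults  0  0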

There used to be 4 disks in the system and I recently added a fifth disk, set it up the same way with LUKS, added the device-mapped unlocked LUKS volume to the btrfs and started a balance operation. The balance took ~32hrs and at some point while it was running, one of the old disks disappeared from the system!
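
(In terms of commands, the addition looked roughly like this; the raw device name and the exact cryptsetup parameters are placeholders:)

cryptsetup luksFormat /dev/sdX
cryptsetup open /dev/sdX luks-<new-UUID>
btrfs device add /dev/mapper/luks-<new-UUID> /mnt2/Main
btrfs balance start /mnt2/Main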

What I mean by that is:

  • the device file /dev/sdb was gone,
  • the disk didn’t show up in the list of physical devices MegaCli spits out,
  • MegaCli outputs the same kind of message for the disk’s slot as for an empty slot, and finally
  • the orange fault indicator LED was on, next to the disk’s slot in the enclosure.

I thought the disk had completely died, for sure.

Nonetheless, the LUKS device didn’t disappear and the balance operation continued and – supposedly – “finished successfully”. So btrfs filesystem show now lists 5 devices belonging to the filesystem, although btrfs device stats shows an enormous number of write errors, which makes sense.

Against my expectations, I was actually able to revive and reintegrate the disk into the system, although it showed up with a new device file /dev/sdf. But the disk is obviously not backing the unlocked LUKS volume, which is now a sort of “ghost”, so now I’m a bit stuck, wondering what to do next.

  1. I don’t just want to reboot the system assuming that will fix it, because I might end up in a worse situation.
  2. I could unmount the btrfs filesystem, close all the LUKS volumes (or at least the “ghost” one), reopen them and then mount the FS again. That should work to get this “ghost” volume back into operation, but the FS would still be in an inconsistent state from when the disk was missing. Would a balance and/or scrub be enough to fix that?
  3. I could remove the LUKS volume for the missing/re-added disk from the filesystem with btrfs device remove, run another balance, meanwhile close the LUKS volume, reopen it so it actually connects to the disk, then re-add it to the btrfs and run another balance. That should work, I think, although it would take a long time.
  4. There’s btrfs replace, but I don’t know whether that would do the trick, since it needs both a source and a target device… and they’re the same in my case. (See the syntax sketch right after this list.)
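
For reference, the general form of the command, as documented in the btrfs-replace man page, is:

btrfs replace start <srcdev>|<devid> <targetdev> <mountpoint>
btrfs replace status <mountpoint>

i.e. the source can be given either as a device path or as its numeric btrfs device ID, which matters when the source is no longer accessible.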

Does anyone have any tips or pointers on what to do next in this situation?

Why?

Did it not occur to you that there must be a reason for the device to fail? Did you investigate that reason?

Exactly. btrfs is rather bad at detecting outdated devices. Never attempt to mount a btrfs filesystem where some device re-appeared after having dropped off. You are lucky that there was an additional layer (LUKS) which protected you from attempting to access this device. That is a sure recipe for filesystem corruption.

If you still intend to use this disk, you should format it with cryptsetup luksFormat to be sure no traces of the original btrfs filesystem metadata remain. Option 4 is faster than option 3 because it only needs to re-create the information from the missing disk. The end result is the same (btrfs replace is nothing more than a specialized btrfs balance).

You cannot use the same disk without wiping it first or otherwise changing its identity, and after it’s wiped it is not the same disk.

And in general - we want to see the actual command output, not stories. In particular, btrfs has a habit of creating single chunks on a raid1 profile when a disk is missing. Show:

lsblk -f
btrfs filesystem usage -T /

Thank you very much for your response, @arvidjaar! :grinning:

The adapter card’s RAID is too restrictive and not as powerful as btrfs’s RAID features:

  1. You can’t use disks of uneven sizes in the array.
  2. RAID-1 via the adapter card is plain mirroring, so you only get the capacity of a single disk. btrfs raid1 keeps two copies of every chunk spread across the devices, so you get roughly half the total capacity. With three or more disks in the array that’s a big difference: with three 6 TB disks, for example, a hardware mirror gives you 6 TB usable, while btrfs raid1 gives you about 9 TB. You can use RAID-10 to work around that, but it only works for even numbers of disks… another limitation.
  3. It’s finicky to configure and the CLI tool is very unpleasant to use.

It did and I did.

To my dismay, I was not able to ascertain the root cause.

But I did run an extended S.M.A.R.T. selftest and waited for it to finish (without errors, of course) before reintegrating the disk.
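
(In case it helps anyone else: for disks behind the MegaRAID controller this can be done with smartmontools, roughly as below, where N is the physical drive’s device ID on the controller; treat the -d syntax as an assumption to verify for your own setup.)

smartctl -d megaraid,N -t long /dev/sdf
smartctl -d megaraid,N -a /dev/sdf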

So you’re saying that simply re-adding the same device and running a full balance and/or scrub would not be enough to fix the problems.

And no other way to fix them exists, either. (I’m thinking btrfs check or btrfs rescue.)

OK. Would that change the UUID?

Because otherwise the unlocked LUKS volume would end up getting the same device file path again, and I don’t know whether btrfs replace would work then.

Hmm… it’s interesting that you would phrase it that way, because I did think about this when writing my post.

See, I tried to explain not just the current situation, but what led to it – the background of the incident – including relevant details and leaving out irrelevant ones.

It’s hard to gauge that, though. So, for instance, I didn’t get into the details of why I don’t use the adapter card’s RAID features, because I thought they were irrelevant… yet that point was precisely one you were interested in and asked about.

And so the post got quite long, and I feared it might turn people off and they might not read it all or respond to it. That’s why I didn’t want to make it even longer by including CLI output.

But even if I had done that, it would have been stuff like the output from btrfs filesystem show (to illustrate that btrfs still considers five devices as part of the FS) and other commands I had run on my server, not the specific ones that you now want me to run.

So how could I have predicted those?

Maybe next time I’ll write a much, much terser 3-5 sentence description and include related output from 5-15 different commands I ran, and then someone else can tell me I’m doing forum wrong and they generally want to know what the story is…

Anyway, here’s the output you asked for:

# lsblk -f
NAME                                        FSTYPE      FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda                                         crypto_LUKS 1           b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7                
└─luks-b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7 btrfs             Main  6a40d64e-ff83-4f08-a5b2-1236dd4add01      4T    64% /mnt2/Archive
                                                                                                                        [...]
sdc                                         crypto_LUKS 1           e35e3681-c01d-44aa-9305-a5171f6da24f                
└─luks-e35e3681-c01d-44aa-9305-a5171f6da24f btrfs             Main  6a40d64e-ff83-4f08-a5b2-1236dd4add01                
sdd                                         crypto_LUKS 1           d3295051-2128-4282-98ec-d138622a8c09                
└─luks-d3295051-2128-4282-98ec-d138622a8c09 btrfs             Main  6a40d64e-ff83-4f08-a5b2-1236dd4add01                
sde                                         crypto_LUKS 1           b3956446-40bd-4546-a103-0b12ba71a9e6                
└─luks-b3956446-40bd-4546-a103-0b12ba71a9e6 btrfs             Main  6a40d64e-ff83-4f08-a5b2-1236dd4add01                
sdf                                         crypto_LUKS 1           06f1927d-7ace-4e73-ab1f-efa06970318b                
nvme0n1                                                                                                                 
├─nvme0n1p1                                                                                                             
├─nvme0n1p2                                 vfat        FAT16 EFI   47BE-CAEE                              61,3M     4% /boot/efi
├─nvme0n1p3                                 swap        1     SWAP  8629ac99-cc06-49d6-a01d-735452cfee86                [SWAP]
└─nvme0n1p4                                 crypto_LUKS 2           392205b5-14ca-4435-8b0a-1b3a03b67608                
  └─luks                                    btrfs             ROOT  1b210383-e04f-488a-9de8-a1354dcc879f  917,7G     1% /var
                                                                                                                        /usr/local
                                                                                                                        /tmp
                                                                                                                        /srv
                                                                                                                        /root
                                                                                                                        /opt
                                                                                                                        /home
                                                                                                                        /boot/grub2/x86_64-efi
                                                                                                                        /boot/grub2/i386-pc
                                                                                                                        /.snapshots
                                                                                                                        /

 # btrfs filesystem usage -T /mnt2/Main
Overall:
    Device size:                  27.29TiB
    Device allocated:             17.43TiB
    Device unallocated:            9.86TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         17.39TiB
    Free (estimated):              4.95TiB      (min: 4.95TiB)
    Free (statfs, df):             3.96TiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

                                                         Data    Metadata System                             
Id Path                                                  RAID1   RAID1    RAID1    Unallocated Total    Slack
-- ----------------------------------------------------- ------- -------- -------- ----------- -------- -----
 1 /dev/mapper/luks-b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7 3.48TiB  4.00GiB 32.00MiB     1.97TiB  5.46TiB     -
 2 /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b 3.48TiB  4.00GiB        -     1.97TiB  5.46TiB     -
 3 /dev/mapper/luks-e35e3681-c01d-44aa-9305-a5171f6da24f 3.48TiB  5.00GiB        -     1.97TiB  5.46TiB     -
 4 /dev/mapper/luks-d3295051-2128-4282-98ec-d138622a8c09 3.48TiB  5.00GiB 32.00MiB     1.97TiB  5.46TiB     -
 5 /dev/mapper/luks-b3956446-40bd-4546-a103-0b12ba71a9e6 3.48TiB  4.00GiB        -     1.97TiB  5.46TiB     -
-- ----------------------------------------------------- ------- -------- -------- ----------- -------- -----
   Total                                                 8.70TiB 11.00GiB 32.00MiB     9.86TiB 27.29TiB 0.00B
   Used                                                  8.68TiB 10.38GiB  1.22MiB             

I hope this is of some help to you. Personally, I don’t see how this shows whether or not there are single chunks on the raid1.

I’m happy to run any other commands you might find helpful and post the output here.

There are not. (If there were, the usage table would have an extra column for the single profile and the Overall section would report Multiple profiles: yes.)

Where does /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b come from? Is this LUKS device unlocked? It is not shown in the lsblk output.

Did you unlock it earlier? Which device is the faulty one?

ls -l /dev/mapper
btrfs filesystem show /mnt2/Main
btrfs device stats -T /mnt2/Main

This is the device that used to belong to the detached disk, before it went missing. It is still here from earlier, even though it is not “attached” to any actual disk anymore. It is present in the system:

# ls -l /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b
lrwxrwxrwx 1 root root 7 Okt 21 12:21 /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b -> ../dm-1
:~ # ls -l /dev/dm-1
brw-rw---- 1 root disk 254, 1 Okt 21 13:12 /dev/dm-1

… even though it is missing from the lsblk output.

Curious, I know.

It was unlocked at boot and so I guess it just stuck around when the actual disk (which was /dev/sdb) went missing.

When I re-inserted the disk (and re-imported the config for it into the adapter card, which considered it “foreign” for some reason), it got a different device file /dev/sdf, but the UUID remained the same, as you spotted.

I did not unlock the LUKS volume for the re-integrated disk.

/dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b is essentially a “phantom” or “ghost” device… btrfs, and Linux at large, seem to think it’s there, but using it only generates read/write errors.
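
(For anyone wanting to inspect such a stale mapping on their own system, cryptsetup status should show what it still references; a sketch:)

cryptsetup status luks-06f1927d-7ace-4e73-ab1f-efa06970318b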

# ls -l /dev/mapper
total 0
crw------- 1 root root 10, 236 Okt 21 13:12 control
lrwxrwxrwx 1 root root       7 Okt 21 12:21 luks -> ../dm-0
lrwxrwxrwx 1 root root       7 Okt 21 12:21 luks-06f1927d-7ace-4e73-ab1f-efa06970318b -> ../dm-1
lrwxrwxrwx 1 root root       7 Okt 21 13:08 luks-b3956446-40bd-4546-a103-0b12ba71a9e6 -> ../dm-5
lrwxrwxrwx 1 root root       7 Okt 21 12:21 luks-b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7 -> ../dm-2
lrwxrwxrwx 1 root root       7 Okt 21 12:21 luks-d3295051-2128-4282-98ec-d138622a8c09 -> ../dm-3
lrwxrwxrwx 1 root root       7 Okt 21 12:21 luks-e35e3681-c01d-44aa-9305-a5171f6da24f -> ../dm-4

# btrfs filesystem show /mnt2/Main
Label: 'Main'  uuid: 6a40d64e-ff83-4f08-a5b2-1236dd4add01
        Total devices 5 FS bytes used 8.69TiB
        devid    1 size 5.46TiB used 3.49TiB path /dev/mapper/luks-b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7
        devid    2 size 5.46TiB used 3.49TiB path /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b
        devid    3 size 5.46TiB used 3.49TiB path /dev/mapper/luks-e35e3681-c01d-44aa-9305-a5171f6da24f
        devid    4 size 5.46TiB used 3.49TiB path /dev/mapper/luks-d3295051-2128-4282-98ec-d138622a8c09
        devid    5 size 5.46TiB used 3.49TiB path /dev/mapper/luks-b3956446-40bd-4546-a103-0b12ba71a9e6

# btrfs device stats -T /mnt2/Main
Id Path                                                  Write errors Read errors Flush errors Corruption errors Generation errors
-- ----------------------------------------------------- ------------ ----------- ------------ ----------------- -----------------
 1 /dev/mapper/luks-b52cf99d-e8f4-4193-9fe0-694ce5b2b6d7            0           0            0                 0                 0
 2 /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b    117180430    58407321            0                 0                 0
 3 /dev/mapper/luks-e35e3681-c01d-44aa-9305-a5171f6da24f            0           0            0                 0                 0
 4 /dev/mapper/luks-d3295051-2128-4282-98ec-d138622a8c09            0           0            0                 0                 0
 5 /dev/mapper/luks-b3956446-40bd-4546-a103-0b12ba71a9e6            0           0            0                 0                 0

The output from the last command makes it clear which device is the one with the problem, as I had written in my original post.

So is your suggestion to:

  1. close the LUKS volume so that /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b disappears,
  2. then wipe the device in a way that changes the UUID
  3. set it up to be automounted and mount it again
  4. and then run btrfs replace with /dev/mapper/luks-06f1927d-7ace-4e73-ab1f-efa06970318b as the source device and whatever the new one will be called as the target?

It will not disappear as long as it is held open by the mounted filesystem. And after you unmount the filesystem you may need to mount with the option -o degraded.

I would say

  1. Unmount
  2. Reformat device, change UUID
  3. Mount
  4. Run btrfs replace. The original device will be missing and shown as missing.
  5. Setup automount if required.

But you may want to ask on the btrfs mailing list, where the developers are also listening. There may be something I did not consider. One problem is that btrfs caches devices in the kernel; that may interfere.
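
(If the cached device list does get in the way, newer btrfs-progs can drop stale, unmounted devices from it; whether the version shipped with Leap 15.6 supports this option is something to verify:)

btrfs device scan --forget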

Ah, those are some important points! Thank you.

Your plan makes sense and I think I’ll follow that approach, but I’ll consult the btrfs mailing list, to be on the safe side.

Thank you so much for your help, @arvidjaar!

I’ll report back here with my progress, although it may be a little while until I can actually work on the server again.

@gniebler Hi, this is just a comment in passing: I would expect that on system boot there is an option to enter the firmware/BIOS setup on the controller? I’m on a Dell system with VROC and an MR9440-8i and I do all the disk setup there. It has a device configuration option. But on other systems I work on, there is a Ctrl+C key press available.

E-h-h … I am not sure what I was thinking. Of course there is no need to unmount the filesystem. Just format the disk and start the device replacement. You will get a lot of new read errors, but it should work just like the balance worked.

In any case, the sooner you remove the stale btrfs metadata from this disk the better.
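
(A minimal sketch of that, assuming the re-added disk is still /dev/sdf and using devid 2 as the replace source, since that is the member showing all the errors in the stats above; the new mapping name is only an example:)

cryptsetup luksFormat /dev/sdf
cryptsetup open /dev/sdf luks-<new-UUID>
btrfs replace start 2 /dev/mapper/luks-<new-UUID> /mnt2/Main
btrfs replace status /mnt2/Main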

MegaRAID keeps an internal event log which you can dump with MegaCli. If you need help interpreting it, upload it to https://paste.opensuse.org/. It is probably better to create a new topic in the Hardware section for that.
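
(From memory, the dump is something along these lines; check MegaCli’s own help for the exact flags:)

MegaCli -AdpEventLog -GetEvents -f events.log -aALL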


This is true and I have used this in the past. I found the GUI that comes up quite usable for disk setup, but it has two important drawbacks (for me):

  1. It is a GUI, so it needs a monitor and ideally even a mouse attached to the server, which I want to run as a headless system. An IP-KVM would help there, but I don’t have one of those (yet).
  2. I would have to reboot the system every time I want to add, remove or replace a disk. And one of the things I found attractive about this piece of kit is that I wouldn’t have to do that anymore, because it has an HDD enclosure with a backplane and drive caddies and you can just hotplug disks at runtime.

So I’m afraid it’s MegaCLI for me, for the foreseeable.

@gniebler what controller is it? If it’s like mine there are remote tools, but I just use Cockpit… Mine is the same, hot swap and caddies… :wink:
My current setup (as of today, it’s a test system running Leap 16.0…)
https://paste.opensuse.org/pastes/a6268f721f99

It’s a Fujitsu RAID Ctrl SAS 6G 5/6 512MB (D2616) with an LSI MegaRAID chip… pretty old kit by now, but it still seems to work.

I’ll have to look around for some more convenient tools after the current crisis is sorted out…

FWIW, I got confirmation from the btrfs mailing list that this is the best approach.

I’ve been extremely busy with other stuff lately, but I’ll report back here as soon as I can actually work on this again.

As a first step, I marked the disk as offline with the MegaCLI utility. But that led to the disk completely disappearing, as if I had ripped it from the enclosure. To be clear: that’s not what should happen when you do this.

So I finally decided I’d had enough of this disk and replaced it with a new one. I set up full disk encryption, unlocked the LUKS volume, and the btrfs replace operation is running now.

What’s mildly interesting is that I tried several times to use the full device file path /dev/mapper/luks-{UUID} as the source device and it didn’t work, but there was no error message: btrfs replace start ... just returned, but then btrfs replace status ... said Never started. The docs say the path should work for the source device, unless it’s disconnected. So apparently btrfs did recognise the device as disconnected, in some way.

Using the device ID worked though and the replace is progressing slowly, but surely.

Once it’s finished, I should probably run a btrfs balance and btrfs scrub against the FS, right? Anything else, in addition?

I do not see why btrfs balance would be necessary. The filesystem already is nicely balanced (at least, according to the output you provided earlier) and device replacement simply moves (or, in this case, reconstructs) data from one device onto another device.

Scrub of course never hurts.
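
(That would be something like:)

btrfs scrub start /mnt2/Main
btrfs scrub status /mnt2/Main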

Well, I just thought so because the last balance ran for about half of its time with a missing device, even though that didn’t stop it.

You mean because the “used” numbers in the output of btrfs filesystem show are all the same?

So the replace operation will make sure that the blocks btrfs thinks are on the failed disk will actually end up on the replacement disk, thus keeping the FS balanced.

That makes sense.

I think I’ll skip the balance operation, esp. since I’ll probably add yet another disk soon after this is all resolved.

Correct. Which is why device replacement is usually faster than device removal/device addition.


The replace operation finished successfully and the FS is in a good state again.

Thank you @arvidjaar for your help!