Btrfs / mdadm RAID Issue: Missing LUKS Header

Hello,

About a year ago, I created a Btrfs-based RAID1 using the YaST Partitioner. I also enabled the disk encryption option during the process.

Now, all of a sudden, the RAID array can no longer be opened with LUKS.
Why is this happening out of the blue?

The error message for the RAID /dev/md/NAS reads:

LUKS keyslot 4 is invalid.
Device /dev/md/NAS is not a valid LUKS device.

It’s also worth noting that keyslot 4 wasn’t even used. Only keyslots 1 and 2 were set…
The output from mdadm --detail and mdadm --examine shows the following:

nasserver:/home/admin # mdadm --detail /dev/md/NAS

/dev/md/NAS:
Version : 1.0
Creation Time : Mon May 22 06:00:38 2023
Raid Level : raid1
Array Size : 15625879360 (14.55 TiB 16.00 TB)
Used Dev Size : 15625879360 (14.55 TiB 16.00 TB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Jul 31 01:48:24 2024
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : any:NAS
UUID : bbf107f6:69f4d874:565b9db2:bd1a37b7
Events : 604873

Number Major Minor RaidDevice State
0 8 32 0 active sync /dev/sdc
1 8 0 1 active sync /dev/sda
2 8 16 2 active sync /dev/sdb

nasserver:/home/admin # mdadm --misc --examine --verbose /dev/sd[abc]

/dev/sda:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : bbf107f6:69f4d874:565b9db2:bd1a37b7
Name : any:NAS
Creation Time : Mon May 22 06:00:38 2023
Raid Level : raid1
Raid Devices : 3
Avail Dev Size : 31251759016 sectors (14.55 TiB 16.00 TB)
Array Size : 15625879360 KiB (14.55 TiB 16.00 TB)
Used Dev Size : 31251758720 sectors (14.55 TiB 16.00 TB)
Super Offset : 31251759088 sectors
Unused Space : before=0 sectors, after=296 sectors
State : clean
Device UUID : e65eef70:fe0e7627:2a07bd6a:2ae8cd7c
Internal Bitmap : -72 sectors from superblock
Update Time : Wed Jul 31 01:48:24 2024
Bad Block Log : 512 entries available at offset -8 sectors
Checksum : d8035600 - correct
Events : 604873
Device Role : Active device 1
Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdb:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : bbf107f6:69f4d874:565b9db2:bd1a37b7
Name : any:NAS
Creation Time : Mon May 22 06:00:38 2023
Raid Level : raid1
Raid Devices : 3
Avail Dev Size : 35156656040 sectors (16.37 TiB 18.00 TB)
Array Size : 15625879360 KiB (14.55 TiB 16.00 TB)
Used Dev Size : 31251758720 sectors (14.55 TiB 16.00 TB)
Super Offset : 35156656112 sectors
Unused Space : before=0 sectors, after=3904897320 sectors
State : clean
Device UUID : 4eefa158:7e983fa6:9b63c49b:636d4179
Internal Bitmap : -72 sectors from superblock
Update Time : Wed Jul 31 01:48:24 2024
Bad Block Log : 512 entries available at offset -8 sectors
Checksum : 3d7a5196 - correct
Events : 604873
Device Role : Active device 2
Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdc:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : bbf107f6:69f4d874:565b9db2:bd1a37b7
Name : any:NAS
Creation Time : Mon May 22 06:00:38 2023
Raid Level : raid1
Raid Devices : 3
Avail Dev Size : 35156656040 sectors (16.37 TiB 18.00 TB)
Array Size : 15625879360 KiB (14.55 TiB 16.00 TB)
Used Dev Size : 31251758720 sectors (14.55 TiB 16.00 TB)
Super Offset : 35156656112 sectors
Unused Space : before=0 sectors, after=3904897320 sectors
State : clean
Device UUID : f187ec83:6ce4ae41:e5e52705:0f0fc0da
Internal Bitmap : -72 sectors from superblock
Update Time : Wed Jul 31 01:48:24 2024
Bad Block Log : 512 entries available at offset -8 sectors
Checksum : cf165a1a - correct
Events : 604873
Device Role : Active device 0
Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

According to mdadm, all checksums are correct. A SMART test with the manufacturer’s tools also confirmed that all hard drives are error-free. Therefore, I don’t believe it’s a hardware issue.

I rolled the system back with Snapper to an earlier snapshot in which the RAID was still working without any issues. However, this didn’t make any difference, so I don’t believe the problem is caused by faulty installed software.

I was able to determine that there is a valid LUKS header at the beginning of the drive with device number 2 (Device Role: Active device 2, currently /dev/sdb). This header is recognized when the drive is accessed directly without mdadm.
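
For reference, this is roughly how I checked the members directly (reconstructed from memory):

cryptsetup isLuks /dev/sdb && echo "LUKS header found"   # exits 0 only if a valid header is present
cryptsetup luksDump /dev/sdb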

Therefore, I created a new mdadm RAID array with mdadm --create so that the drive with the LUKS header is in the first position. The paths change from /dev/sdX to /dev/mapper/sdX because, from this point on, I am working with overlays of the sdX devices:

mdadm --create /dev/md/NAS --assume-clean --level=1 --metadata=1.0 --raid-devices=3 /dev/mapper/sdb /dev/mapper/sdc /dev/mapper/sda

mdadm: /dev/mapper/sdb appears to be part of a raid array:
level=raid1 devices=3 ctime=Mon May 22 06:00:38 2023
mdadm: /dev/mapper/sdc appears to be part of a raid array:
level=raid1 devices=3 ctime=Mon May 22 06:00:38 2023
mdadm: /dev/mapper/sda appears to be part of a raid array:
level=raid1 devices=3 ctime=Mon May 22 06:00:38 2023
mdadm: largest drive (/dev/mapper/sdb) exceeds size (15625879360K) by more than 1%
Continue creating array? y

When creating the array, the message “largest drive exceeds size by more than 1%” is displayed. However, since --assume-clean means nothing is actually rewritten, this shouldn’t be a problem. Or am I mistaken?

Now the RAID array can be opened with LUKS and mounted again.
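
The open and mount step itself was roughly this (the mapping name data is the one used below):

cryptsetup open /dev/md/NAS data     # prompts for the LUKS passphrase, creates /dev/mapper/data
mount /dev/mapper/data /mnt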

A test with btrfs check shows that the filesystem has errors and that some checksums are missing. However, the data itself is still accessible.

btrfs check --check-data-csum /dev/mapper/data

Opening filesystem to check…
Checking filesystem on /dev/mapper/data
UUID: b7398a14-99e6-4245-b152-85d4a22e7155
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 259 inode 17447 errors 1000, some csum missing
root 7968 inode 17447 errors 1000, some csum missing
root 8540 inode 17447 errors 1000, some csum missing
root 10472 inode 17447 errors 1000, some csum missing
root 11035 inode 17447 errors 1000, some csum missing
root 12429 inode 17447 errors 1000, some csum missing
root 13718 inode 17447 errors 1000, some csum missing
root 14494 inode 17447 errors 1000, some csum missing
root 15513 inode 17447 errors 1000, some csum missing
root 17341 inode 17447 errors 1000, some csum missing
root 19777 inode 17447 errors 1000, some csum missing
root 19897 inode 17447 errors 1000, some csum missing
root 20018 inode 17447 errors 1000, some csum missing
root 20139 inode 17447 errors 1000, some csum missing
root 20260 inode 17447 errors 1000, some csum missing
root 20380 inode 17447 errors 1000, some csum missing
root 20501 inode 17447 errors 1000, some csum missing
root 20621 inode 17447 errors 1000, some csum missing
root 20743 inode 17447 errors 1000, some csum missing
root 20864 inode 17447 errors 1000, some csum missing
root 20869 inode 17447 errors 1000, some csum missing
root 20874 inode 17447 errors 1000, some csum missing
root 20879 inode 17447 errors 1000, some csum missing
root 20884 inode 17447 errors 1000, some csum missing
root 20889 inode 17447 errors 1000, some csum missing
root 20894 inode 17447 errors 1000, some csum missing
root 20899 inode 17447 errors 1000, some csum missing
root 20904 inode 17447 errors 1000, some csum missing
root 20909 inode 17447 errors 1000, some csum missing
root 20914 inode 17447 errors 1000, some csum missing
root 20919 inode 17447 errors 1000, some csum missing
root 20924 inode 17447 errors 1000, some csum missing
root 20929 inode 17447 errors 1000, some csum missing
root 20934 inode 17447 errors 1000, some csum missing
root 20939 inode 17447 errors 1000, some csum missing
root 20944 inode 17447 errors 1000, some csum missing
root 20949 inode 17447 errors 1000, some csum missing
root 20954 inode 17447 errors 1000, some csum missing
root 20959 inode 17447 errors 1000, some csum missing
root 20964 inode 17447 errors 1000, some csum missing
root 20969 inode 17447 errors 1000, some csum missing
ERROR: errors found in fs roots
found 4455135825920 bytes used, error(s) found
total csum bytes: 4337800852
total tree bytes: 12935102464
total fs tree bytes: 4029333504
total extent tree bytes: 3621240832
btree space waste bytes: 2327492901
file data blocks allocated: 444362249580544
referenced 10789058260992

What does “errors 1000” mean? Is that the number of errors?

And were all Btrfs blocks that have checksums free of errors?

Now I have the following questions:

  1. Should I continue with the current approach and let Btrfs repair the filesystem errors with btrfs check --repair? If not, what would be a better approach?
  2. I don’t quite understand why mdadm is being used at the block-device level. Shouldn’t Btrfs be managing the RAID? Or is there another RAID layer on top of the LUKS encryption, this one managed by Btrfs? Shouldn’t Btrfs be aware of the physical block devices beneath the LUKS layer?
  3. What happens if the drive containing the LUKS header fails? In a RAID1 setup, a drive failure should be possible without data loss. Or is the LUKS header also stored on the other drives? I’ve already made a backup of the LUKS header (see the commands below), but unfortunately I have no idea how to restore it if that drive fails.
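
For question 3: this is roughly how I created the header backup (the file name is just an example); I assume restoring works as the mirror operation, but I have not tried it:

cryptsetup luksHeaderBackup /dev/md/NAS --header-backup-file luks-header-NAS.img
# presumably the restore would be (untested):
cryptsetup luksHeaderRestore /dev/md/NAS --header-backup-file luks-header-NAS.img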

Thank you in advance for answering my questions.

My System:

NAME="openSUSE Leap"
VERSION="15.6"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.6"
PRETTY_NAME="openSUSE Leap 15.6"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.6"

Those are inode error flags, which are also printed in plain text: “some csum missing”.

For the blocks that btrfs check examined - probably. It is possible to turn off checksums for individual files. In your case you apparently have filesystem corruption in one inode, which has been replicated into every snapshot. Can you mount this filesystem?

mount -t btrfs -o rescue=all,ro /dev/mapper/data /mnt

No. You should never attempt to use btrfs check --repair unless you are absolutely sure you understand the cause and scope of the problem and the consequences of “fixing” it - something both btrfs check and its man page warn you about. The rule is simple: if you need to ask this question, you should not use it.

Sorry? It is your system, you set it up. If anything, we should be asking you why you set it up this way.

I do not understand this question. You skipped some steps and other steps are really poorly readable due to strange markup. Always post computer output as preformatted text.

What drives are you talking about?

Hello @forrest96er ,

My understanding, and my experience, is that btrfs will eventually fail when installed on mdraid. I made the mistake of running btrfs on mdraid a few years back and could not figure out why btrfs kept failing.

If you want redundancy on btrfs, you should use the raid built into btrfs, which is probably only mature enough for raid 1 equivalency.

https://btrfs.readthedocs.io/en/latest/Status.html
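
Creating such a filesystem is roughly a one-liner (device names here are only placeholders):

mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY   # metadata and data both mirrored by btrfs itself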

I’m sorry I can’t help with your immediate problem. When I had a similar problem, I never did figure out how to recover the corrupted btrfs.

Cheers

This looks like a recipe for disaster. The entire point of raid is that the filesystem used on top will “see” the array as a single block device, usually under “/dev/md”. If you create a luks layer on top of that, the unencrypted block device will be in “/dev/mapper”. The luks header will reside on all disks, not one in particular.
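
In other words, the intended stack looks roughly like this (device and mapping names are only examples):

# three whole disks -> one mdadm RAID1 array:  /dev/sda /dev/sdb /dev/sdc  =>  /dev/md/NAS
cryptsetup open /dev/md/NAS cr_NAS      # LUKS sits on the array, giving /dev/mapper/cr_NAS
mount /dev/mapper/cr_NAS /mnt           # btrfs lives inside the LUKS container and sees one device
lsblk -o NAME,TYPE,FSTYPE,SIZE          # shows the disk -> raid1 -> crypt -> btrfs layering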

In a RAID1 array, all drives should have identical content (at least the blocks that have been written). If the content differs, that is already a strong indication of a hardware or software problem.

Your drive may have decided to wipe out some blocks, e.g. on power failure. Or to silently re-allocate them (losing their original content). There are many (hard|firm)ware issues besides pure “this HDD completely failed”. It is impossible to rule out hardware.

Try to compare the first blocks, where the LUKS header is located. Are they completely random? Do they show some pattern (e.g. all zeroes)?
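
For example, something along these lines:

# dump the first 4 KiB of every member and compare by eye
for d in /dev/sda /dev/sdb /dev/sdc; do echo "== $d =="; dd if=$d bs=4096 count=1 2>/dev/null | hexdump -C | head -n 20; done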

You may try running mdadm --action=check to estimate how much the drives’ content differs.
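
For example (md127 here stands for whatever md minor your array actually got):

mdadm --action=check /dev/md/NAS          # starts a read-and-compare pass over the array
cat /sys/block/md127/md/mismatch_cnt      # after it finishes, the number of mismatched sectors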

@arvidjaar ,

It appears that he already put the drives/partitions into a new md device, using open luks containers as the elements. If that device was started, mdadm would have resynced. To make the setup usable, I think he needs to restore the original configuration at a minimum.

Yes, it is possible to mount the filesystem. Access to random files works flawlessly.

I used the YaST Partitioner with encryption enabled. I did not know what it did in the background.

I meant: if one of the drives in my RAID fails, will I still be able to recover the LUKS header? Before making any changes to the RAID configuration with mdadm, I created overlays of the drives. For example, /dev/sda is now accessed as /dev/mapper/sda.

As said above, the overlays were created before I touched the RAID configuration. I did not create the new mdadm RAID on top of the LUKS encryption layer.

By “overlay”, do you mean you used “cryptsetup open” on each block device?

No, I created overlays to prevent any actual write operations from reaching the disks while still allowing writes to the block devices.

DEVICES="/dev/sdc /dev/sda /dev/sdb"
parallel 'test -e /dev/loop{#} || mknod -m 660 /dev/loop{#} b 7 {#}' ::: $DEVICES
parallel truncate -s17000G overlay-{/} ::: $DEVICES
parallel 'size=$(blockdev --getsize {}); loop=$(losetup -f --show -- overlay-{/}); echo 0 $size snapshot {} $loop P 8 | dmsetup create {/}' ::: $DEVICES

dmsetup status

sda: 0 31251759104 snapshot 24/35651584000 16
sdb: 0 35156656128 snapshot 24/35651584000 16
sdc: 0 35156656128 snapshot 24/35651584000 16
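
When I am done testing, the overlays can be removed again without the real disks ever having been written to (loop numbers are whatever losetup assigned):

for d in sda sdb sdc; do dmsetup remove $d; done
losetup -D                               # detaches all loop devices; use losetup -d /dev/loopN to be selective
rm overlay-sda overlay-sdb overlay-sdc   # the sparse files holding the diverted writes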

I must admit I do not understand the “overlay” well, but it certainly looks like a cool trick :slight_smile: .

Perhaps it is worth asking what you would really like to accomplish. If you have a good backup of the actual data, is it worth trying to restore the corrupted btrfs? I don’t know.

As I wrote earlier, I ran btrfs on a mdraid level 1 device four or five years ago and found myself with corrupted btrfs several times. At the time, I read somewhere (I wish I could remember where) that corruption was the expected result.

I did spend some time with google yesterday and failed to find any decent documentation about running btrfs on mdraid. The closest I found were a few anecdotal forum posts by folks claiming it works. I could not find even a mention of mdraid or mdadm in the btrfs documentation.

Unfortunately, that is the most help I can provide.

Best of luck!


If you mean RAID1 - yes, you should. As I already said, every disk in RAID1 should have the same content. Of course, if the content of some drive has changed below the kernel, the kernel is not aware of it. Because the kernel thinks the content of each drive is correct, it has no reason to retry the read from another drive.

In the worst case you still have a good copy, as you already found, even if you need to use it manually. So in the end RAID1 paid off :slight_smile:

In the current situation you will need to force a resync of the RAID. You need to decide which drive has the correct data and tell Linux MD to sync the other drives to it.
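
Very roughly, and only once you are certain which member holds the good copy (here, purely as an illustration, /dev/sdb) - the re-added members get overwritten, so double-check every step before running it:

mdadm --stop /dev/md/NAS                        # stop the array
mdadm --assemble --run /dev/md/NAS /dev/sdb     # start it degraded with only the good member
mdadm --zero-superblock /dev/sda /dev/sdc       # the stale members will be rebuilt from scratch
mdadm /dev/md/NAS --add /dev/sda /dev/sdc       # MD resyncs them from the good copy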

Good. So, you apparently have just one file with corruption. You can simply delete it.

Before doing that, it makes sense to also run a scrub on this filesystem.
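
For example (the mount point is just an example):

btrfs scrub start -B /mnt                        # -B waits in the foreground and prints a summary
btrfs inspect-internal inode-resolve 17447 /mnt  # maps the inode from btrfs check to a path; run it against the subvolume/snapshot in question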

First, I checked the LUKS header on each device:

cryptsetup luksDump /dev/sda

LUKS keyslot 4 is invalid.
Device /dev/sda is not a valid LUKS device.

cryptsetup luksDump /dev/sdb

LUKS header information for /dev/sdb

Version: 1
Cipher name: aes
Cipher mode: xts-plain64
Hash spec: sha256
Payload offset: 4096
MK bits: 512
MK digest: { removed }
MK salt: { removed }
MK iterations: 246375
UUID: 7915f86e-09bb-4490-bcaa-b6d4de666f85

Key Slot 0: ENABLED
Iterations: 3927250
Salt: { removed }
Key material offset: 8
AF stripes: 4000
Key Slot 1: ENABLED
Iterations: 4177592
Salt: { removed }
Key material offset: 512
AF stripes: 4000
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED

cryptsetup luksDump /dev/sdc

LUKS keyslot 4 is invalid.
Device /dev/sdc is not a valid LUKS device.

Then I compared the complete LUKS headers across all devices (offsets 0-4096). It turns out that all the headers are identical except for a difference at offsets 440-443, as shown in the following pictures. I couldn’t find any information on what LUKS1 stores at that offset. It looks intentional to me.
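
The comparison itself was done roughly like this (reconstructed from memory):

cmp -l -n 4096 /dev/sdb /dev/sda                                 # lists every differing byte in the first 4096 bytes
dd if=/dev/sda bs=1 skip=432 count=32 2>/dev/null | hexdump -C   # hex view around offset 440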

You completely ignored the request to use preformatted text. As for your images - personally, I cannot read them, I cannot copy from them, and I cannot convert them to binary to try to do something with them. Hopefully someone else finds them useful.

Anyway - something modified the content of these drives directly. For the reasons outlined above I cannot even guess what it could be, so I leave that to someone else who can.

Sorry for the inconvenience, but I’m having trouble adjusting the format. Unfortunately it’s not possible to upload more than one picture or higher-resolution images, but you should be able to download the current picture. If you use the download button after selecting the image, you will get the highest resolution.

If you want to read more about the overlay I used:

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

I also had to install GNU Parallel on SUSE Linux. There is another package named parallel available for SUSE, but unfortunately it does not work for this.

For more information about GNU Parallel:

https://www.gnu.org/software/parallel/

Hello @forrest96er ,

Thanks for the links. This definitely interests me as I do use luks2 on mdraid.

Cheers!