Status of BTRFS

Hello,

I wanted to ask the community what the current status of BTRFS is. I used it briefly on LEAP 42.1, which ended in disaster in combination with Meltdown and Spectre. I never truly gave BTRFS a fair chance before writing it off as “part of the problem” in LEAP 42.1.

I started using OpenSUSE again with LEAP 42.3 and I have been tying myself to ext4 ever since; I am still using ext4 on my LEAP 15.1 because I really do not wish to take a chance with my data. Since I am only using the distro on personal programming/work laptops with no encryption and so on, I really do not see the point of the extra features that BTRFS would bring me yet.

In addition, something that worries me is that I still see frequent complaints about BTRFS, including a post about how to fix btrfs maintenance and another about the system freezing. For a filesystem on a personal unencrypted device, how stable and mature would you say BTRFS is at the moment?

I may give it a chance again with LEAP 15.2, but since BTRFS gives me little advantage personally, I may wait a few more years.

In my opinion the end user should never ever have to think about the FS at all - it should just work without any intervention. The fact that btrfs forces the user to do manual maintenance under any circumstances already shows it to be a failure.

IMO: It’s just plain horror. I stopped using it everywhere and now only use ext4 or XFS. That being said, 15.1 was a great release - works like a charm - so I assume 15.2 with ext4/xfs will do exactly the same.

I am still using “ext4”. I seem to recall that there were some “btrfs” bugs back in 42.1, which are probably solved by now.

I am using “btrfs” in a virtual machine testing Leap 15.2 Beta. I have not had any issues. I previously did that when testing Leap 15.1 Beta. If I go by how much space is used, as shown by “df” (run just after a large update), then it is doing better now than it was with 15.1. But, for now, I’ll stay with “ext4” on my real machines.
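One caveat with going by “df”: plain df can be misleading on btrfs because of snapshots and chunk allocation, so it is usually clearer to ask btrfs itself. A minimal sketch (the mount point is just an example):

# allocation split by data/metadata/system, plus an estimate of free space
btrfs filesystem usage /
# shorter per-profile summary
btrfs filesystem df /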

Honestly, all the posts in these very forums and the FAQs from the BTRFS wiki (https://btrfs.wiki.kernel.org/index.php/FAQ#I_have_a_problem_with_my_btrfs_filesystem.21 and https://btrfs.wiki.kernel.org/index.php/Problem_FAQ) don’t really raise my confidence in BTRFS.

Is there a particular reason why BTRFS is currently a default FS for Leap?

Hi
Your choice :wink: Because the openSUSE release maintainers have selected it as the default…

My experience has been the opposite here for a number of years, with all my systems running btrfs. Some run snapper, some don’t, mainly because I have had no issues that needed rolling back (even here on Tumbleweed). All data is on xfs, and all filesystems use the blk-mq I/O scheduler.

I’ve used btrfs since moving to TW in 2018. I have never had an unrecoverable issue with btrfs, and the btrfs implementation improvements in 2019 smoothed out the few issues I did have. Mind you, I haven’t tried to use a hard drive smaller than 1TB for quite some time, and most problems reported are for small hard drives/partitions. My last two installations have used the entire drive for the operating system and /home, and worked flawlessly.

The primary (perhaps only) advantage of using btrfs, in my opinion, is the ability to quickly roll back snapshots - not so much for distribution updates gone wrong (because that hasn’t happened in a long time), but after installing non-openSUSE software that goes wrong.
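For reference, the rollback itself is short on a default openSUSE/snapper setup. A sketch rather than a recipe - the snapshot number is only an example, and you would normally boot the desired read-only snapshot from the GRUB menu first:

# see which snapshots exist and what created them
snapper list
# make the chosen (or currently booted) snapshot the new default root
snapper rollback 1324
reboot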

In any case, if you make usage decisions based on posts to the forum, you probably won’t use any software that has a support forum :). Maybe go with the open-source MS-DOS that was released (pre-internet, so no problems).

What are the issues you were dealing with in 2018 that have been addressed?

Personally, I haven’t dealt with a problem that would benefit from snapshots, since I clone stable systems to external drives - not to mention that we’re talking about rolling back ~100s of GBs. I’m perfectly fine using software under development as long as it won’t jeopardize my data, which BTRFS in Leap 42.1 did.

I guess what I am trying to find out is whether BTRFS is ready to handle and store data as reliably as ext4 on RAID0 with no encryption, for instance. If I understood you correctly, you’re using BTRFS for root and system functions but xfs for home and data?

Hi
Correct, and that is what btrfs is for. It is not for data - e.g. databases, web servers, etc.; that has always been, and should always be, the domain of the likes of xfs. Btrfs is just for the operating system. You can use it for the likes of /home to create user snapshots, but I have never used it for that. I do run my $HOME on btrfs, but it’s not used to store any data, just some config stuff; the rest is soft links to data on xfs.
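For what it’s worth, a sketch of that kind of layout - btrfs for the operating system, xfs for bulk data, and soft links from $HOME - using purely hypothetical device names and paths:

# /etc/fstab: root on btrfs, bulk data on a separate xfs partition
# /dev/nvme0n1p1  /      btrfs  defaults  0  0
# /dev/nvme0n1p2  /data  xfs    defaults  0  0

# keep only configuration in $HOME; point the large directories at the xfs partition
mkdir -p /data/$USER/Documents
ln -s /data/$USER/Documents ~/Documents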

I see. Honestly, maybe when I switch to LEAP 15.2 I will give it a BTRFS root and keep ext4 for home. I personally just got a terrible “taste” of btrfs when I used to have a single partition for root, home, data, etc. On LEAP 42.1 I was dealing with a bug in BTRFS and it completely hindered my ability to work at a critical time. I didn’t bother trying to fix it and just went back to OpenSUSE 13.2 until LEAP 42.3.

Regarding snapshots: in my case I typically have my system on a ~100GB partition on an SSD and I clone it to a separate drive, which I find more suitable for backup.

There are, however, reports of BTRFS being slower and less space efficient (in net storage usage due to snapshots, though not for individual small files). Is that still true? Is the general consensus that BTRFS is stable and trusted for an OS, at least?

Hi
On this desktop (Tumbleweed, no nvme boot support) I have:


lsblk

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    0 232.9G  0 disk 
├─sda1        8:1    0   260M  0 part /boot/efi
├─sda2        8:2    0   768M  0 part /boot <<== btrfs
├─sda3        8:3    0   230G  0 part /stuff <<== xfs
└─sda4        8:4    0   1.9G  0 part [SWAP]
nvme0n1     259:0    0 232.9G  0 disk 
├─nvme0n1p1 259:1    0    40G  0 part / <<== btrfs
└─nvme0n1p2 259:2    0 192.9G  0 part /data <<== xfs

You asked about the current status of BTRFS, so what about this very recent message in the mailing list?

From: Zygo Blaxell
To: Justin Engwer
Cc: linux-btrfs@vger.kernel.org
Subject: Re: I think he’s dead, Jim
Date: Tue, 19 May 2020 21:32:55 -0400
Message-ID: <20200520013255.GD10769@hungrycats.org> (raw)
In-Reply-To: <CAGAeKuuvqGsJaZr_JWBYk3uhQoJz+07+Sgo_YVrwL9C_UF=cfA@mail.gmail.com>

On Mon, May 18, 2020 at 01:51:03PM -0700, Justin Engwer wrote:
> Hi,
>
> I’m hoping to get some (or all) data back from what I can only assume
> is the dreaded write hole. I did a fairly lengthy post on reddit that

Write hole is a popular scapegoat; however, write hole is far down the
list of the most common ways that a btrfs dies. The top 6 are:

  1. Firmware bugs (specifically, write ordering failure in lower storage
    layers). If you have a drive with bad firmware, turn off write caching
    (or, if you don’t have a test rig to verify firmware behavior, just turn
    off write caching for all drives). Also please post your drive models and
    firmware revisions so we can correlate them with other failure reports.
  2. btrfs kernel bugs. See list below.
  3. Other (non-btrfs) kernel bugs. In theory any UAF bug can kill
    a btrfs. In 5.2 btrfs added run-time checks for this, and will force
    the filesystem read-only instead of writing obviously broken metadata
    to disk.
  4. Non-disk hardware failure (bad RAM, power supply, cables, SATA
    bridge, etc). These can be hard to diagnose. Sometimes the only way to
    know for sure is to swap the hardware one piece at a time to a different
    machine and test to see if the failure happens again.
  5. Isolation failure, e.g. one of your drives shorts out its motor as
    it fails, and causes other drives sharing the same power supply rail to
    fail at the same time. Or two drives share a SATA bridge chip and the
    bridge chip fails, causing an unrecoverable multi-device failure in btrfs.
  6. raid5/6 write hole, if somehow your filesystem survives the above.

A quick map of btrfs raid5/6 kernel bugs:

2.6 to 3.4:    don't use btrfs on these kernels
3.5 to 3.8:    don't use raid5 or raid6 because it doesn't exist
3.9 to 3.18:   don't use raid5 or raid6 because parity repair code not present
3.19 to 4.4:   don't use raid5 or raid6 because space_cache=v2 does not exist
               yet and parity repair code badly broken
4.5 to 4.15:   don't use raid5 or raid6 because parity repair code badly broken
4.16 to 5.0:   use raid5 data + raid1 metadata.  Use only with space_cache=v2.
               Don't use raid6 because raid1c3 does not exist yet.
5.1:           don't use btrfs on this kernel because of metadata corruption bugs
5.2 to 5.3:    don't use btrfs on these kernels because of metadata corruption
               bugs partially contained by runtime corrupt metadata checking
5.4:           use raid5 data + raid1 metadata.  Use only with space_cache=v2.
               Don't use raid6 because raid1c3 does not exist yet.  Don't use
               kernels 5.4.0 to 5.4.13 with btrfs because they still have the
               metadata corruption bug.
5.5 to 5.7:    use raid5 data + raid1 metadata, or raid6 data + raid1c3
               metadata.  Use only with space_cache=v2.

On current kernels there are still some leftover issues:

- btrfs sometimes corrupts parity if there is corrupted data
already present on one of the disks while a write is performed
to other data blocks in the same raid stripe.  Note that if a
disk goes offline temporarily for any reason, any writes that
it missed will appear to be corrupted data on the disk when it
returns to the array, so the impact of this bug can be surprising.
- there is some risk of data loss due to write hole, which has an
effect very similar to the above btrfs bug; however, the btrfs
bug can only occur when all disks are online, and the write hole
bug can only occur when some disks are offline.
- scrub can detect parity corruption but cannot map the corrupted
block to the correct drive in some cases, so the error statistics
can be wildly inaccurate when there is data corruption on the
disks (i.e. error counts will be distributed randomly across
all disks).  This cannot be fixed with the current on-disk format.

Never use raid5 or raid6 for metadata because the write hole and parity
corruption bugs still present in current kernels will race to see which
gets to destroy the filesystem first.

Corollary: Never use space_cache=v1 with raid5 or raid6 data.
space_cache=v1 puts some metadata (free space cache) in data block
groups, so it violates the “never use raid5 or raid6 for metadata” rule.
space_cache=v2 eliminates this problem by storing the free space tree
in metadata block groups.

> you can find here:
> BTRFS died last night. Pulling out hair all day. Need a hand. : btrfs
>
> TLDR; BTRFS keeps dropping to RO. System sometimes completely locks up
> and needs to be hard powered off because of read activity on BTRFS.
> See reddit link for actual errors.

You were lucky to have a filesystem with raid6 metadata and presumably
space_cache=v1 survive this long.

It looks like you were in the middle of trying to delete something, i.e.
a snapshot or file was deleted before the last crash. The metadata
is corrupted, so the next time you mount, it detects the corruption
and aborts. This repeats on the next mount because btrfs can’t modify
anything.

My guess is you hit a firmware bug first, and then the other errors
followed, but at this point it’s hard to tell which came first. It looks
like this wasn’t detected until much later, and recovery gets harder
the longer the initial error is uncorrected.

> I’m really not super familiar, or at all familiar, with BTRFS or the
> recovery of it.
> –
>
> Justin Engwer

I’d like to comment on others’ comments, but first let me clarify one thing with you.

You do have a clone of the filesystem (both system and data, I guess?). Why are you so concerned, to the point of choosing any other filesystem? And do you know how rollbacks work? Why do you think you’d need to roll back ~100s of GBs?

For practically everything in that email message, you can find an answer in the official BTRFS documentation.

I’ve listed what I consider the most important parts of the BTRFS documentation on my personal Wiki:

https://en.opensuse.org/User:Tsu2/systemd-1#BTRFS

Most of that email focuses on a single bug (that’s a lot of print dedicated to a single item).
It is in my links, but a summary of it is…
Yes, the RAID parity corruption bug still exists today, but it has shown up only in “large” RAID arrays (RAID5).
If you’re a really big Enterprise user, then you need to be aware of it.
If you’re a home or SOHO user and your RAID array is rarely more than 3 or 4 disks, then it’s not something that should happen to you… But as long as that bug exists, I assume that is why RHEL won’t support BTRFS in any situation.
You should also know, though, that BTRFS very recently did a “soft release” of its own new RAID implementation, which instead of using parity deploys a kind of “extended mirroring.” So that is what BTRFS is now proposing for larger arrays: just do it without parity, and BTRFS provides the tools to set up, deploy and maintain it. It’s very new, so although it looks like it should be problem-free, you can’t know for sure about any technology until it has been used extensively.
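If I understand it correctly, this refers to the raid1c3/raid1c4 profiles mentioned in the mailing list post above (kernel and btrfs-progs 5.5 or newer). A sketch with example device names only:

# mirror every block on three devices instead of using parity
mkfs.btrfs -m raid1c3 -d raid1c3 /dev/sdb /dev/sdc /dev/sdd

# or convert an existing multi-device filesystem in place
btrfs balance start -mconvert=raid1c3 -dconvert=raid1c3 /mnt/array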

I don’t use openSUSE the way most people do, so I won’t claim that my personal experience applies to everyone,
but I will note that there are very few BTRFS-related issues posted today compared to 3 years ago.
Although IMO the default snapshot retention policy still needs to be fixed or modified properly once and for all, it has been improved, so people don’t seem to have problems with it as often as before. But don’t think this is a unique BTRFS problem: I recently ran into a similar situation on an Arch system where whoever originally built the system didn’t install Timeshift. Although built on ext4, after running fine for about 3 years the system became unbootable, and without a working backup or snapshot recovery the owner is SOL and will have to recover the data and rebuild the system (all rolling releases should have snapshot recovery).

The only more general openSUSE problem I’ve run into (my systems are all on BTRFS) relates to the fact that I sometimes build a system (VM) for a specific purpose and it may sit unused for a very long time. If a system hasn’t been updated for over 9 months, I find it can become unbootable after attempting to update. Since these machines are built for a specific purpose, it’s usually less of a bother for me to simply build a new machine than to try to determine what broke. But I doubt most people use their machines this way… leaving a machine untouched for so long.

Note that if you use BTRFS as it’s installed and configured by default on openSUSE, you’ll be set up with a number of benefits you won’t have by default on any other filesystem, although you can certainly build these features in on your own:

  • Automatic snapshots and rollback/rollforward.
  • Not just whole-partition rollback/rollforward - you can restore individual files.
  • A journaling filesystem: this means that, although it’s not guaranteed, if your system suffers a sudden shutdown in the middle of a file operation you have a chance of recovery - not guaranteed, but a better chance than with other filesystems.
  • Automatic self-fixing. It’s hard to tell how well this works compared to other filesystems, but only a couple have this feature (see the scrub sketch just below).
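To illustrate that last point: the usual way to trigger btrfs’s checksum-based self-checking by hand is a scrub. This is plain btrfs-progs usage; the mount point is just an example:

# read every block, verify checksums, repair from a good copy where one exists
btrfs scrub start /
# check progress and any errors found
btrfs scrub status /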

And more (see the Wiki links I provide).

HTH,
TSU

You have hardware and firmware that are prone to failure, right? How can filesystems cope with that? Pick one and see how it does. (You mention programming, so think like a programmer.)

The btrfs maintenance issue was, I find, a bit of a lack of interest/manpower to do what needs to be done. The issue is by design of systemd, whose maintainers don’t seem to care about it, and is triggered by misconfiguration of the btrfs systemd service. Regarding the system freeze, if you’re referring to the recent post (which I attempted to help with), it’s too soon to lay blame. On HDDs, though, I know the scrub routine may cause the system to freeze. Btrfs has matured a lot, but there are still areas to improve; even kernel I/O scheduling, one may say, is not mature enough.
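If the scrub schedule on an HDD is the pain point, the periods live in /etc/sysconfig/btrfsmaintenance (the same file referenced by the refresh service quoted later in this thread). A sketch; the variable names are as found in the copies I’ve seen, so check your own file:

# show the configured periods for balance/scrub/trim/defrag
grep PERIOD /etc/sysconfig/btrfsmaintenance

# e.g. scrub monthly instead of weekly (or "none" to disable), then refresh the timers
sed -i 's/^BTRFS_SCRUB_PERIOD=.*/BTRFS_SCRUB_PERIOD="monthly"/' /etc/sysconfig/btrfsmaintenance
/usr/share/btrfsmaintenance/btrfsmaintenance-refresh-cron.sh systemd-timer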

Would you mind explaining which manual maintenance you are talking about? Which commands or buttons does the user have to push?

For home users, I find btrfs great for /home, and it can also be used for VMs and databases provided copy-on-write is disabled, which is the case for /var on new installations.
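Disabling copy-on-write is done with the C file attribute; it only takes effect for files created after the attribute is set, so it is normally put on an empty directory. Paths are examples only:

# new files created below this directory will be NOCOW (no CoW, no checksums)
mkdir -p /var/lib/libvirt/images
chattr +C /var/lib/libvirt/images
lsattr -d /var/lib/libvirt/images    # should now show the 'C' attribute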

Snapshots are not backups, since they share the same extents on disk (until data is changed on the live filesystem). They can provide older versions of files that were deleted/replaced by mistake, but they can’t help in case of a hardware failure. Snapshots do, however, provide a great backup mechanism with the send/receive feature. Learn more in the wiki: Incremental Backup - btrfs Wiki
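A minimal sketch of such an incremental send/receive, assuming /home is a btrfs subvolume, /mnt/backup is a mounted btrfs filesystem on another drive, and the “old” snapshot already exists on both sides (paths are examples; the wiki page has the full procedure):

# send requires a read-only snapshot
btrfs subvolume snapshot -r /home /home/.snapshots/home-new
sync
# with -p only the difference from the previous snapshot is transferred
btrfs send -p /home/.snapshots/home-old /home/.snapshots/home-new | btrfs receive /mnt/backup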

Btrfs is performant enough, and you can control how many snapshots are kept by snapper.
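For example, the limits are per-config snapper settings; a sketch for the root config, with example values:

# show the current settings
snapper -c root get-config
# keep at most 5 regular and 4 "important" snapshots for number cleanup
snapper -c root set-config "NUMBER_LIMIT=5" "NUMBER_LIMIT_IMPORTANT=4"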

Also, since you mention the RAID0 use case, you may find this technical report on RAID5 interesting; its author is not even pro-btrfs (pro-ZFS instead): Battle testing ZFS, Btrfs and mdadm+dm-integrity

I could go on… but I feel this is good for a day. Good luck!

Search these forums for BTRFS issues and you’ll find plenty, especially related to smaller drives/partitions - we get these almost on a daily basis in IRC as well.

There is no excuse for the default filesystem in an OS to end up in an unbootable state just because it has run out of space without automatic cleanup, leaving the user to resort to rescue media to fix it. There’s a reason why most of the desktop distributions have completely dropped btrfs, and SUSE should do exactly the same when the user chooses a desktop installation. More advanced users should be able to choose it, but having it as the default is pure insanity that I just cannot comprehend.

In comparison, I can’t remember the last time I had to manually do any sort of maintenance on ext4 or XFS.

Honestly, I sense a somewhat passive-aggressive overtone in your post. For instance, I mentioned nothing about hardware or firmware failure. I have had an SSD’s processor die due to a bad filesystem protocol (caused by a dynamic MBR disk using NTFS partitions). I am simply referring to data loss due to a bug in the filesystem. On top of potential hardware and firmware failure, the last thing I want is to introduce an unstable FS that causes data loss, and I am wondering whether BTRFS is generally considered reliable at the moment.

I understand snapshots are not backups of the system, which is why I am wondering whether they “waste” space on the drive, especially since I have limited storage space.

I am not sure you understood the purpose of mentioning RAID0. The point is that I use a system with no redundancy for the sake of storage efficiency, which has nothing to do with RAID5.

I know about them (not from IRC, but from this forum, since I started visiting) and I have addressed this already in this thread.

If you want the default reverted, you can make a case on Bugzilla. Software needs to be used to be improved, and in fact btrfs has improved a lot in the last few years from what I gather, possibly because it’s the default filesystem on oS.

I’m sorry if you felt that way. I’m glad I didn’t post the sarcastic version though :wink:

As for snapshots, they play a role in:

  1. System reliability: a snapshot stores the previous version of data, so for a system update/upgrade it stores at most the amount of data that was overwritten/deleted. Successive snapshots add up, but not many snapshots need to be kept, and the limits are configurable with snapper. They exist so you can roll back to a previous version, however rarely that is needed.
  2. (Incremental) backups: to send data to a secondary medium, one read-only snapshot must be made. A previous snapshot, if kept around, reduces the payload to the amount that differs between snapshots. These don’t exist to be rolled back, but to synchronize drives (as backups).

If you don’t have a use for them (because you do full partition clones), then you can disable them. I hope that clears up the “rolling back ~100s of GBs”.

I brought up the report on RAID5 not to explain the space-efficiency aspect of a filesystem, but to add to the “Status of BTRFS” as per the thread subject, since RAID5 is a superset of RAID0 and you mentioned “as reliable as ext4”.

Btrfs is the result of an ongoing effort to implement a modern Linux-native filesystem. XFS and ext4 have been around for longer, and have been production-ready for longer. The recency of bugs in the filesystem may discourage you from trying it again. It is, subjectively, better than the others right now, and I have focused on its technical merits. TW would be tougher without it. And if it works for TW, why can’t it work for Leap?

openSUSE developers* are extremely thick-headed when it comes to decisions like this. You can have 10,000 users complain about it and their answer will be “You are wrong, we are right, and it’s not going to change because we make the decisions, not you”. If I had the time and interest, I would dig up the openSUSE mailing list posts where this stuff was thrown around; essentially the bottom line was that users of oS are test dummies who are thrown under a bus to see what works and what doesn’t - if you don’t like it, use another distribution.

*In fact, the majority of open-source developers are like this. They live in their little bubbles where they agree on things, and if they find that someone doesn’t share their views, they are extremely hostile towards them.

Automatic cleanup works well here:


erlangen:~ # btrfs filesystem usage -T /
Overall:
    Device size:                  59.45GiB
    Device allocated:             17.03GiB
    Device unallocated:           42.42GiB
    Device missing:                  0.00B
    Used:                         14.81GiB
    Free (estimated):             43.62GiB      (min: 43.62GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:               43.30MiB      (used: 0.00B)
    Multiple profiles:                  no

             Data     Metadata System              
Id Path      single   single   single   Unallocated
-- --------- -------- -------- -------- -----------
 1 /dev/sdb5 15.00GiB  2.00GiB 32.00MiB    42.42GiB
-- --------- -------- -------- -------- -----------
   Total     15.00GiB  2.00GiB 32.00MiB    42.42GiB
   Used      13.81GiB  1.01GiB 16.00KiB            

erlangen:~ # snapper list
    # | Type   | Pre # | Date                     | User | Used Space | Cleanup | Description              | Userdata     
------+--------+-------+--------------------------+------+------------+---------+--------------------------+--------------
   0  | single |       |                          | root |            |         | current                  |              
1318  | single |       | Sat May 16 08:36:22 2020 | root |  67.56 MiB | number  | rollback backup of #1279 | important=yes
1319* | single |       | Sat May 16 08:36:22 2020 | root |  83.74 MiB |         | writable copy of #1279   |              
1320  | pre    |       | Sat May 16 13:13:57 2020 | root |  28.77 MiB | number  | zypp(zypper)             | important=yes
1321  | post   |  1320 | Sat May 16 13:18:54 2020 | root | 464.00 KiB | number  |                          | important=yes
1322  | pre    |       | Sat May 16 13:19:21 2020 | root | 272.00 KiB | number  | yast bootloader          |              
1323  | post   |  1322 | Sat May 16 13:19:44 2020 | root | 432.00 KiB | number  |                          |              
1324  | pre    |       | Sun May 17 07:25:01 2020 | root |   6.64 MiB | number  | zypp(zypper)             | important=yes
1325  | post   |  1324 | Sun May 17 07:27:45 2020 | root |  14.00 MiB | number  |                          | important=yes
1326  | pre    |       | Mon May 18 06:56:09 2020 | root |   8.98 MiB | number  | zypp(zypper)             | important=no 
1327  | post   |  1326 | Mon May 18 06:57:08 2020 | root |   7.22 MiB | number  |                          | important=no 
1336  | pre    |       | Mon May 18 09:48:10 2020 | root |  19.16 MiB | number  | yast sysconfig           |              
1337  | pre    |       | Wed May 20 00:12:33 2020 | root |  14.11 MiB | number  | zypp(zypper)             | important=no 
1338  | post   |  1337 | Wed May 20 00:16:23 2020 | root |   6.30 MiB | number  |                          | important=no 
1339  | pre    |       | Wed May 20 17:36:08 2020 | root |   2.55 MiB | number  | zypp(zypper)             | important=no 
1340  | post   |  1339 | Wed May 20 17:36:14 2020 | root |   1.64 MiB | number  |                          | important=no 
1341  | pre    |       | Thu May 21 05:31:56 2020 | root |   3.00 MiB | number  | zypp(zypper)             | important=no 
1342  | post   |  1341 | Thu May 21 05:32:19 2020 | root |   2.83 MiB | number  |                          | important=no 
1347  | pre    |       | Thu May 21 09:13:47 2020 | root | 576.00 KiB | number  | yast sysconfig           |              
1348  | post   |  1347 | Thu May 21 09:14:27 2020 | root |  16.00 KiB | number  |                          |              
1349  | pre    |       | Thu May 21 09:14:29 2020 | root |  16.00 KiB | number  | yast sysconfig           |              
1350  | post   |  1349 | Thu May 21 09:15:11 2020 | root | 112.00 KiB | number  |                          |              
1351  | pre    |       | Sat May 23 06:47:12 2020 | root |  15.67 MiB | number  | yast lan                 |              
1352  | pre    |       | Sat May 23 08:01:22 2020 | root | 736.00 KiB | number  | zypp(zypper)             | important=yes
1353  | post   |  1352 | Sat May 23 08:01:28 2020 | root | 672.00 KiB | number  |                          | important=yes
1354  | pre    |       | Sun May 24 06:00:55 2020 | root |  14.84 MiB | number  | zypp(zypper)             | important=no 
1355  | post   |  1354 | Sun May 24 06:04:21 2020 | root |  10.52 MiB | number  |                          | important=no 
1356  | pre    |       | Sun May 24 12:45:32 2020 | root |   1.08 MiB | number  | zypp(zypper)             | important=yes
1357  | post   |  1356 | Sun May 24 12:45:38 2020 | root |   1.38 MiB | number  |                          | important=yes
erlangen:~ # 

A minor bug has been fixed and will be available in Tumbleweed soon: https://github.com/kdave/btrfsmaintenance/pull/81. Add the following service and enable btrfsmaintenance-refresh.path:

erlangen:~ # systemctl cat btrfsmaintenance-refresh.service 
# /etc/systemd/system/btrfsmaintenance-refresh.service
[Unit]
Description=Update cron periods from /etc/sysconfig/btrfsmaintenance

[Service]
ExecStart=/usr/share/btrfsmaintenance/btrfsmaintenance-refresh-cron.sh systemd-timer
Type=oneshot
erlangen:~ #
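
With the unit file in place, enabling the path unit is standard systemctl usage:

systemctl enable --now btrfsmaintenance-refresh.path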