The dangers of using btrfs and snapper

openSUSE is my favorite Linux distribution because it’s modern, user friendly, and has a lot of good ideas in comparison to other distros. This is probably the first time I have to openly criticize a decision taken by the team, which I now believe puts users at risk without many even knowing it and should be urgently reviewed. I’d like to start with a bit of background on what exactly happened so people can understand my point and the reasons behind it.

A few days ago I bought my first SSD, on which I did a clean reinstall of openSUSE for the first time in 6 years. Previously I had the OS on my old mechanical drive on an ext4 partition, which has always worked perfectly. Since I was doing a fresh install, I figured I’d use a modern setup and go with the partitioning scheme suggested by the installer… namely using btrfs as my filesystem and enabling snapshots (at the time I didn’t even know what those did but left the option defaulted to on). I then proceeded with the installation and everything seemed to work perfectly. Little did I know I had a hell waiting to unleash inside my computer pretty soon.

Within the following hours I suddenly found that my machine started freezing. I could tell it was not a GPU lockup but something slowing it down so much that even the NumLock / CapsLock LEDs couldn’t be toggled for minutes at a time (only the mouse cursor could be moved). I didn’t understand what was happening at all and thought my new drive must be having hardware issues. Then I managed to go into another runlevel and run “top” to see what was eating my system resources. I found a few processes to be responsible, namely btrfs-cleaner / btrfs-transacti / snapperd. The next scary part was discovering that they couldn’t be killed and were forced onto my system (“kill -9 PID” was ineffective), so I had to wait for over an hour before it stopped on its own. I also saw contradictory information: top showed it was using 100% CPU but KSysGuard said it was only using 8%. I expressed concern about this in another thread and was told that’s usually a one-time event and should cease once the first “zypper dup” is cached by snapper.
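
For anyone diagnosing something similar: the btrfs cleaner and transaction workers are kernel threads rather than ordinary programs, which would explain why “kill -9” does nothing to them. Below is a minimal Python sketch of how one might tell the two kinds apart, assuming the usual Linux /proc/<pid>/stat layout. The function name describe_stat_line and the two sample lines are made up for illustration; on a real system you would read the line from /proc/<pid>/stat instead.

```python
PF_KTHREAD = 0x00200000  # kernel-thread flag bit as defined in the Linux kernel headers

def describe_stat_line(stat_line):
    """Parse one /proc/<pid>/stat line into (pid, name, state, is_kernel_thread)."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split around the first "(" and the last ")".
    head, _, rest = stat_line.partition("(")
    name, _, tail = rest.rpartition(")")
    fields = tail.split()
    state = fields[0]        # e.g. R = running, S = sleeping, D = uninterruptible sleep
    flags = int(fields[6])   # ninth field of the full stat line
    return int(head), name, state, bool(flags & PF_KTHREAD)

# Made-up sample lines, shortened to the fields used above.
samples = [
    "312 (btrfs-cleaner) D 2 0 0 0 -1 2129984 0 0",
    "1450 (snapperd) S 1 1450 1450 0 -1 4194304 0 0",
]

for line in samples:
    pid, name, state, kthread = describe_stat_line(line)
    print(pid, name, state, "kernel thread" if kthread else "user process")
```

A process reported as a user process (snapperd here) can normally be stopped or killed; a kernel thread has to finish its work on its own.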

That seemed to be the case, as an entire day went by without any problems. Then all of a sudden it started happening again last night, despite me not even making any new system changes to prompt it: Snapper and associated btrfs processes froze my system to the point where I had to wait 5 minutes to even open a small program, and every click caused the machine to freeze completely for an entire minute. I tried watching a YouTube video hoping it would pass, which was itself difficult as the playback froze and sometimes websites stopped rendering entirely… 4 hours went by and it didn’t finish. When I looked in YaST - Snapper I saw it had created a dozen snapshots in a matter of minutes, however all of them were empty and showed no changes even after they were marked as “pre & post”.

Eventually I decided to restart my computer. I issued the reboot command and waited for several minutes. When I saw that there was still no response and I was now stuck in the shutdown splash screen, I pressed the reset button to speed up the process. Big mistake: That act alone corrupted my openSUSE installation and rendered it unusable. I was never able to boot it again from this point on: I remained stuck at the splash screen, without even being able to switch to a different runlevel and input commands (control + alt + fN), with the console throwing countless “dependency failed” messages for root directories plus drive timeout errors. This is a photo I took with my phone showing the errors the boot process got stuck at each time:

https://i.imgur.com/DwmI8jU.jpg

I went to bed at 8 AM trying to fix it but to no avail. I could only boot a rescue console from the installation DVD, and although I was able to mount the root partition I couldn’t mount its subvolumes thus directories like /var were inaccessible.

Today I had to do another fresh install of openSUSE. This time I selected ext4 for my partition like before and made sure snapshots were never enabled again. I finished configuring it earlier and my OS now works quickly and flawlessly as it always has. The experience I went through was stressful to say the least, as it all happened rapidly and came as a complete surprise. With this I’m hoping to offer my point to the developers and other users alike:

Users: Do NOT use btrfs and do NOT enable snapshots, unless you absolutely require them and know very well what you’re doing. This will make your life a hell and cause your installation to be destroyed eventually! I suggest sticking to ext4: I can confirm from years of experience that it’s trustworthy and reliable. Or other classic file systems that have been in development since the old days of Linux (I hear xfs and zfs are also good however I never used those).

Admins: If the team cares for the well being of users, please consider updating the installer so that it stops suggesting btrfs and system snapshots. Default back to ext4 and make snapshots an option for advanced users only. As I could confirm from my experience, snapper and btrfs are far too unstable and risky: A hard restart or a power outage at the wrong time (e.g. while snapper is working) will cause the installation to break as it did for me. Snapper will also render the system unusable to the point of freezing it for minutes at a time, and the machine may be inoperable for hours whenever it randomly decides to start working in the background (note that I have an SSD for root which is the fastest type of drive). I get the idea behind this tool which itself is pretty good, but right now both btrfs and snapper are too unstable to use and themselves pose a huge liability. I’d rather this doesn’t happen to more users before we can all agree something is wrong and a lot more work needs to be done on those tools before they are safe.

We are all users here, same as you. The Admins on the forum, the Moderators, and the other Users on this Forum in most cases have no more ability to update the installer than you do.

If you have such wishes, you must take them to the Developers, not to the users here who cannot do anything about it. :wink:

Hi
Well my experience with btrfs and the GNOME DE has been fine, for multiple years on multiple systems. AFAIK, by default snapshots are not enabled these days if btrfs is selected.

Not sure of your SSD brand…if it’s Samsung, then IMHO they suck big time (Seems a few forum users have had issues in the past from memory with this brand), some are still blacklisted in the kernel because of the file system issues, and that’s not just btrfs…

It’s an ADATA in my case, which I understand are currently the best when it comes to SSDs. I’m not sure how unique my experience is, as if it happened to everyone the same way I’m sure it would be much more widely known… but I think what I saw highlights some deep flaws with btrfs and / or snapshots, which at least for some will endanger their system. Also for me “snapshots” was enabled by default for the root partition proposed by the installer, which is why I left it on assuming the installer knows best.

On Mon 12 Nov 2018 10:16:03 PM CST, MirceaKitsune wrote:

> Also for me “snapshots” was chosen by default for the root partition
> proposed by the installer, which is why I left it on assuming the
> installer knows best.

Hi
Hmm, on my recent Tumbleweed install on a newly acquired laptop with a
40GB / partition, it was not checked. I don’t bother with snapshots
anymore as it’s been stable for me, I’m only worried about my data :wink:

I would suggest an email to ADATA support asking about support for
btrfs, xfs and ext4 filesystems and see what they say.


Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
SLES 15 | GNOME Shell 3.26.2 | 4.12.14-25.25-default
If you find this post helpful and are logged into the web interface,
please show your appreciation and click on the star below… Thanks!

That would be a little silly I think; The drive behaves as expected, there aren’t any problems with it on ext4 so far. I don’t see how it could be manufacturer specific as drives work in a generic way. Maybe the GSATA3 controller chip has its quirks?

I’ve always had issues with disk I/O being very slow (one process uses the HDD too much, half the system goes into “disk sleep” mode). That might have been part of what happened here, not sure. But I imagine everyone else can confirm this one: Some schedulers are trying to improve it but the issue can’t be fully resolved overall.

On Mon 12 Nov 2018 10:56:02 PM CST, MirceaKitsune wrote:

> I’ve always had issues with disk I/O being very slow (one process uses
> the HDD too much, half the system goes into “disk sleep” mode).

Hi
Yes, that’s the other thing I change, scsi_mod.use_blk_mq=1
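
To check whether that option is actually in effect, one rough approach is to look at the kernel command line and at the scheduler file for the drive. The Python sketch below only parses text in the formats of /proc/cmdline and /sys/block/<device>/queue/scheduler; the function names and the sample strings are made up for illustration, and the device name is whatever your drive happens to be.

```python
def uses_blk_mq_option(cmdline_text):
    """True if the kernel command line enables scsi_mod.use_blk_mq."""
    wanted = ("scsi_mod.use_blk_mq=1", "scsi_mod.use_blk_mq=y", "scsi_mod.use_blk_mq=Y")
    return any(arg in wanted for arg in cmdline_text.split())

def active_scheduler(scheduler_text):
    """Return the scheduler shown in [brackets], or None if none is bracketed."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

# Sample strings in the style of /proc/cmdline and the sysfs scheduler file.
sample_cmdline = "root=/dev/sda2 quiet scsi_mod.use_blk_mq=1"
sample_scheduler = "mq-deadline kyber [bfq] none"

print(uses_blk_mq_option(sample_cmdline))
print(active_scheduler(sample_scheduler))
```

On a live system the two inputs would come from reading those files; the bracketed entry is the scheduler currently in use for that device.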



Already using that too. Hoping to see it defaulted by the kernel soon. Can’t say I sense a big difference but I hear it’s more modern and will be faster.

FWIW I’ve been using btrfs for a couple of years now, using snapshotting a lot ( install pkgs, test, previous snapshot ), and have no complaints. I’m even considering moving my entire SSD to one single btrfs partition.

From your description,
I’d speculate and am fairly certain that your problem is from a root partition that’s too small to support multiple large snapshots.
And sorry, your photo hurts my eyes too much to try to study its contents… So, there is a chance there’s something there that would invalidate my speculation.

So,
Let’s say you installed your entire system on a 128GB SSD with a default layout.
I haven’t done this, but I’ll guesstimate that your swap will be about 20GB and your /home about 40GB, leaving about 60GB for the root partition.
IMO that’s too tiny for practical use; the root partition should be at least 100GB to reasonably avoid running out of disk space using the default snapper configuration.
And, as Malcolm notes, the cutoff seems to be 50GB, below which the installer recognizes the root is too small to support snapshots and installs both root and /home on one partition instead of a partition for each. If your root partition is only a little bit larger, the installer won’t protect you from a possibly too-small root partition.
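
To make that arithmetic concrete, here is a small Python sketch using the guessed numbers above (128GB disk, 20GB swap, 40GB /home, a 50GB installer cutoff and a 100GB comfortable size). All of these are guesses from this post, not values taken from the installer, and the function names are made up. Plain subtraction gives 68GB, in the same ballpark as the “about 60GB” guess.

```python
def root_space_gb(disk_gb, swap_gb, home_gb):
    """Space left for / after swap and /home are carved out (plain subtraction)."""
    return disk_gb - swap_gb - home_gb

def assess_root(root_gb, installer_cutoff_gb=50, comfortable_gb=100):
    """Classify a root size against the two thresholds guessed in this post."""
    if root_gb < installer_cutoff_gb:
        return "below the cutoff"
    if root_gb < comfortable_gb:
        return "above the cutoff but under the comfortable size"
    return "comfortable"

root = root_space_gb(disk_gb=128, swap_gb=20, home_gb=40)
print(root, "GB for /:", assess_root(root))
```

A root partition that lands in the middle band is the risky case described above: large enough that snapshots stay on, small enough that they may fill it.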

It’s unusual for a new User to run out of root partition disk space within 12 or 24hrs, but it’s not impossible, particularly if you are performing a major system update (which you absolutely should do unless you installed from online sources) and installed several large personal applications.

If the above roughly describes your storage parameters,
It’s similar to the “single app” virtual machines I routinely create,
In my case the primary solution is to modify the default layout to install in a single partition, freeing up to 40GB which would normally be locked into /home so it’s available for the root partition and its snapshots.

But,
Consider that my “single app” virtual machines don’t do a lot of installing and uninstalling which might happen for an ordinary User’s multi-use machine… So, consider that not only with every update but also every install and uninstall a snapshot is being made to enable a rollback if necessary. So, you may still run into storage limitations.

In the end,
Your best option may be either installing EXT4 or if you install BTRFS to disable snapshots.
Consider also whether you installed Tumbleweed (you didn’t say whether you installed TW or LEAP): BTRFS with TW enables rolling back possibly faulty updates. With LEAP, your chances of a faulty update are much smaller, so it’s safer to choose EXT4 or disable snapshots.

HTH,
TSU

Regarding drive space: My SSD (root partition) is 256 GB. With the final packages installed, I’m using less than 80 GB. This should mean that over 170 GB of free space was left. I don’t know how the snapshot system uses that.

As for version: I installed Leap 15 then upgraded it to Tumbleweed. The issue was noticed when I had already finished configuring my system thus switched all packages to TW.

Hi
Ahh, I’m guessing the switch may have created large snapshots which probably needed a manual cleanup. Now that is not the filesystem’s fault, it’s down to snapper and snapshots :wink:

I have 60GB, 128GB and one 240GB SSDs here with /home on btrfs, separate data partition… all my setups are 40GB for /.

On one Tumbleweed system I see ~6.5GB allocated but this is a recent install. Another Tumbleweed system has 12.5GB allocated. This system is running SLES 15 with 10GB allocated. My openSUSE Leap 42.3 system has 9.3GB allocated. My SLED 12 SP3 system, which has been through numerous upgrades and has snapper running, has 24GB allocated. I have four other laptops here, not sure where they’re at… need to fire them up, plus four RPi3s with 16GB SD cards, two of them running SLES 15 on btrfs…

Anyway, probably all a moot point since you switched to ext4 :wink:

The first system freeze happened a few hours after the switch, but it finished after two hours or so… that one was likely calculating the large upgrade. The second one, surpassing four hours, started happening out of nowhere two days later, when I hadn’t upgraded anything again IIRC… that still puzzles me. But yes: I no longer have the install to test, as resetting my computer while snapper was busy broke the OS and it never booted again… ext4 is the only option that will feel safe for a long time.

Just as a general comment:

I tried “btrfs” for a while, running Tumbleweed. I did not run into the problems that you describe. However, I decided that “btrfs” doesn’t do anything for me, so I reinstalled with “ext4”.

I had trouble with btrfs-transi using 100% cpu and locking everything up a couple of months ago. The disk was a Kingston 120GB ssd and I was running Leap 42.3. This was not a new installation and it started happening “out of the blue”. As with the OP, I pressed reset after I couldn’t get the PC to reboot on one occasion and the result was a corrupted disk. I reinstalled with Leap 15.0 and used ext4. (Although I’m pretty sure there is at least one other computer here using btrfs).

Can’t say much else except I’m pretty well convinced I would have checked disk space after the first or second of these occurrences.

Had a somewhat similar experience a few years ago with a relatively small and slow disk, then reverted back to EXT4 and no snapshots on all my home systems and have been happy since.
Writing this from a laptop with a Samsung SSD with no glitch in years, so I would not blame the disk in the first instance, even if some old Samsung models had issues with outdated firmware in the past.
Maybe there is something in the configuration of that troubled install that makes problems worse, maybe size or layout of disk, frequent reboots or updates or whatever.
My understanding is the following:

  • expert admins (like Malcolm for sure) can avoid troublesome configs and have years of smooth operation even with btrfs and snapshots;
  • on systems that are frequently rebooted, like laptops, btrfs (with snapshots?) is likely to slow down the boot process and the first few minutes of desktop operation due to btrfs maintenance processes;
  • a relatively small / root partition (less than 40 GB these days) is likely to cause trouble with snapshots (the Forums are filled with such cases);
  • with servers or other 24/7 systems you had better set up btrfs maintenance to trigger when system load is light;
  • you may be lucky enough that the default installer settings do not cause troubles as bad as those reported here and you just live with that.

In a nutshell: you can have btrfs and snapshots and smooth operation if you know what you are doing or if you are lucky, but I agree that the default installer config is a risk for first time users.

Hi
I would also add that when upgrading from one release to another, some maintenance/backup etc. should be done first if running snapper, and snapper should probably be temporarily disabled during the upgrade process…

And also tune the number of snapshots in /etc/snapper/configs/root down from the default; if timeline snapshots are enabled, tune those down as well…
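
As a rough illustration of that kind of tuning, the Python sketch below rewrites a few KEY="value" lines in text shaped like /etc/snapper/configs/root. The key names (NUMBER_LIMIT, NUMBER_LIMIT_IMPORTANT, TIMELINE_CREATE and so on) are real snapper settings, but the sample values and the lowered numbers are only placeholders, not recommendations, and apply_overrides is a made-up helper. It works on a string and does not touch the real file.

```python
import re

def apply_overrides(config_text, overrides):
    """Return config_text with KEY="value" lines replaced for keys in overrides."""
    out = []
    for line in config_text.splitlines():
        match = re.match(r'^([A-Z0-9_]+)="(.*)"\s*$', line)
        if match and match.group(1) in overrides:
            key = match.group(1)
            line = '{}="{}"'.format(key, overrides[key])
        out.append(line)  # comments and untouched keys pass through unchanged
    return "\n".join(out)

# Excerpt in the style of a snapper config; values are illustrative only.
sample = '''# subset of a snapper root config
NUMBER_CLEANUP="yes"
NUMBER_LIMIT="50"
NUMBER_LIMIT_IMPORTANT="10"
TIMELINE_CREATE="yes"
TIMELINE_LIMIT_HOURLY="10"
TIMELINE_LIMIT_DAILY="10"'''

tuned = apply_overrides(sample, {
    "NUMBER_LIMIT": "5",
    "NUMBER_LIMIT_IMPORTANT": "3",
    "TIMELINE_CREATE": "no",
})
print(tuned)
```

On a real system the same changes can be made by editing the file as root or with snapper’s own set-config command, which avoids hand-editing.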

Thanks for your input. I’m glad others could confirm similar experiences: It would have been odd if I was the only one seeing this, as I didn’t do anything special in my installation that I can think of.

Part of this thread was to ask whether you think it’s a good idea to make future versions of the installer suggest ext4 by default and leave btrfs as an option for advanced users only. Many of us don’t know about the dangers this poses, such as btrfs becoming corrupted if you reset the machine at the wrong time apart from the 100% CPU issues… I could say I’m experienced with Linux today as I’ve used openSUSE for over 6 years, yet even I had no idea about this and it just hit me in the head as it all started happening suddenly. I’d imagine that if a less experienced user decided to try openSUSE for the first time and this happened to them, they’re going to assume it’s a dangerous distribution and never touch it again, thus the OS might even lose users this way.

My understanding is that SUSE invested heavily in BTRFS since that probably makes sense in a corporate environment, so they are not likely to give up the BTRFS default in their SLE systems.
Since we openSUSE users are sort of Guinea Pigs for SLE, maybe the BTRFS default is the price we must pay for the benefit of having a rock-solid distribution (at least when installed properly).
Long story short, I think that changing the default to EXT4 is a good idea with a low likelihood of being accepted.
Maybe we have better chances asking to tune the default along the lines Malcolm is suggesting, say snapshots off by default, lower number of snapshots configured and the like, so that the average install does not freeze for hours…

Well, to be honest installing Leap and then dist-upgrading to TW with snapshots on is something special, or at least not something every newcomer is expected to do…