openSUSE is my favorite Linux distribution because it’s modern, user-friendly, and has a lot of good ideas compared to other distros. This is probably the first time I have had to openly criticize a decision taken by the team, one which I now believe puts users at risk without many even knowing it and which should be urgently reviewed. I’d like to start with a bit of background on what exactly happened so people can understand my point and the reasons behind it.
A few days ago I bought my first SSD, on which I did a clean reinstall of openSUSE for the first time in 6 years. Previously I had the OS on my old mechanical drive on an ext4 partition, which had always worked perfectly. Since I was doing a fresh install, I figured I’d use a modern setup and go with the partitioning scheme suggested by the installer… namely using btrfs as my filesystem and enabling snapshots (at the time I didn’t even know what those did but left the option at its default of on). I then proceeded with the installation and everything seemed to work perfectly. Little did I know that hell was about to break loose inside my computer.
Within the following hours my machine suddenly started freezing. I could tell it was not a GPU lockup but something slowing it down so much that even the NumLock / CapsLock LEDs couldn’t be toggled for minutes at a time (only the mouse cursor could be moved). I didn’t understand what was happening at all and thought my new drive must be having hardware issues. Then I managed to go into another runlevel and run “top” to see what was eating my system resources. I found a few processes to be responsible, namely btrfs-cleaner / btrfs-transacti / snapperd. The next scary part was discovering that they couldn’t be killed (“kill -9 PID” was ineffective), so I had to wait for over an hour before the activity stopped on its own. I also saw contradictory information: top showed 100% CPU usage but KSysGuard said it was only 8%. I expressed concern about this in another thread and was told that it’s usually a one-time event and should cease once the first “zypper dup” is cached by snapper.
That seemed to be the case, as an entire day went by without any problems. Then all of a sudden it started happening again last night, despite me not even making any new system changes to prompt it: Snapper and the associated btrfs processes froze my system to the point where I had to wait 5 minutes to even open a small program, and every click caused the machine to freeze completely for an entire minute. I tried watching a YouTube video hoping it would pass, which was itself difficult as the playback froze and sometimes websites stopped rendering entirely… 4 hours went by and it didn’t finish. When I looked in YaST - Snapper I saw it had created a dozen snapshots in a matter of minutes; however, all of them were empty and showed no changes, even though they were marked as “pre & post”.
Eventually I decided to restart my computer. I issued the reboot command and waited for several minutes. When I saw that there was still no response and I was now stuck at the shutdown splash screen, I pressed the reset button to speed up the process. Big mistake: That act alone corrupted my openSUSE installation and rendered it unusable. I was never able to boot it again from this point on: I remained stuck at the splash screen, unable even to switch to a different virtual console (Ctrl + Alt + F-key) to input commands, with the console throwing countless “dependency failed” messages for root directories plus drive timeout errors. This is a photo I took with my phone showing the errors the boot process got stuck at each time:
https://i.imgur.com/DwmI8jU.jpg
I went to bed at 8 AM after trying to fix it, to no avail. I could only boot a rescue console from the installation DVD, and although I was able to mount the root partition I couldn’t mount its subvolumes, so directories like /var were inaccessible.
Today I had to do another fresh install of openSUSE. This time I selected ext4 for my partition like before and made sure snapshots were not enabled. I finished configuring it earlier and my OS now works quickly and flawlessly as it always has. The experience I went through was stressful to say the least, as it all happened rapidly and came as a complete surprise. With this I’m hoping to make my point to the developers and other users alike:
Users: Do NOT use btrfs and do NOT enable snapshots, unless you absolutely require them and know very well what you’re doing. In my experience this will make your life hell and can eventually destroy your installation! I suggest sticking to ext4, which I can confirm from years of experience is trustworthy and reliable, or to other classic file systems that have been in development since the old days of Linux (I hear xfs and zfs are also good, however I have never used those).
Admins: If the team cares for the well-being of users, please consider updating the installer so that it stops suggesting btrfs and system snapshots. Default back to ext4 and make snapshots an option for advanced users only. As I could confirm from my experience, snapper and btrfs are far too unstable and risky: A hard restart or a power outage at the wrong time (e.g. while snapper is working) can break the installation as it did for me. Snapper can also render the system unusable, freezing it for minutes at a time and leaving the machine inoperable for hours whenever it randomly decides to start working in the background (note that my root is on an SSD, which is the fastest type of drive). I get the idea behind this tool, which in itself is pretty good, but right now both btrfs and snapper are too unstable to use and pose a huge liability. I’d rather this didn’t happen to more users before we can all agree that something is wrong and that a lot more work needs to be done on those tools before they are safe.