Tumbleweed after Update to Kernel 3.17.1-52.1 has corrupted Btrfs Root Partition

consused · October 25, 2014, 4:23pm

Earlier this week my Tumbleweed, with root partition on btrfs from oS 12.3 through 13.1, finally fell after update “kernel-desktop-3.17.1-52.1.g5c4d099-x86_64” did its worst. Unfortunately, this being the first major Tw or btrfs failure I’ve had, it happens close to the end of its current life cycle.

Previously, with kernel-desktop-3.17.1-51.1, it ran trouble-free for several hours, with no relevant error messages (/var/log/messages). On rebooting after the kernel update (3.17.1-52.1), it seemed to proceed normally right through into KDE4’s desktop, but GUI applications such as YaST, Dolphin, internet browsers, etc., were all unusable because system files were reported to be non-writeable, according to desktop error messages. In other words, the root file system (including /home) had become read-only. Command line access through Konsole or tty was possible but limited to query or display commands, e.g. rpm -q, zypper search or list repos, etc., whereas zypper remove failed, and refresh failed repo by repo. The system is effectively rendered useless and unmaintainable!

Mounting the btrfs partition from a standard oS 13.1 system enabled easier investigation with its KDE4 but superuser Dolphin, etc., failed to provide any write access to the partition. Direct chroot access just confirmed the read-only status.

/var/log/messages contained many entries like this one after the initial “BTRFS info” message:

...kernel: [   22.259911] BTRFS info (device sda8): disk space caching is enabled
...kernel: [   23.318003] parent transid verify failed on 949858304 wanted 186937 found 186939
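
To gauge how many of these entries a log contains, the lines can be filtered down to the block number and the two transaction IDs. This is only a minimal sketch: the function name `list_transid_failures` is made up for illustration, and it assumes the message layout is exactly as quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any btrfs tool): summarise
# "parent transid verify failed" kernel messages.
# Assumes the message layout quoted in this thread.
# Reads log text from the file(s) given as arguments, or from stdin.
list_transid_failures() {
    # Print "block wanted found" for every matching line.
    sed -n 's/.*parent transid verify failed on \([0-9][0-9]*\) wanted \([0-9][0-9]*\) found \([0-9][0-9]*\).*/\1 \2 \3/p' "$@"
}
```

For example, `list_transid_failures /var/log/messages | sort -u` would list each affected block once, assuming the log is readable from wherever the partition is mounted.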

Running “btrfsck /dev/sda8” (from 13.1) on the unmounted partition reported many errors. However, 13.1 doesn’t have the latest version of btrfsprogs. Since my Tumbleweed partition includes no important user data, I took the last resort and ran btrfsck --repair. It eventually aborted with this:

Extent back ref already exists for 998006784 parent 24822198272 root 0 
Well this shouldn't happen, extent record overlaps but is metadata? [998006784, 4096]
Aborted

Subsequently, I see a relevant bug report at http://bugzilla.opensuse.org/show_bug.cgi?id=897774, and a thread somewhat strangely posted in our Applications forum at https://forums.opensuse.org/showthread.php/501741-btrfs-amp-3-17-kernel-Failsystem-turns-to-readonly.

Apparently this was all a known issue for kernel 3.17 and read-only snapshots. It’s certainly catastrophic for users of btrfs and snapper. Rebooting with previous kernels e.g. 3.16.3 doesn’t solve it.

I still have the corrupted Tumbleweed installed if anyone can suggest a repair? Otherwise I will have to reinstall it, probably over 13.2 release.

“The Tumbleweed is dead. Long live the Tumbleweed (to be regenerated on 4 November)”!

Romanator · October 27, 2014, 3:02am

Ref: http://bugzilla.opensuse.org/show_bug.cgi?id=897774
Copied from David Sterba’s response.

This was caused by a patch in 3.17 and the error is persistent on the image. There’s a fsck fix on the way.

It should be resolved in time for the release of openSUSE 13.2

consused · October 27, 2014, 6:03pm

That’s the same link I already provided, so it’s safe to assume I also read the comments in the bug report.

It doesn’t need to be resolved in time for 13.2, which is releasing with kernel 3.16.6 (and so shouldn’t have the offending patch). Kulow’s comments in the Factory ML, at http://lists.opensuse.org/opensuse-factory/2014-10/msg00622.html, confirm the kernel release version as that.

Many btrfs updates were on the way for 3.17, but apparently haven’t made it in time and will merge with 3.18 instead; see the Phoronix article “Btrfs Changes Rejected For The Linux 3.17 Kernel”.

I somehow doubt the fix will actually repair any corruption of extents on the file system (as seen by btrfsck), who knows?

Romanator · October 27, 2014, 9:24pm

That’s what happens when you don’t submit your patches on time. Linus has to be strict about the dates due to the large number of people contributing code.

I somehow doubt the fix will actually repair any corruption of extents on the file system (as seen by btrfsck), who knows?

Argh! A lot of people are trying out btrfs (including me) for the first time. Let’s hope that it doesn’t happen before 3.18 is available for download.

Check out this link: Re: Random file system corruption in 3.17 (not BTRFS related...?) — Linux BTRFS
The author suggests switching to writable snapshots instead of read-only snapshots.

Maybe the openSUSE kernel devs can backport the fsck patch.

consused · October 28, 2014, 3:46pm

Yes, it is a very serious bug affecting btrfs users blundering into 3.17 kernel, but fortunately doesn’t affect those on standard 13.1 and hopefully not on the new standard release.

I had seen that “linux-btrfs” link from the other thread (in Applications), but the workarounds are only useful to a system that has fixed the “read only” state of the file system image. Without that, the only part of Tumbleweed I have write access to is /boot separately partitioned as ext2. I could manually delete the 3.17 kernel(s), but it won’t solve anything unless I can regenerate the root partition’s btrfs, either from a previous cloned image (>20GB) or as in my case it means a reinstall.

On the other hand, the author’s belief that writeable snapshots don’t trigger the issue could imply that any real corruption may be limited to the read-only snapshots in /.snapshots. I had already noticed some more recent snapshots filed there, compared to those last reported by “snapper list”. I need to investigate the timing and content of those differences. It may also pinpoint exactly when the problem occurred, e.g. after the first 3.17.1 update rather than the second. Also, /var/log/Snapper.log has no entries after rebooting from the second 3.17.1 update.

consused · October 28, 2014, 11:20pm

Well, even if they do, I ran “btrfsck --repair”, which could mean the patched fsck can’t fix my corrupted snapshots, as stated in the last posting of the thread you linked to. However, there were no indications from btrfsck that it made any changes before aborting, so anything is possible.

I’ve now located corrupted snapshots on the file system under /.snapshots, named 3688, and 3691 through 3697. All were created on the system while running the 3.17.1-51.1 kernel. They are best viewed as directories under /.snapshots using command line, for example:

/.snapshots # ls -l 3688
ls: cannot access 3688/snapshot: Cannot allocate memory
total 356
-rw------- 1 root root 356950 Oct 20 18:36 filelist-3687.txt
-rw------- 1 root root    187 Oct 20 18:33 info.xml
d????????? ? ?    ?         ?            ? snapshot

The “snapshot” directory with the “?” marks identifies corruption (whereas Dolphin does not provide that clue, with no message/directory displayed). Compare that to a normal one (3687) created a few minutes before the now corrupted 3688:

/.snapshots # ls -l 3687
total 8
-rw------- 1 root root 202 Oct 20 18:30 info.xml
drwxr-xr-x 1 root root 186 Oct 20 18:14 snapshot

In fact 3687 is the “pre” snapshot of an update to “bash” (via zypper dup) and the missing one, 3688, is the “post” snapshot (it has the additional filelist to show changed files).

I say “missing snapshot” because when I now run “snapper list” on the system updated with 3.17.1-52.1, it fails to list snapshots 3688, and 3691 through 3697, as they are all corrupted with an inaccessible “snapshot” directory.

I hope this provides a useful illustration of what to look for if anyone else has similar problems in future.
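
The check illustrated above can be roughly automated by looking for the “d?????????” pattern in `ls -l` output. The following is a sketch only: both function names are hypothetical, and it assumes entry names contain no whitespace (true for snapper’s numbered directories and their “snapshot” subdirectory).

```shell
#!/bin/sh
# Hypothetical helper: read `ls -l` output on stdin and print the names of
# entries whose metadata could not be read (listed as "d?????????").
# Assumes entry names contain no whitespace.
list_unreadable_entries() {
    awk '$1 ~ /^.[?]+$/ { print $NF }'
}

# Hypothetical wrapper: run the check over every subdirectory of a
# snapshots directory (defaults to /.snapshots) and print the affected paths.
scan_snapshots() {
    dir=${1:-/.snapshots}
    for d in "$dir"/*/; do
        [ -d "$d" ] || continue
        ls -l "$d" 2>/dev/null | list_unreadable_entries |
            while read -r name; do printf '%s%s\n' "$d" "$name"; done
    done
}
```

Running `scan_snapshots` as root on an affected system should, under these assumptions, print paths such as /.snapshots/3688/snapshot, while a healthy directory like 3687 produces no output.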

consused · October 30, 2014, 12:16pm

Just seen a recent bug report re 3.17.1/btrfs for Archlinux at FS#42563 : [linux] [btrfs-progs ] kernel 3.17.1-1 with btrfs-progs 3.16.2-1 corrupts fs using btrfs snapshots, with comments about a potential workaround/fix using btrfsprogs 3.17, but it’s not in Tumbleweed yet and must first appear in Factory. A new comment has appeared in openSUSE’s bug report, also requesting btrfsprogs 3.17.

The archlinux report comments on a fix in kernel 3.18rc2.

consused · November 4, 2014, 4:30pm

It seems clearer now that this bug is fixed in kernel 3.17.2, or it will be when it arrives in factory, and more specifically in new Tumbleweed. That archlinux report now has a comment at the end, and there is also this external thread at Re: Kernel 3.17.2 and RO snapshots — Linux BTRFS. It confirms that “Data corruption occurs when creating RO snapshots”, and it seems to only affect the snapshots.

For repairing my old Tumbleweed: btrfsprogs 3.17 will first be needed to correct the metadata, and kernel-desktop 3.17.2 to avoid the issue again. Neither is in Factory as of today, although 3.17.2 is in Kernel:stable (OBS, 1-click). We wait in anticipation…

salaman · November 5, 2014, 3:55am

I just fixed my version of tumbleweed that was affected by this read-only snapshot bug. Here’s how I did it:

1. Note: I’ve heard this only works if you haven’t yet run the btrfs repair command. Keep your backups at the ready.
2. Upgrade to kernel-desktop 3.17.2 and btrfsprogs 3.17 via software.opensuse.org.
Note: I didn’t subscribe to their associated repos. The key is that from now on you don’t downgrade back to the old buggy versions.
3. Reboot and confirm both upgrades are running/active. Hint: use

uname -r

in the terminal. Also check yast’s software manager.

4. Grab yourself another linux system running kernel 3.17.2 and btrfsprogs 3.17. This is necessary because the btrfs command we will run cannot be performed on a mounted filesystem.
I used a live usb of archbang from november 3rd. It came with linux 3.17.2 and once booted I used

sudo pacman -Syyu

to bring the system up to date and

sudo pacman -S btrfs-progs

to upgrade btrfs-progs to version 3.17.

5. From your extra linux system that is now also fully upgraded, run

sudo btrfs check --repair /dev/sdxy

with x and y corresponding to your corrupted root btrfs partition. (Encryption obviously makes this more complicated, but I managed to stumble through it. You can too.)

6. With any luck, you can boot into your tumbleweed install and the corrupted snapshots will be gone. The snapper gui and its command line equivalent should both look nice and tidy. If this isn’t the case or if you have experienced other file system corruption you might need to nuke and pave. Remember those backups from earlier?

7. Now you need to keep 3.17.2 and 3.17 around until the tumbleweed repos match or exceed those versions. Of course, it would have been a little easier to wait for the tumbleweed repos to update to 3.17.2 and 3.17 but where’s the fun in that?
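
The version check in the procedure above (confirming the running kernel is at least 3.17.2) can be sketched in shell. Both function names here are hypothetical, and the comparison relies on `sort -V`, which is available in GNU coreutils but not guaranteed on every system.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 is at least version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: strip everything from the first "-" onwards, so a
# release string like "3.17.1-52.1.g5c4d099-desktop" compares as "3.17.1".
kernel_base_version() {
    printf '%s\n' "${1%%-*}"
}
```

A possible use is `version_at_least "$(kernel_base_version "$(uname -r)")" 3.17.2 && echo "kernel is new enough"`. The same comparison could be applied to the btrfsprogs version once it has been read off manually.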

salaman · November 5, 2014, 4:09am

Looks like I missed the 10min edit window.

After running the btrfs command it should output 0 errors. If it outputs 8 or a different number, that is not good and it needs to be looked at.

consused · November 5, 2014, 1:59pm

Thanks for posting such a well-formatted procedure.

The corruption is one part of the failure, another is the persistent read-only state of the whole btrfs root partition (even without running btrfsck --repair). On my system, the latter appears to prevent the means to effect any software upgrade or any other changes on the file system. Therefore your “2. Upgrade to…” suggests that the read-only condition of the file system had not occurred on yours at that stage. Is that right?

I was under the impression from the openSUSE bugzilla that 3.17 version of btrfsck was required to first fix the read-only file system. Yes, that would need to be done via another operating system, before any upgrading could take place.

I assume you used Kernel:stable to source 3.17.2, but for btrfsprogs 3.17 there were only unstable “home:user” versions available, so which one did you use?

salaman · November 5, 2014, 7:48pm

You’re welcome.

That’s right, my file system damage was limited to some corrupted read-only snapshots. I could read and write as I pleased and there was no other corruption that I could find. I’m unsure as to why some people (such as yourself) are having problems with the whole filesystem being read only. I’ve skimmed the upstream kernel mailing list and can’t figure out why some systems go read-only while other systems only have snapshot or other limited corruption problems.

I’m speculating here but I also have the impression that you can repair an unupgraded read-only filesystem. Maybe then it wouldn’t be read-only. Then you’d be able to upgrade it and repair it again. Or perhaps doing this would make the damage even worse.

Yes, you need version 3.17 of btrfsprogs. It contains btrfsck. According to btrfs.wiki.kernel.org “btrfsck is an alias of btrfs check command and is now deprecated.”

For btrfsprogs 3.17 I used

home:ojkastl_buildservice

from openSUSE Factory.

Yep, for kernel-desktop 3.17.2, I used Kernel:stable from openSUSE Factory. This one is signed differently from my previous version so I had to disable secureboot.

piggz · November 5, 2014, 10:43pm

The BTRFS corruption bug was the last straw for me and that filesystem. I’ve used BTRFS since installing 13.1 on this laptop for / and /home. I have found the performance to be terrible, but have put up with it due to the hassle of re-installing. To summarise:
I have a 1TB hybrid drive.
I’ve had nodatacow as my mount option for some time, but that hasn’t helped much.
I have run out of space on / because of snapper snapshots.
I find that btrfs-transaction processes are regularly running at 100% in iotop.
Running VirtualBox VMs is dreadful.
Installing software with zypper/rpm is slower than on my old core2duo laptop with 5400rpm disk.
Compiling software is slow, with gcc often being stuck at 100% IO.
I regularly join #btrfs on freenode, but have not got any suggestions that have improved performance. Often there is criticism of KDE’s semantic search, but I’ve got that turned off anyway.

I decided to update to 13.2, and after downloading all the RPMs, my filesystem went ro. I downloaded the 13.2 iso instead and chose ext4 for / (with /home remaining on btrfs at the moment). I don’t think I will look back; instantly this laptop feels faster. Software installation flies by. I just need some time to back up /home so I can format it, and I’m sure my problems building software will be sorted.

I’m sure other people will have positive experiences with BTRFS, but if I have to turn off all the fancy features like datacow, compression and snapshots to get decent performance, I may as well use a simpler fs.

consused · November 6, 2014, 1:10am

I can understand your frustration with this bug, but those of us running Tumbleweed always run the risk of updating to a potentially unstable kernel. It would be a pity if this specific thread turned into a general rant about btrfs, even with real issues.

I’ve used BTRFS since installing 13.1 on this laptop for / and /home. I have found the performance to be terrible, but have put up with it due to the hassle of re-installing.

Clearly some are experiencing performance problems, e.g. this bug report 841797 – BTRFS keeps fragmenting leading to unacceptable performance

I have run out of space on / because of snapper snapshots.

I doubt you will be the last to do that, in spite of previous concerns and warnings in these forums wrt aggressive default snapshot settings for average desktop users.

I decided to update to 13.2, and after downloading all the RPMs, my filesystem went ro.

To be clear, did you run “zypper dup” from a Tumbleweed system with kernel 3.17.1 at that time, with snapper active?

consused · November 7, 2014, 8:59pm

Our speculation is now fact, as of today! Removal of the read-only state of the root file system/partition and repair of /.snapshots was successful. AFAICT, all the affected snapshots were recovered.

Using btrfsprogs 3.17, “btrfs check” reported “Found 10 roots with an outdated root item.”

Using btrfs “check --repair”, it reported “Fixed 10 roots.” and after various checks reported “found 15108884972 bytes used err is 0”.
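
When the check output is long, the final error figure can be pulled out of a saved copy of it. This is a sketch that assumes the summary line has exactly the form quoted above (“found … bytes used err is N”); the function name `check_err_count` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the error figure from a summary line such as
# "found 15108884972 bytes used err is 0" (format as quoted in this thread).
# Reads from the file(s) given as arguments, or from stdin.
# If several summary lines are present, only the last one is reported.
check_err_count() {
    sed -n 's/.*bytes used err is \(-*[0-9][0-9]*\).*/\1/p' "$@" | tail -n 1
}
```

For instance, after saving both output streams of a check run to a file such as check.log (a made-up name), `[ "$(check_err_count check.log)" = 0 ]` succeeds only when the reported figure is 0.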

My procedure was a little different:

1. Download and burn “openSUSE-Factory-Rescue-CD-x86_64-Snapshot20141105-Media.iso” as listed on Wiki’s Tumbleweed Portal.
2. Boot with Rescue CD (First time I’ve used openSUSE’s Rescue CD - pretty good).
3. Install btrfsprogs 3.17 from “http://download.opensuse.org/repositories/filesystems/openSUSE_Tumbleweed/”. Only found this available (since 4/11) when I searched OBS specifically for Tumbleweed, so avoided using development “home:user” source. Note: Only required while unavailable from main Tumbleweed oss repo.
4. Run “btrfs check --repair /dev/yourpartitionid” on the unmounted partition; this requires superuser privileges.
5. If the repair is successful, [cross everything], and restart Tumbleweed with the 3.16.3 kernel.
6. Remove the 3.17.1 kernel.
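
Since the repair must run on an unmounted partition, a small guard can be put in front of it. The sketch below only prints the command rather than running it; the function names are hypothetical, and it assumes the device is listed in the mount table under the same name you pass in (a partition mounted via /dev/mapper or a by-uuid path would not be caught).

```shell
#!/bin/sh
# Hypothetical guard: succeed if device $1 appears as a mount source in the
# mount table $2 (defaults to /proc/mounts on Linux).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical wrapper: print (not run) the repair command, and only when
# the device does not appear to be mounted.
print_repair_command() {
    if is_mounted "$1" "${2:-/proc/mounts}"; then
        echo "refusing: $1 is mounted" >&2
        return 1
    fi
    echo "btrfs check --repair $1"
}
```

Printing the command first leaves the final decision, and the responsibility for backups, with whoever is at the keyboard.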

Snapper list and btrfs stats had all returned to normal.

I didn’t need to install a 3.17.2 kernel, so I managed to avoid this step:

Following my downtime, old Tumbleweed repo is now empty and the more recent Packman-tumbleweed packages will have missing dependencies, so I just updated using YaST Online Update on the 13.1 update repos. Now I just have to decide when and what to do wrt upgrading to new Tumbleweed.

My thanks to all who posted here with suggestions, solutions, and information.

salaman · November 11, 2014, 7:09pm

That’s great news! I’m glad you got it fixed.

I was unaware that opensuse offered a rescue cd. I’m going to put it on a USB stick for the future. Just in case.

consused · November 14, 2014, 4:03pm

Thanks. Definitely fixed, although the fun & games continued a while. Decided to re-base on 13.2 via zypper dup with about 2000 package changes, nearly all upgrades which wasn’t always the case with old Tumbleweed. That worked well, grabbing the last 5GB of a 35GB root partition into the btrfs pool with used space still below 30GB as always. After a few days testing, I anticipated a relatively small upgrade from there to new Tumbleweed, as in the past.

Oh… that was my reckless assumption! With yet another 2000 package changes downloaded, I had to abort zypper dup during the installing phase with just 200+ packages short of the finish line and space running out. Luckily a reboot worked well enough for removing many of the older snapshots, and yet another zypper dup installed the remainder from the previous cache.

New Tumbleweed is working very well so far, but snapshot “cleanup” needs to be even more aggressive [or get a new faster laptop]. I still feel there is a performance overhead with btrfs+snapshot management. Good facilities for developers/testers, but not sure it yields much advantage for normal desktop users.
Well, finally kernel 3.17.2 and btrfsprogs 3.17 were installed from the Tumbleweed repo today, and are running OK.

I was unaware that opensuse offered a rescue cd. I’m going to put it on a USB stick for the future. Just in case.

Yes, it’s not the fastest rescue liveCD (on CD) on the planet, but it sports familiar desktop and tools e.g. YaST package management and zypper.