btrfs problem with subvolume -- ERROR: errors found in root refs

I was trying to run 3DMark through Steam Proton, and the Steam install script tried installing .NET, which launched hundreds of zombie processes and stalled my system. I tried killing wineserver, but it was completely unresponsive and just kept launching more zombie processes. So I eventually tried rebooting the system, since I could at least still move the mouse cursor, but the reboot attempt was also unresponsive. I was at least able to switch to a console with Ctrl-Alt-F2, where I tried init 6, init 0, systemctl reboot, reboot -f, and even reboot -f -f, and all of them failed. After about half an hour, I figured I was out of options, and since there didn't seem to be any activity on the system, I used the reset button.
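For anyone who ends up in the same spot: one thing I didn't try, assuming the kernel's magic SysRq support is enabled (the kernel.sysrq sysctl), is forcing a sync and reboot from the kernel itself, which is gentler than the reset button:

  # Sync all filesystems, remount everything read-only, then reboot
  # (the tail end of the well-known REISUB sequence); the same keys
  # can be typed at the console as Alt+SysRq+s, Alt+SysRq+u, Alt+SysRq+b
  echo s > /proc/sysrq-trigger
  echo u > /proc/sysrq-trigger
  echo b > /proc/sysrq-trigger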

After powering off and trying to reboot, the boot process halted due to errors on the system drive, and I got a maintenance prompt. I ran btrfs check, and there were about half a dozen errors on the root partition. I ended up running btrfs check --repair before noticing the warning about it basically being in an alpha state and possibly causing more problems than you started with. It did at least get rid of the csum errors, but it also seems to have invalidated the free space cache. After rebooting again, there is no usable free space, so the system is still unable to boot because it cannot create the system journal, and btrfs check still shows one remaining error. It would seem that the @/var/crash subvolume is either missing, or just the reference to it is missing. I'm not sure which, since the error is rather vague, and I can't find any relevant information about it on the btrfs wiki, and I haven't been able to find anything useful with Google either.
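For reference, this is roughly the sequence I ran from the maintenance prompt (with /dev/sda2 standing in for your actual root partition):

  # Read-only check first; this never modifies the filesystem
  btrfs check /dev/sda2
  # --repair writes to the filesystem and prints a prominent warning;
  # in hindsight, read that warning before running it
  btrfs check --repair /dev/sda2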

Here is the relevant info from btrfs check:

checking root refs
fs tree 264 not referenced
ERROR: errors found in root refs
found 35524771840 bytes used, error(s) found

total csum bytes: 297454408
total tree bytes: 1291567104
total fs tree bytes: 1176010752
total extent tree bytes: 72171520
btree space waste bytes: 190606495
file data blocks allocated: 48896675840
referenced 45388656640

I used the openSUSE rescue utility on the install disk and tried running btrfs check --clear-space-cache v1 on the relevant partition to fix the free space problem, but I am still unable to write to the drive, and it's still showing no available free space even though the drive is not full. So I don't know if there is still a problem with the free space cache, or if the partition is just being automatically mounted read-only due to the subvolume error. I haven't tried to explicitly mount it with the rw option, and I didn't want to force it if it's a safety feature.

When I list the subvolumes with the drive mounted, it shows subvolume 264, which is @/var/crash, but when checking the subvolume list from the rescue CD, it's missing from the list. I'm not sure if I should try to delete it, try to create a new one, or if there is another way to repair it. I don't have any snapshots.

Right now, it looks like the data on the drive is all intact, so I don't want to do anything to jeopardize it. I'd like to run scrub, but it doesn't sound like that's going to fix a subvolume problem, and I think I probably need to fix the subvolume problem before sorting out the free space issue, but I'm not really sure where to go from here. So if anyone can offer any advice or help, I would greatly appreciate it.
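For reference, these are the commands involved, run from the rescue environment (with /dev/sda2 standing in for the actual partition):

  # Clear the v1 free space cache; it is rebuilt on the next mount
  btrfs check --clear-space-cache v1 /dev/sda2
  # Explicit read-write mount attempt (btrfs flips to read-only on
  # its own after certain errors)
  mount -o rw /dev/sda2 /mnt
  # List all subvolumes; @/var/crash (id 264) should appear here
  # if its reference still exists
  btrfs subvolume list -a /mnt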

@drakkar123:

Maybe, possibly, you've discovered a use case («Steam + .NET (and C#) + Wine») which indicates that a Btrfs system partition needs to be somewhat larger than the default size.

  1. If possible, try re-installing with a Btrfs system partition of at least 80 GB – after backing up your user partition …
  2. If you're having to live with less than 250 GB of disk space, then consider re-installing with an ext4 system partition – with enough space for the /tmp and /var directories – or, in the worst case, a single ext4 partition for everything, system and user directories …
  3. If you want to remain with Btrfs, then ensure that the Snapper snapshots are regularly purged – especially if your system disk is smaller than 250 GB (see the sketch after this list) …
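For item 3, a minimal sketch of inspecting and pruning snapshots by hand, assuming the default root configuration (the snapshot numbers are hypothetical):

  # Show the existing snapshots for the root configuration
  snapper -c root list
  # Delete a range of old snapshots by number
  snapper -c root delete 42-57
  # Or run the configured cleanup algorithm to enforce the limits
  snapper -c root cleanup number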

This past week I also suffered a similar scenario on a machine…
The machine experienced a problem which required it to be powered off instead of gracefully shut down… causing BTRFS problems…
Unfortunately, in my case the problems were unresolvable, so the system had to be wiped and re-installed.
Lesson learned for anyone reading this: don't ever power off unless you're willing to risk unresolvable corruption. And, IMO, the BTRFS people really need to figure out how to address this; it can be a near show-stopper for many people (I've never experienced an unrecoverable file system corruption due to a power-off using EXT).

Although my situation was unrecoverable, that's not necessarily the case for others.

My recovery efforts turned up the following articles, which might provide you more help than most:

The openSUSE “Common Problems and Their Solutions” guide: if you need to repair GRUB, it describes how to do that; otherwise, if your system at least boots to the GRUB menu but fails afterwards, various recovery methods are described in the “Data Problems” section.

https://doc.opensuse.org/documentation/leap/startup/html/book.opensuse.startup/cha.trouble.html#sec.trouble.boot

You can try “restoring” your corrupted partition, disk, or device to a new mount point. At the very least you can recover your data, and it might be a step towards a full recovery.

https://btrfs.wiki.kernel.org/index.php/Restore
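A minimal sketch of what that looks like, assuming the damaged filesystem is on /dev/sda2 (substitute your own partition) and a healthy filesystem with enough free space is mounted at /mnt/recovery:

  # Dry run first: -D lists what would be restored without writing anything
  btrfs restore -D /dev/sda2 /mnt/recovery
  # Then the actual salvage; this only reads from the damaged filesystem
  # and copies the files it can reach to the destination
  btrfs restore /dev/sda2 /mnt/recovery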

Advice on resolving space problems like the one you describe can be found in the kernel.org BTRFS FAQ; you might find some interesting ideas in other sections as well.

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_get_.22No_space_left_on_device.22_errors.2C_but_df_says_I.27ve_got_lots_of_space
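The usual first step suggested there is a filtered balance, which can return allocated-but-nearly-empty chunks to the free pool; a sketch, assuming the filesystem is mounted at /mnt:

  # Show how space is allocated versus actually used
  btrfs filesystem df /mnt
  # Rewrite only the data chunks that are at most 5% full, freeing
  # the reclaimed chunks back to the unallocated pool
  btrfs balance start -dusage=5 /mnt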

Good Luck,
TSU

I try to always cleanly shutdown the system.

Around two weeks ago, I ran into a system that would not shut down (it was Tumbleweed). It got past the point of “Reached target Shutdown”, but it still kept trying to disable swap and unmount the root file system. It seemed to be in an unbreakable loop. So I did a forced power-off (held down the power button until it stopped).

For safety, I then booted to a USB and ran “fsck” (the root file system was “ext4”). It recovered from the journal without a problem.
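In case it helps anyone following along, that looks roughly like this from a rescue USB, assuming the root filesystem is on /dev/sda2:

  # Never run fsck on a mounted filesystem; from a rescue USB the
  # installed root partition is not mounted yet
  fsck.ext4 -p /dev/sda2    # -p: automatically fix anything safe to fix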

I’m still trusting “ext4” more than I trust “btrfs”.

This indicates real filesystem corruption, which is unlikely to be solved in this forum. The most straightforward option is to simply recreate the filesystem (of course losing data). You may be able to recover data using “btrfs restore” (you need sufficient free space to store the same amount of data somewhere else). Finally, if you want to try to salvage the filesystem, contact the btrfs mailing list. There are developers there who are quite responsive to this type of question and sometimes come up with a purpose-built btrfs tool that does the repair. Of course, it may not be possible to repair, but you will never know until you try.
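A rough way to size that destination, using the numbers from the btrfs check output above (device and mount point are placeholders):

  # “referenced 45388656640” in the check output is ≈ 42.3 GiB of file
  # data, so the destination needs at least that much free space
  df -h /mnt/recovery
  # Dry run: -D lists what would be restored without writing anything
  btrfs restore -D -v /dev/sda2 /mnt/recovery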

Thank you for the advice; it is helpful. I knew my issues were going to require someone with intimate knowledge of btrfs. I had hoped there might be someone on here who knows btrfs well, especially with it being the default root filesystem for openSUSE. I have been considering contacting the developers, but I wanted to check here first, and I wasn't too sure how they would respond to being contacted about user problems either.

I realize I could solve the problem by reformatting the partition and just doing a clean install, and I always keep my /home directory on a separate partition, so it's unaffected. But I have a lot of hand-configured files in /etc, several source installs I'd have to redo, and it would take me several weeks to go through and completely re-install all of the repo packages that I have installed. Ultimately, it wouldn't be a complete disaster to have to do that, but it would still be a major pain, and one I'd like to avoid if I can.

The problem is certainly serious enough to prevent the system from completing the boot process, but I still think it should be fairly straightforward to fix for someone who knows btrfs much better than I do. Perhaps not, but I'm fairly sure that either the reference for the @/var/crash subvolume is missing and the directory is present, or the reference is present and it's the directory that is missing. Either way, there must be a way to recreate it. I just didn't want to experiment too much and risk data loss, and I haven't been able to find any relevant documentation anywhere that would help either.
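In the meantime, it should at least be possible to copy the irreplaceable bits out from the rescue system; a sketch, assuming the root filesystem can still be mounted read-only at /mnt and a USB drive is mounted at /mnt/usb:

  # Mount the damaged root filesystem read-only to avoid further changes
  mount -o ro /dev/sda2 /mnt
  # Archive /etc (and anything else hand-configured) to external media
  tar czf /mnt/usb/etc-backup.tar.gz -C /mnt etc
  # A package list makes a later re-install much faster; --root queries
  # the damaged system's RPM database rather than the rescue system's
  rpm --root=/mnt -qa | sort > /mnt/usb/package-list.txt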

All of the data on the affected partition at least seems to be intact, and the only directory that seems to be affected is /var/crash. Do you happen to know what is normally stored in /var/crash, if anything, or is it just a crash dump directory?

It is just a crash dump directory.

It is clear that you can recreate it, but leaving the filesystem with obviously inconsistent metadata is just too dangerous. This inconsistency needs to be resolved first.
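For completeness, recreating it would look something like the following (a sketch only; given the inconsistent metadata, don't run this until the root-ref error itself is resolved, and the device name is a placeholder):

  # Mount the top-level subvolume (id 5) so every subvolume is visible
  mount -o subvolid=5 /dev/sda2 /mnt
  # Remove the damaged subvolume if its directory is still present …
  btrfs subvolume delete /mnt/@/var/crash
  # … then create a fresh one in its place
  btrfs subvolume create /mnt/@/var/crash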

With other filesystems, lost+found may contain files/directories and fragments. I don't know if this is so with BTRFS.

I am not seeing any “lost+found” directory in a “btrfs” file system.

Okay, thank you for the information.