There are BTRFS errors. Systemd journal errors at the end are self-explanatory.
Which procedure was used to create the secondary installation?
I appreciate your reply … and you’re correct - as I stated, I see the BTRFS errors and the journal errors.
But what I’m hoping to determine is whether the journal errors are showing up because of the BTRFS errors, or whether it’s the journal issue causing the BTRFS problems. I assume that would determine what needs to get fixed.
It seems you’re hinting that the BTRFS errors are the main cause? If yes, what do you suggest as a fix?
====
To answer your question about the installation procedure: I did it the way most anyone would do a fresh installation to a desktop or laptop. (Keep in mind, both of these TW installations have been running for at least three years now on this machine. We have a second desktop that’s configured exactly the same way.)
Line 2049 in your paste shows that after the btrfs failure the filesystem is forced read-only, so no write processes can take place. There are no permanent issues with the journal itself; it only warned that it can’t operate on a read-only filesystem. SDDM also failed under this state, even though you could log in at the command line and start Plasma directly. You won’t be able to persist anything on this filesystem until it is fixed (no persistent journal, no system updates, etc.).
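A quick way to confirm that state from a shell on the affected system (a sketch; the root filesystem is assumed to be the one affected):
findmnt -no OPTIONS / | tr ',' '\n' | grep -w ro   # prints "ro" if the kernel flipped the mount
dmesg | grep -i 'forced readonly'                  # the btrfs message that forced it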
This wiki entry can be helpful for your troubleshooting; read it carefully: SDB:BTRFS - openSUSE Wiki
Otherwise that would be above my paygrade.
Thanks @awerlang … I will read the Wiki article
I am booted into the secondary TW install I have on the secondary NVME drive
I just ran btrfs check on the root partition on the primary drive and this is the result of the check (the “not enough memory” error is strange):
ren :~ # btrfs check /dev/nvme1n1p3
Opening filesystem to check...
Checking filesystem on /dev/nvme1n1p3
UUID: 2b2def5c-620e-4317-91ba-8bf47ced7401
[1/7] checking root items
[2/7] checking extents
data extent[1766604800, 4096] referencer count mismatch (parent 1308884992) wanted 0 have 1
data extent[1766604800, 4096] bytenr mismatch, extent item bytenr 1766604800 file item bytenr 0
data extent[1766604800, 4096] referencer count mismatch (parent 7945814016) wanted 1 have 0
data extent[1766604800, 4096] referencer count mismatch (parent 8012922880) wanted 0 have 1
data extent[1766604800, 4096] bytenr mismatch, extent item bytenr 1766604800 file item bytenr 0
data extent[1766604800, 4096] referencer count mismatch (parent 1241776128) wanted 1 have 0
backpointer mismatch on [1766604800 4096]
data extent[1766858752, 4096] referencer count mismatch (root 257 owner 3698286 offset 3538944) wanted 0 have 1
data extent[1766858752, 4096] bytenr mismatch, extent item bytenr 1766858752 file item bytenr 0
data extent[1766858752, 4096] referencer count mismatch (root 257 owner 3697262 offset 3538944) wanted 1 have 0
backpointer mismatch on [1766858752 4096]
data extent[1767280640, 4096] referencer count mismatch (root 257 owner 3698286 offset 3047424) wanted 0 have 1
data extent[1767280640, 4096] bytenr mismatch, extent item bytenr 1767280640 file item bytenr 0
data extent[1767280640, 4096] referencer count mismatch (root 257 owner 3698282 offset 3047424) wanted 1 have 0
backpointer mismatch on [1767280640 4096]
data extent[1771085824, 8192] referencer count mismatch (parent 1321697280) wanted 0 have 1
data extent[1771085824, 8192] bytenr mismatch, extent item bytenr 1771085824 file item bytenr 0
data extent[1771085824, 8192] referencer count mismatch (parent 1321435136) wanted 1 have 0
backpointer mismatch on [1771085824 8192]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
ERROR: bytenr ref not found for parent 7945814016
ERROR: not enough memory: accounting for refs for qgroups
ERROR: failed to check quota groups
found 21178617856 bytes used, error(s) found
total csum bytes: 18968900
total tree bytes: 830472192
total fs tree bytes: 770867200
total extent tree bytes: 32800768
btree space waste bytes: 217723490
file data blocks allocated: 65441669120
referenced 58874085376
ren :~ #
There are no hardware errors (at least, during this boot). The message “start tree-log replay” implies that the filesystem was not properly unmounted. The filesystem is corrupted, probably from earlier (could be due to an unclean shutdown). If you are interested in keeping this filesystem, post a question to the Btrfs mailing list - btrfs Wiki (kernel.org), and provide this dmesg output and the output from btrfs check.
btrfs check needs a lot of memory for metadata. You could try btrfs check --mode=lowmem ...; it may find additional problems. But seeing that this message was in the quota groups check, this is probably the least interesting problem (quotas can be rebuilt at any time).
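And since quotas can be rebuilt at any time, once the filesystem is writable again something like this should restore consistent qgroup accounting (the mount point / is assumed):
btrfs quota rescan -w /   # recompute qgroup numbers and wait for completion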
Yea, I checked for any drive errors also. As far as the “probably an unclean shutdown” goes - minutes before this BTRFS issue showed up, I had just completed a zypper dup, i.e., I did the dup, then restarted the system as usual. Upon boot, the BTRFS errors showed up.
I personally did not do any drive mounting or un-mounting.
BTW - when the BTRFS issue first showed up, I immediately ran a dmesg.
Then the second time I booted into that TW installation (the next day), I got the same BTRFS errors, ran another dmesg, and compared the two - they contain exactly the same entries.
Yea, I’ll give the mailing-list submission option a couple of hours of thought / consideration. Your key phrase, “If you are interested in keeping this filesystem”, is what’s holding me off, because I’m also considering switching away from BTRFS. I may submit there and give it a day to see what the response is. I don’t want too many days to go by without a zypper dup.
I might try that.
Interesting, though. This machine has 64GB of RAM. This machine (and the other desktop and two TW laptops) are personal machines, not server-side business-oriented machines, so there’s no hard-core, RAM-heavy system usage. What I’ll do shortly is re-run the btrfs check and monitor RAM usage. The btrfs check only took seconds to run. I’ve often considered bumping the RAM to 128GB on the two machines, but that would be way overkill!
This machine has 64GB of RAM.
Well, if the metadata is corrupted, we cannot exclude that it simply miscalculates the needed memory.
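If you do re-run it, GNU time can capture the check’s actual peak memory use (a sketch; device name assumed, filesystem unmounted):
/usr/bin/time -v btrfs check --readonly /dev/nvme1n1p3 2>&1 | grep 'Maximum resident set size'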
==== Almost-Concluding Thoughts ====
So, I may post to the Btrfs mailing list and see what kind of response I get.
I’m still 50 / 50 on moving the root partition away from BTRFS to another filesystem that’s long-standing, trustworthy, and reliable, with proper tools for recovering from filesystem issues.
I’ve had one other instance of a BTRFS / mounting read-only issue, probably a year+ ago. Unfortunately, I didn’t document what the issue was or the fix, but I was able to “fix” the problem, mostly out of frustration - probably a “luck of the draw”, as they say. I may have screenshots of it back in Google Photos, but I’m not gonna search now.
Here’s the MAIN reason why I’m considering switching away from BTRFS: has anyone read the documentation for the btrfs check --repair option?
It’s the first option under the category labeled Dangerous Options.
And even the SUSE SLES documentation has the statement:
WARNING: Using ‘--repair’ can further damage a filesystem instead of helping if it can’t fix your particular issue.
Anyway, I’m prepared to switch the filesystem type, if I decide. What I suspect I will do before then is to run some of the btrfs options provided in the SUSE SLES documentation, i.e., probably a repair, knowing that it most likely will not work.
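For reference, the usual escalation in those guides runs roughly from least to most invasive, something like this (a sketch from memory, not the SLES text verbatim; device and mount point assumed, filesystem otherwise unmounted; the rescue= mount option needs a newer kernel):
mount -o ro,rescue=usebackuproot /dev/nvme1n1p3 /mnt   # try the backup tree roots first
btrfs rescue zero-log /dev/nvme1n1p3                   # clear a corrupted log tree
btrfs rescue super-recover /dev/nvme1n1p3              # restore the superblock from its copies
btrfs check --repair /dev/nvme1n1p3                    # last resort, only on an unmounted filesystem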
I’ll report back !
Okay, I just now ran btrfs check two ways - here’s with lowmem set (too much to paste in here directly):
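(The two invocations were presumably along these lines, with output captured to files; device name assumed:)
btrfs check /dev/nvme1n1p3 > check-default.txt 2>&1
btrfs check --mode=lowmem /dev/nvme1n1p3 > check-lowmem.txt 2>&1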
What I suspect I will do before then is to run some of the btrfs options provided in the SUSE SLES documentation, i.e., probably a repair, knowing that it most likely will not work.
At a pre-corona openSUSE Conference (can’t remember which one, and I can’t find the session I thought was relevant – I checked the 2016, 2017 and 2019 conferences; I wasn’t in Prague for the 2018 conference), one of the (well-respected) presenters mentioned during a session in the main hall of the Z-Bau that the Btrfs repair tools are reasonably reliable and, if you absolutely have to use them, about 99 % of the time you’ll be spared a re-installation …
Thanks for the positive encouragement !!
I am about to boot into that primary TW install and gather some info the BTRFS mailing list requires for submitting problems - afterwards, I’ll boot back here to this secondary install to do the submission and then run the other checks.
In the meantime, here’s the output from smartctl:
ren:~ # smartctl -A /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.12-1-default] (SUSE RPM)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 9,031,450 [4.62 TB]
Data Units Written: 37,105,670 [18.9 TB]
Host Read Commands: 80,316,765
Host Write Commands: 196,330,891
Controller Busy Time: 1,064
Power Cycles: 635
Power On Hours: 1,094
Unsafe Shutdowns: 171
Media and Data Integrity Errors: 0
Error Information Log Entries: 1,851
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 37 Celsius
Temperature Sensor 2: 40 Celsius
ren :~ #
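For a fuller picture, smartctl can also dump the NVMe error log entries and the complete device report:
smartctl -l error /dev/nvme0n1   # decode the Error Information Log counted above
smartctl -x /dev/nvme0n1         # everything smartctl knows about the device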
You’ll need to check how often “btrfs trim” has been executed on that drive –
Hmmmm. I’ve done some searching of the BTRFS docs for how often “btrfs trim” has been executed, but haven’t come up with anything.
I do see that trim/discard is an option when mounting the filesystem, and I’ve read it’s usually the default as of 6.x … mine is btrfs --version == btrfs-progs v6.1.3 … I should check the mount parameters for the TW installation.
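Checking whether discard is actually set on the running mount is a one-liner (mount point assumed):
findmnt -no OPTIONS / | tr ',' '\n' | grep -i discard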
I’ve done some searching of the BTRFS docs for how often “btrfs trim” has been executed, but haven’t come up with anything.
You need to check “systemctl status btrfs-trim.timer” to see how often it should be executed on your system.
Here you go - keep in mind, even though I’m booted into that TW install at the moment, root is mounted read-only (but as my regular user, I can run startx for a KDE session).
ren # systemctl status btrfs-trim.timer
btrfs-trim.timer - Discard unused blocks on a mounted filesystem
Loaded: loaded (/usr/lib/systemd/system/btrfs-trim.timer; disabled; preset: enabled)
Active: inactive (dead)
Trigger: n/a
Triggers: * btrfs-trim.service
Docs: man:fstrim
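A quick way to see when any trim timer last fired and when it is next due:
systemctl list-timers --all | grep -i trim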
[quote="dcurtisfra, post:17, topic:165861, full:true"]
You can check the systemd Journal to see when in the past the Btrfs trim procedure has been executed.
Here’s the output I get:
ren:~ # journalctl | grep trim > journalctl-trim.txt
Mar 28 21:53:23 ren fstrim[12850]: /boot/efi: 494.5 MiB (518529024 bytes) trimmed on /dev/nvme0n1p1
Mar 28 21:53:23 ren fstrim[12850]: /home: 211.7 GiB (227299934208 bytes) trimmed on /dev/nvme0n1p4
Mar 28 21:53:23 ren fstrim[12850]: /: 6.9 GiB (7375536128 bytes) trimmed on /dev/nvme0n1p3
Mar 28 21:53:23 ren systemd[1]: fstrim.service: Deactivated successfully.
Mar 28 21:53:23 ren systemd[1]: fstrim.service: Consumed 1.192s CPU time.
Mar 28 22:00:11 ren systemd[1]: fstrim.timer: Deactivated successfully.
Mar 29 22:28:58 ren systemd[1]: fstrim.timer: Deactivated successfully.
Mar 30 22:46:58 ren systemd[1]: fstrim.timer: Deactivated successfully.
Mar 31 13:32:18 ren systemd[1]: fstrim.timer: Deactivated successfully.
Mar 31 23:35:22 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 01 21:29:08 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 02 22:09:35 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 03 20:45:24 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 03 20:47:07 ren fstrim[4748]: /boot/efi: 494.5 MiB (518529024 bytes) trimmed on /dev/nvme1n1p1
Apr 03 20:47:07 ren fstrim[4748]: /home: 211.6 GiB (227230113792 bytes) trimmed on /dev/nvme1n1p4
Apr 03 20:47:07 ren fstrim[4748]: /: 11 GiB (11859058688 bytes) trimmed on /dev/nvme1n1p3
Apr 03 20:47:07 ren systemd[1]: fstrim.service: Deactivated successfully.
Apr 03 20:47:07 ren systemd[1]: fstrim.service: Consumed 1.350s CPU time.
Apr 03 20:47:57 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 04 11:58:58 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 04 22:41:30 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 05 21:32:45 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 06 20:48:09 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 07 20:26:42 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 10 21:06:27 ren systemd[1]: fstrim.service: Main process exited, code=killed, status=15/TERM
Apr 10 21:06:27 ren systemd[1]: fstrim.service: Failed with result 'signal'.
Apr 10 21:06:27 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 10 21:08:19 ren fstrim[7333]: /boot/efi: 494.5 MiB (518529024 bytes) trimmed on /dev/nvme1n1p1
Apr 10 21:08:19 ren fstrim[7333]: /home: 211.6 GiB (227181912064 bytes) trimmed on /dev/nvme1n1p4
Apr 10 21:08:19 ren fstrim[7333]: /: 12.5 GiB (13399752704 bytes) trimmed on /dev/nvme1n1p3
Apr 10 21:08:19 ren systemd[1]: fstrim.service: Deactivated successfully.
Apr 10 21:08:20 ren systemd[1]: fstrim.service: Consumed 1.699s CPU time.
Apr 10 21:47:40 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 11 21:38:22 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 12 18:57:29 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 13 23:13:03 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 14 22:25:35 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 15 23:17:21 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 16 21:21:06 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 17 10:59:56 ren fstrim[7283]: /boot/efi: 494.5 MiB (518529024 bytes) trimmed on /dev/nvme1n1p1
Apr 17 10:59:56 ren fstrim[7283]: /home: 211.4 GiB (226985738240 bytes) trimmed on /dev/nvme1n1p4
Apr 17 10:59:56 ren fstrim[7283]: /: 13 GiB (13934247936 bytes) trimmed on /dev/nvme1n1p3
Apr 17 10:59:56 ren systemd[1]: fstrim.service: Deactivated successfully.
Apr 17 10:59:56 ren systemd[1]: fstrim.service: Consumed 1.526s CPU time.
Apr 17 11:22:40 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 17 22:49:33 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 18 12:20:02 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 19 23:49:33 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 20 21:26:51 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 21 09:52:38 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 21 10:19:50 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 21 10:21:31 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 22 00:15:39 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 22 21:10:24 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 22 21:12:03 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 23 17:19:08 ren systemd[1]: fstrim.timer: Deactivated successfully.
Apr 23 21:42:45 ren systemd[1]: fstrim.timer: Deactivated successfully.
== end ==
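(A tighter query that lists only the actual trim runs, rather than every timer deactivation, would be something like:)
journalctl -u fstrim.service -u btrfs-trim.service --no-pager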
I am HAPPY to report that I ran the steps, one by one, found here:
repair broken btrfs
… which was suggested by @awerlang.
I also toggled over to this webpage to compare the steps (SLES documentation):
How to recover from BTRFS errors
I am now booted into the primary TW installation - root (BTRFS filesystem) is now mounted r-w, no more read-only mount! The KDE Plasma login screen showed up as usual, I logged in, and here I am.
I did run dmesg just after login to check the boot-up results, and there are a couple of niggles:
[ 9.597221] BTRFS warning (device nvme1n1p3): checksum verify failed on logical 18335219712 mirror 1 wanted 0x29262f58 found 0x914d513d level 1
[ 9.597235] BTRFS warning (device nvme1n1p3): error accounting new delayed refs extent (err code: -5), quota inconsistent
...
... and ...
[ 32.462569] systemd-journald[701]: /var/log/journal/a9637f095381461d9ace3985c0ae5331/user-1000.journal: Journal file corrupted, rotating.
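Two follow-ups that may be worth running now that the filesystem is read-write again (a sketch; mount point assumed): a scrub to re-verify data checksums, and systemd’s own journal verifier (the corrupt journal file was already rotated out automatically):
btrfs scrub start -Bd /   # foreground scrub with per-device stats
journalctl --verify       # check the remaining journal files for consistency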
I did redirect the output of the --repair so we could see all the fixes, but unfortunately, when I later wanted to append more info onto that file, I used the redirect > instead of the append >>. Oh well.
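For next time, piping through tee sidesteps that choice entirely: it displays the output and appends it to the file at once (device and file name assumed):
btrfs check --repair /dev/nvme1n1p3 2>&1 | tee -a repair-log.txt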
I am going to boot out of here to the secondary and run a check, then I’ll boot back to this primary and run a zypper dup.
So, a BIG THANKS to @awerlang and @arvidjaar and @dcurtisfra !!
Weird – here on Leap 15.4 –
> grep -Ri 'btrfs-trim' /usr/lib/systemd/*
/usr/lib/systemd/system/btrfs-scrub.service:After=fstrim.service btrfs-trim.service
/usr/lib/systemd/system/btrfs-defrag.service:After=fstrim.service btrfs-trim.service btrfs-scrub.service
/usr/lib/systemd/system/btrfs-balance.service:After=fstrim.service btrfs-trim.service btrfs-scrub.service
/usr/lib/systemd/system/btrfs-trim.service:ExecStart=/usr/share/btrfsmaintenance/btrfs-trim.sh
/usr/lib/systemd/system-preset/95-default-SUSE.preset:enable btrfs-trim.timer
>
> find /usr/lib/systemd/ -iname '*btrfs-trim*'
/usr/lib/systemd/system/btrfs-trim.timer
/usr/lib/systemd/system/btrfs-trim.service
>
AFAICS, the “btrfs-trim.timer” should be, by default, “enabled” and yet, AFAICS, there ain’t nothing – no link in a ???.wants/ directory – calling either the systemd service or the systemd timer …
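One way to check whether the unit actually carries an [Install] section, and to create the missing .wants/ link if it does:
systemctl cat btrfs-trim.timer | grep -A2 '\[Install\]'
systemctl enable --now btrfs-trim.timer   # creates the symlink only if an [Install] section exists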
Might read thru this thread (check for @karlmistelberger comments):
Hmm. btrfs-(balance, defrag, trim, scrub). They are both disabled and inactive. And cannot be enabled or made active. This service cannot be enabled/disabled because it has no "install" section in the description file. Possibly this is why I created a manual method; I do not recall how it came about as it was 2 - 3 years ago when btrfs was just going public and SSDs were no longer expensive toys. I do recall being deeply annoyed by having snapper on by default, and it creating dozens of Time…
Might read thru this thread
Yes, I know but –
Yes, this concept is not easy to understand – only software engineers have a reasonable chance of understanding it without having to research it too much …
And, software engineers are mostly, me included, absolutely out of their minds – you may even class us as being crazy …
I completely understand! I’m also a software engineer going back 30+ years, but I retired a little over 2 years ago, so my mind is getting mushy. (I’m also a published author of a Linux book, co-author of 2 C++ language books, and technical editor for 2 books, on UML and the C language, all back around the year-2000 timeframe.)