Apparent BTRFS File System Error

Apparently my NVME SSD, which holds the OS and /home, has developed a bad sector, which is not surprising as it was purchased in 2019. For the last several days the system has failed to boot normally: it drops into emergency mode, but does allow me to continue booting with CTRL-D. Once booted, everything appears to function normally.

Using journalctl -xb from the command line before booting showed me some BTRFS file errors. I asked Llama 3.3 to explain the error messages to me. This is the chat I had with Llama:

###### User
I am running openSUSE Tumbleweed Linux operating system on an nvme SSD drive 
using the BTRFS file system. On booting the system dumps me into emergency 
mode. Examining the system journal, I see a BTRFS error on that device which says 
"tree first key mismatch detected" and provides the bytenr, the parent_transid, the 
key expected and the key actually seen. It then says "error loading props for ino" 
with a number, which I assume is the inode number, and provides an error code -5. 
It then says "Error removing orphan entry, stopping orphan cleanup." It then says 
"could not do orphan cleanup -22". Can you explain what this error message 
indicates and what steps I might take to have the BTRFS file system repaired?


###### llama-3.3-70b-versatile
The error message indicates a corruption in the BTRFS file system. The "tree first key 
mismatch" error suggests that the file system has detected a discrepancy between 
the expected and actual key values in the BTRFS tree. The subsequent errors, 
including "error loading props for ino" with error code -5 (which corresponds to EIO, 
Input/Output Error), and the failure to remove an orphan entry, indicate that the file 
system is unable to repair itself or perform necessary maintenance tasks.


To repair the BTRFS file system, you can try the following steps:

1. Boot the system with a live Linux media (e.g., openSUSE installation media) and mount the affected BTRFS device in read-only mode.
2. Run the command `btrfs check --readonly` on the mounted device to assess the extent of the corruption.
3. If the corruption is repairable, run `btrfs check --repair` on the mounted device. Note that this may cause data loss, so proceed with caution.
4. If the repair is successful, run `btrfs balance start` to rebalance the file system.
5. If the repair fails or the corruption is severe, you may need to reformat the device and restore data from backups.

Before attempting any repairs, ensure that you have backups of critical data to prevent potential loss. Additionally, consider running `btrfs scrub` regularly to detect and correct errors in the file system.

So my question is: I have read that using “btrfs check --repair” is too dangerous to actually use “without help” - without specifying where one finds “help” or under what conditions it is safe to use.

So should I use that option or not?

I’ve purchased a new NVME SSD which I will receive Monday. I also purchased an NVME external docking station with cloning function, and I intend to clone the existing OS SSD over to the new one.

My second question is: If I clone the existing SSD over to the new one, is that going to bring over the same bad block in the file system and thus continue the file system errors? I’m inclined to assume so. In any event, if cloned and the file errors remain in the cloned version, I assume I’ll have to either repair the file system with “--repair” or reinstall. Naturally I’d like to avoid a reinstall.

Show full dmesg output when error happens and full btrfs check output on the unmounted filesystem.

OK, so I ran dmesg and copied the entire output to a text file. I then ran SystemRescueCD from a flash drive, mounted the nvme0n1p2 partition, ran btrfs check --readonly and saved the output to a text file on the main system.

The btrfs check command also output a ton of error messages - none of which were saved to the text file capturing the output. I don’t know why that was, but there’s basically nothing I can do about it, I guess.

I also ran Gparted from the flash drive and did a file system check with that. It also put out a ton of errors - and also can save the details to a text file - so basically its output is the same as the btrfs check (which I assume is what Gparted runs.)

Here are the relevant parts of the dmesg output:

[    4.743551] [    T914] BTRFS: device fsid 2ba0e812-519e-434b-8cc4-58590b25cec1 devid 1 transid 862029 /dev/nvme0n1p2 (259:2) scanned by mount (914)
[    4.744626] [    T914] BTRFS info (device nvme0n1p2): first mount of filesystem 2ba0e812-519e-434b-8cc4-58590b25cec1
[    4.744644] [    T914] BTRFS info (device nvme0n1p2): using crc32c (crc32c-intel) checksum algorithm
[    4.744648] [    T914] BTRFS info (device nvme0n1p2): using free-space-tree
[    4.746515] [    T914] BTRFS info (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

[    6.403172] [    T240] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[    6.403301] [    T240] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[    6.403310] [   T1505] BTRFS error (device nvme0n1p2): error loading props for ino 12302195 (root 262): -5
[    6.403864] [    T229] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[    6.404007] [    T240] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[    6.404027] [   T1505] BTRFS error (device nvme0n1p2): Error removing orphan entry, stopping orphan cleanup
[    6.404032] [   T1505] BTRFS error (device nvme0n1p2): could not do orphan cleanup -22

[  492.161868] [   T9140] BTRFS info (device nvme0n1p2): scrub: started on devid 1
[  557.267513] [   T9140] BTRFS info (device nvme0n1p2): scrub: finished on devid 1 with status: 0

[  602.830390] [    T251] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[  602.830516] [    T251] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[  602.830545] [    T305] BTRFS error (device nvme0n1p2): qgroup scan failed with -5
[ 4203.386744] [  T29366] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[ 4203.386884] [  T29366] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[ 4203.386906] [  T38695] BTRFS error (device nvme0n1p2): qgroup scan failed with -5

[ 7803.716905] [  T37057] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[ 7803.717035] [  T37057] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[ 7803.717059] [  T68419] BTRFS error (device nvme0n1p2): qgroup scan failed with -5
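
Every mismatch line in the excerpt above names the same bytenr, which suggests one damaged tree block rather than damage scattered across the file system. A minimal sketch for confirming that from a log, assuming the message format shown above; `mismatch_bytenrs` is a hypothetical helper name, and the sample lines are copied from the excerpt:

```shell
# Print each distinct bytenr reported in "tree first key mismatch" lines.
# Reads log text on stdin; the helper name is illustrative only.
mismatch_bytenrs() {
    sed -n 's/.*first key mismatch detected, bytenr=\([0-9][0-9]*\).*/\1/p' | sort -u
}

# Three lines taken from the dmesg excerpt above.
sample='[    6.403301] [    T240] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[  602.830390] [    T251] BTRFS error (device nvme0n1p2): tree first key mismatch detected, bytenr=172970573824 parent_transid=849601 key expected=(12302195,108,4398046511104) has=(12302195,108,0)
[  602.830545] [    T305] BTRFS error (device nvme0n1p2): qgroup scan failed with -5'

blocks=$(printf '%s\n' "$sample" | mismatch_bytenrs)
echo "$blocks"
```

On a live system the same helper could be fed with `dmesg | mismatch_bytenrs` (reading the kernel log may require root).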

Here is the btrfs check:

Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p2
UUID: 2ba0e812-519e-434b-8cc4-58590b25cec1
cache and super generation don't match, space cache will be invalidated
found 133983600640 bytes used, error(s) found
total csum bytes: 96816680
total tree bytes: 1116504064
total fs tree bytes: 823525376
total extent tree bytes: 165298176
btree space waste bytes: 191527380
file data blocks allocated: 187825262592
 referenced 80927428608

Here is a sample of the errors found (I ran the check using --force while mounted in an attempt to get it saved to a text file - it won’t do it. So this output is off the terminal screen.)

root 262 inode 14550131 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921051 namelen 18 name c51d4418670ad3c0_0 filetype 1 errors 4, no inode ref
root 262 inode 14550132 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921052 namelen 18 name 50d344fe669074ed_0 filetype 1 errors 4, no inode ref
root 262 inode 14550134 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36052 namelen 18 name 855b595a7eea0e4f_0 filetype 1 errors 4, no inode ref
root 262 inode 14550135 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921054 namelen 18 name 5c4c04911a5fc63d_0 filetype 1 errors 4, no inode ref
root 262 inode 14550136 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36053 namelen 18 name 3fbd5e1164943239_0 filetype 1 errors 4, no inode ref
root 262 inode 14550137 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36054 namelen 18 name e9f4403d71fd9c22_0 filetype 1 errors 4, no inode ref
root 262 inode 14550138 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36055 namelen 18 name bf1d7868d4baa6f8_0 filetype 1 errors 4, no inode ref
root 262 inode 14550139 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36056 namelen 18 name 747e032746153f19_0 filetype 1 errors 4, no inode ref
root 262 inode 14550142 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921056 namelen 18 name b7bd2b743d708afa_0 filetype 1 errors 4, no inode ref
root 262 inode 14550151 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921057 namelen 18 name e6c098722023e542_0 filetype 1 errors 4, no inode ref
root 262 inode 14550152 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921058 namelen 18 name 85a650f3a6842215_0 filetype 1 errors 4, no inode ref
root 262 inode 14550153 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921059 namelen 18 name 2a95e9391ad94996_0 filetype 1 errors 4, no inode ref
root 262 inode 14550154 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36063 namelen 18 name 80eadafe1f5be86d_0 filetype 1 errors 4, no inode ref
root 262 inode 14550160 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921060 namelen 18 name b87cea53780607d6_0 filetype 1 errors 4, no inode ref
root 262 inode 14550166 errors 2001, no inode item, link count wrong
        unresolved ref dir 48191 index 921061 namelen 18 name 6b2dbea9a965fb53_0 filetype 1 errors 4, no inode ref
root 262 inode 14550167 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36064 namelen 18 name ce0a5b5ac11e14da_0 filetype 1 errors 4, no inode ref
root 262 inode 14550168 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36065 namelen 18 name 90bfaac6d8b80da6_0 filetype 1 errors 4, no inode ref
root 262 inode 14550169 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36066 namelen 18 name 455759dd50e3ed64_0 filetype 1 errors 4, no inode ref
root 262 inode 14550170 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36067 namelen 18 name 2f4e437e96cb375f_0 filetype 1 errors 4, no inode ref
root 262 inode 14550171 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36068 namelen 18 name a54a697c02fa1e4d_0 filetype 1 errors 4, no inode ref
root 262 inode 14550172 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36069 namelen 18 name 7602e5e264b0055b_0 filetype 1 errors 4, no inode ref
root 262 inode 14550173 errors 2001, no inode item, link count wrong
        unresolved ref dir 48193 index 36070 namelen 18 name 8b9bd4e2fec01176_0 filetype 1 errors 4, no inode ref
root 262 inode 14550177 errors 2001, no inode item, link count wrong
        unresolved ref dir 48220 index 4489 namelen 14 name the-real-index filetype 1 errors 4, no inode ref
root 262 inode 14550178 errors 2001, no inode item, link count wrong
        unresolved ref dir 48197 index 11643 namelen 14 name the-real-index filetype 1 errors 4, no inode ref
root 262 inode 14550197 errors 2001, no inode item, link count wrong
        unresolved ref dir 45973 index 27769 namelen 14 name BrowserMetrics filetype 2 errors 4, no inode ref
root 262 inode 14550198 errors 2001, no inode item, link count wrong
        unresolved ref dir 45973 index 27770 namelen 22 name DeferredBrowserMetrics filetype 2 errors 4, no inode ref
root 262 inode 14550199 errors 2001, no inode item, link count wrong
        unresolved ref dir 45973 index 27772 namelen 11 name Local State filetype 1 errors 4, no inode ref
root 262 inode 14550201 errors 2001, no inode item, link count wrong
        unresolved ref dir 45988 index 137775 namelen 11 name Preferences filetype 1 errors 4, no inode ref
root 262 inode 14550217 errors 2001, no inode item, link count wrong
        unresolved ref dir 264 index 1687391 namelen 9 name konsolerc filetype 1 errors 4, no inode ref
root 262 inode 14550218 errors 2001, no inode item, link count wrong
        unresolved ref dir 45988 index 137785 namelen 24 name Network Persistent State filetype 1 errors 4, no inode ref
root 262 inode 14550223 errors 2001, no inode item, link count wrong
        unresolved ref dir 1430 index 41775 namelen 23 name dolphin_dolphin_dolphin filetype 1 errors 4, no inode ref
root 262 inode 14550224 errors 2001, no inode item, link count wrong
        unresolved ref dir 45988 index 137789 namelen 17 name TransportSecurity filetype 1 errors 4, no inode ref
root 262 inode 14550226 errors 2001, no inode item, link count wrong
        unresolved ref dir 264 index 1687397 namelen 12 name kalendaracrc filetype 1 errors 4, no inode ref
root 262 inode 14550228 errors 2001, no inode item, link count wrong
        unresolved ref dir 267 index 433023 namelen 18 name recently-used.xbel filetype 1 errors 4, no inode ref
root 262 inode 14550230 errors 2001, no inode item, link count wrong
        unresolved ref dir 264 index 1687400 namelen 14 name QtProject.conf filetype 1 errors 4, no inode ref
root 262 inode 14550233 errors 2001, no inode item, link count wrong
        unresolved ref dir 264 index 1687404 namelen 10 name kdeglobals filetype 1 errors 4, no inode ref
ERROR: errors found in fs roots
found 133998788608 bytes used, error(s) found
total csum bytes: 96826492
total tree bytes: 2307735552
total fs tree bytes: 2014838784
total extent tree bytes: 165216256
btree space waste bytes: 461200401
file data blocks allocated: 370191470592
 referenced 246101389312

Note that before the other errors occurred the first time I looked at the journal from the command line, this message appeared:

mount: /home: wrong fs type, bad option, bad superblock on /dev/nvme0n1p2, missing codepage or helper program, or other errors.

After the btrfs errors appeared in the journal, this message appeared:

systemd: Failed to mount /home
systemd: Dependency failed for local File Systems

Those messages say /home was not mounted, but when I go ahead and CTRL-D to finish booting and look at the output of the mount command, /home IS mounted.

It’s getting to the point where I think I’ll forego cloning the drive, due to the uncertainties about whether I’ll be carrying over bad data from /home, and just install a new SSD and reinstall the entire operating system.

Fortunately I don’t have any data of importance on /home and it’s backed up nightly. All my other data is stored on separate hard drives. However, since I can’t copy any of the /home config files from the backup, due to not knowing which ones might be corrupt, I’ll have to reconfigure everything on the system. That will be a major PITA.

Of course you can. You can redirect both stdout and stderr. Like

btrfs check /dev/sda > out 2>&1

Or you can run it under script command that captures the complete output. Like

script
btrfs check /dev/sda
exit

and the full output is in typescript file. See man script.
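
The order of the redirections matters here: `2>&1` has to come after `> out`, otherwise anything written to stderr never reaches the file. A small sketch of the difference, using a hypothetical stand-in function `noisy` rather than btrfs itself:

```shell
# Stand-in for a tool that reports on both streams (hypothetical, not btrfs).
noisy() {
    echo "summary line"        # written to stdout
    echo "error line" >&2      # written to stderr
}

out=$(mktemp)

# stdout only: the error line never reaches the file.
noisy > "$out" 2>/dev/null
only_stdout=$(cat "$out")

# Correct order: stdout goes to the file, then stderr is pointed at the same place.
noisy > "$out" 2>&1
both=$(cat "$out")

# Wrong order: stderr is attached to the OLD stdout before stdout is moved,
# so the error line escapes the file (here it is caught by the $(...) instead).
leaked=$(noisy 2>&1 > "$out")
in_file=$(cat "$out")

rm -f "$out"
printf 'stdout only: %s\nboth: %s\nleaked: %s\n' "$only_stdout" "$both" "$leaked"
```

If a capture file contains only the summary, comparing the command actually typed against the correct-order form above is a cheap first check.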

4398046511104 == 0x400 0000 0000

It sounds like a bit flip. Bit flips are most commonly associated with bad RAM. Before doing anything you should first test your RAM running something like memtest for a prolonged period, 24 hours is recommended.
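
The single-bit claim can be checked with shell arithmetic. A minimal sketch, assuming a shell with 64-bit arithmetic (bash and dash on 64-bit Linux both qualify); `differing_bits` is a hypothetical helper name, and the two values are the expected and actual key offsets from the dmesg output:

```shell
# Print the positions (0 = least significant) of the bits that differ
# between two non-negative integers.
differing_bits() {
    d=$(( $1 ^ $2 ))
    pos=0
    result=""
    while [ "$d" -ne 0 ]; do
        if [ $(( d & 1 )) -eq 1 ]; then
            result="$result$pos "
        fi
        d=$(( d >> 1 ))
        pos=$(( pos + 1 ))
    done
    echo "${result% }"
}

# Expected offset 4398046511104 versus the offset actually found, 0.
printf '0x%x\n' "$(( 4398046511104 ^ 0 ))"
differing_bits 4398046511104 0
```

The two values differ in exactly one bit, bit 42, which is what a single flipped bit would look like.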

Can you mount your system using

mount -r -o rescue=all

?? Or access data using btrfs restore? If yes, you should really consider saving data first.

Whether this problem can be corrected by btrfs check or manually I cannot say. You should address this question to btrfs mailing list ( Btrfs mailing list - btrfs Wiki). You will need the full dmesg output and btrfs check output (you may start with posting the same information first but be prepared to provide them on demand).

Keep in mind that now is vacation time so do not expect immediate answer.

Recover from damage beyond built-in capabilities: btrfs check --repair has no ability to handle this particular and complex corruption.

I did that. Only the basic btrfs check info came out, not the error messages. I don’t know why.

"Or you can run it under script command": I didn’t know about that command. I’ll remember that next time.

I don’t think it’s a memory issue, but perhaps I’ll run a quick memory test. I think it’s much more likely to be an nvme issue given the age of the drive. I’d been considering how soon I should replace it - now I don’t have the choice.

As for saving data, the only data on /home I’d be interested in saving is config files and things like browser bookmarks. Most of that is already backed up nightly. The problem is that I can’t know a priori which ones are corrupt, so even restoring from the nightly backup might just bring over corrupt files into the new reinstall.

For things like the browser bookmarks, I can export those into the system folders I use for saving stuff like that independent of the nightly backups. That’s on a hard drive, so those aren’t likely to be corrupt if loaded into the browser and exported from there to the hard drive. Then I can import them after the reinstall.

So that just leaves config files - and since I’m going to have to do a full system reinstall, I’ll just have to reconfigure everything else. That’s the main PITA. I think from now on I’ll periodically back up the config files to a hard drive in addition to the nightly automatic backup of /home. Most of that stuff is text files anyway, so copied they should be fine - any corruption would be easily seen.

I’m not going to bother with the btrfs mailing list. I don’t have time to wait as the corruption could spread and I have other things to do.

I back up /home (obvious), /etc (to have something to look at with new installations, not to copy it blindly), /boot (several configs there), /root (home directory of root, having some management things there) and /srv (because I am running a web server).

Just some suggestions.

I have been told that a response from the btrfs mailing list is probably never going to happen. The btrfs mailing list is used for metadata collection now mainly.

Note: May be beneficial to make contact with channel #btrfs on Libera IRC network.
Note: May be beneficial to run memory test for 9 passes total.

Sorry to reply again, but this slipped my mind (it may be of help). I have also been told that using Clonezilla to clone a btrfs filesystem has a known issue. I contacted Clonezilla and got no response (months have gone by).
I had cloned a btrfs filesystem using Clonezilla before knowing this, with a similar experience of btrfs check errors when unmounted, though seemingly not as bad as yours above for some reason.

There seems to be some confusion about the meaning of the word “cloning”.

When you talk about real “cloning” (a bit by bit copy), there are two possibilities in my opinion:

  • there is a hardware problem in which case a block can not be read and the cloning will stop with an error;
  • there is a logical error in the file system and then of course that error will be cloned to the new volume.

When you use a tool that claims it has “knowledge” of Btrfs, then that is not cloning (you need no “knowledge” about what bytes represent when you copy them one by one). In that case it will get stuck on the logical problem and report that, but I do not think it will be able to repair the file system.

Just some thoughts about your “cloning” suggestion, without really understanding what the problem is.

I have no idea where you got this nonsense from. Just browse the mailing list archives.

Open the page you referenced, read the “Obsolete Content” label (consider what that may mean :grinning:). My information is received/gathered from a very active channel on Libera IRC named #btrfs.

I apologize and have come to the conclusion that you are correct as per the following.

https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list.html
The obsolete content message received from url above refers to oldwiki.
New documentation is here now < https://btrfs.readthedocs.io
Here is btrfs mailing list archives < linux-btrfs.vger.kernel.org archive mirror

-Regards

All right, here is the status of my problem:

  1. I ordered a new 2TB NVME SSD.
  2. I also ordered an external SSD docking station with hardware cloning capability.
  3. However, I have decided NOT to clone the suspect SSD because of questions over whether cloning would work or would copy a bad block. Instead, I will just install the new SSD and reinstall the OS.
  4. I decided to run a MemTest86 test overnight, even though I suspect the issue is an old SSD. Note that I have 128GB of RAM, so I expected the test to be long, but it is ridiculously long.

I ran the test overnight for 7-1/2 hours. It was still on pass 1 with 21 out of 48 tests or 43% completed. 20 out of 21 tests passed or 95%.

Where it failed and produced 156 errors was on Test 13, the “Row Hammer” test, which I have been researching today. I am attaching the MemTest86 HTML report for your edification.

The report indicates that the errors occurred up around the 85-87GB mark which would be in the 3rd bank of the 4 32GB Corsair modules I have installed. Total bits in error were 3, with maximum 1 and avg 1. 1 maximum contiguous error.
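
The GB figures follow from the lowest and highest error addresses in the MemTest86 report further down. A rough sketch of the conversion; the `quarter` helper assumes, purely for illustration, that the four 32GB modules occupy consecutive address ranges. That is generally not how a dual-channel board lays out memory (addresses are usually interleaved across channels), so the result hints at a region rather than identifying a particular stick:

```shell
# Convert a physical byte address to MiB (the unit MemTest86 labels "MB").
to_mib() {
    echo $(( $1 >> 20 ))
}

# Naive 1-based index of the 32 GiB slice an address falls in.
# ASSUMPTION: linear, non-interleaved layout (a hypothetical simplification).
quarter() {
    echo $(( ( $1 >> 20 ) / 32768 + 1 ))
}

lowest=0x14F3799564     # "Lowest Error Address" from the report
highest=0x156C7CD3E4    # "Highest Error Address" from the report

echo "lowest:  $(to_mib "$lowest") MiB, slice $(quarter "$lowest")"
echo "highest: $(to_mib "$highest") MiB, slice $(quarter "$highest")"
```

Both addresses land in the third 32 GiB slice under that simplification, which matches the "3rd bank" reading but does not by itself say which physical DIMM is at fault.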

I did some research on the Hammer Test over at Passmark, and apparently this test is the most extreme test. Apparently if you get errors on the first pass, you need to run a second pass. If the second pass is without errors, the program produces a warning only. If the second pass fails, the program just reports the errors, and the RAM should be considered unstable, if not actually bad.

So what I’m going to do is temporarily ignore the possibility of bad RAM and go ahead and install the new SSD and reinstall the OS.

Then I’m going to run a single Row Hammer test (since I know the first 12 tests pass fine) overnight for two passes and see whether the RAM is bad or not. If it is, I’ll have to spend $225 for new RAM because I don’t have time to run multiple tests of four RAM sticks to find the bad one, with each test taking hours to run. This is my main system and it has to work.

I’m doing this approach because of the following considerations:

  1. I have 128GB of RAM, which is a lot for a memory manager to handle.
  2. It’s spread over four modules of 2 channel RAM which means there may be issues other than bad RAM depending on the motherboard.
  3. The Row Hammer test is the most extreme test and if there are only errors in it, it merely means the RAM may be unstable in extreme use.
  4. I’ve been running this system with all new CPU, motherboard, power supply and RAM for only little over a year, without hardware issues, so the system has been fine until now. The ONLY old component is the NVME SSD. I’m assuming that even if there are RAM errors, they’re unlikely to necessarily immediately affect the new SSD. If they do occur or the MemTest Row Hammer test detects them over two or more passes, then I’ll replace the RAM.

Does this approach seem reasonable?

Here is the MemTest86 HTML Report:

EFI Specifications 2.70
System
Manufacturer System manufacturer
Product Name System Product Name
Version System Version
Serial Number System Serial Number
BIOS
Vendor American Megatrends Inc.
Version 5003
Release Date 10/07/2023
Baseboard
Manufacturer ASUSTeK COMPUTER INC.
Product Name TUF GAMING X570-PLUS (WI-FI)
Version Rev X.0x
Serial Number 221111986301191
CPU Type AMD Ryzen 9 5950X 16-Core
CPU Clock 2875 MHz [Turbo: 4868.4 MHz]
# Logical Processors 32 (16 enabled for testing)
L1 Cache 32 x 64K (285768 MB/s)
L2 Cache 32 x 512K (114785 MB/s)
L3 Cache 1 x 65536K (29799 MB/s)
Memory 130996M (28461 MB/s)
RAM Configuration DDR4 2134MT/s / x2 Channel / 15-15-15-36 / 1.200V
Number of RAM SPDs detected 4
SPD #0 32GB DDR4 XMP 2Rx8 PC4-28700
Vendor Part Info Corsair / CMK64GX4M2D3600C18
JEDEC Profile 2134MT/s 15-15-15-36 1.2V
XMP Profile 1 3596MT/s 18-22-22-42 1.35V
SPD #1 32GB DDR4 XMP 2Rx8 PC4-28700
Vendor Part Info Corsair / CMK64GX4M2D3600C18
JEDEC Profile 2134MT/s 15-15-15-36 1.2V
XMP Profile 1 3596MT/s 18-22-22-42 1.35V
SPD #2 32GB DDR4 XMP 2Rx8 PC4-28700
Vendor Part Info Corsair / CMK64GX4M2D3600C18
JEDEC Profile 2134MT/s 15-15-15-36 1.2V
XMP Profile 1 3596MT/s 18-22-22-42 1.35V
SPD #3 32GB DDR4 XMP 2Rx8 PC4-28700
Vendor Part Info Corsair / CMK64GX4M2D3600C18
JEDEC Profile 2134MT/s 15-15-15-36 1.2V
XMP Profile 1 3596MT/s 18-22-22-42 1.35V
Number of RAM slots 4
Number of RAM modules 4
DIMM B1 32GB DDR4 2Rx8 PC4-17000
Vendor Part Info Corsair / CMK64GX4M2D3600C18 / 00000000
SMBIOS Profile 2133MT/s 1.2V
DIMM B2 32GB DDR4 2Rx8 PC4-17000
Vendor Part Info Corsair / CMK64GX4M2D3600C18 / 00000000
SMBIOS Profile 2133MT/s 1.2V
DIMM A1 32GB DDR4 2Rx8 PC4-17000
Vendor Part Info Corsair / CMK64GX4M2D3600C18 / 00000000
SMBIOS Profile 2133MT/s 1.2V
DIMM A2 32GB DDR4 2Rx8 PC4-17000
Vendor Part Info Corsair / CMK64GX4M2D3600C18 / 00000000
SMBIOS Profile 2133MT/s 1.2V

Result summary

Test Start Time 2024-12-31 09:25:19
Elapsed Time 7:16:28
Memory Range Tested 0x0 - 2030000000 (131840MB)
CPU Selection Mode Parallel (All CPUs)
CPU Temperature Min/Max/Ave 44C/77C/61C
Lowest memory speed 2134 MT/s (15-15-15-36)
Highest memory speed 2134 MT/s (15-15-15-36)
# Tests Completed 21/48 (43%)
# Tests Passed 20/21 (95%)
Lowest Error Address 0x14F3799564 (85815MB)
Highest Error Address 0x156C7CD3E4 (87751MB)
Bits in Error Mask 00000000000E0000
Bits in Error 3
Max Contiguous Errors 1
CPUs that detected memory errors { 0 }

Bits in Error

bit 63  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x x x . . . . . . . . . . . . . . . . .  bit 0
Test # Tests Passed Errors
Test 0 [Address test, walking ones, 1 CPU] 2/2 (100%) 0
Test 1 [Address test, own address, 1 CPU] 2/2 (100%) 0
Test 2 [Address test, own address] 2/2 (100%) 0
Test 3 [Moving inversions, ones & zeroes] 2/2 (100%) 0
Test 4 [Moving inversions, 8-bit pattern] 2/2 (100%) 0
Test 5 [Moving inversions, random pattern] 2/2 (100%) 0
Test 6 [Block move, 64-byte blocks] 2/2 (100%) 0
Test 7 [Moving inversions, 32-bit pattern] 2/2 (100%) 0
Test 8 [Random number sequence] 2/2 (100%) 0
Test 9 [Modulo 20, random pattern] 1/1 (100%) 0
Test 10 [Bit fade test, 2 patterns, 1 CPU] 1/1 (100%) 0
Test 13 [Hammer test] 0/1 (0%) 156

Last 10 Errors
2024-12-31 12:32:28 - [Data Error] Test: 13, CPU: 0, Address: 15638CD2E4, Expected: B49F2A68, Actual: B49B2A68
2024-12-31 12:32:28 - [Data Error] Test: 13, CPU: 0, Address: 15360A12E4, Expected: 0C76F80A, Actual: 0C72F80A
2024-12-31 12:32:28 - [Data Error] Test: 13, CPU: 0, Address: 1534CCD2E4, Expected: 065AAB93, Actual: 065EAB93
2024-12-31 12:29:53 - [Data Error] Test: 13, CPU: 0, Address: 15588E52E4, Expected: B6C37255, Actual: B6C77255
2024-12-31 12:29:53 - [Data Error] Test: 13, CPU: 0, Address: 154BA452E4, Expected: 862CD37C, Actual: 8628D37C
2024-12-31 12:29:53 - [Data Error] Test: 13, CPU: 0, Address: 154A4792E4, Expected: B07B237C, Actual: B07F237C
2024-12-31 12:29:53 - [Data Error] Test: 13, CPU: 0, Address: 15489452E4, Expected: 1F0DEA6D, Actual: 1F09EA6D
2024-12-31 12:29:53 - [Data Error] Test: 13, CPU: 0, Address: 1539EF13E4, Expected: 8B457EC5, Actual: 8B417EC5
2024-12-31 12:29:52 - [Data Error] Test: 13, CPU: 0, Address: 1530F793E4, Expected: 750E96B1, Actual: 750A96B1
2024-12-31 12:29:52 - [Data Error] Test: 13, CPU: 0, Address: 15300453E4, Expected: C95106B1, Actual: C95506B1
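
XORing the Expected and Actual columns of the ten errors above shows which bit flipped in each case. A short sketch, with the pairs copied from the report (assuming a POSIX shell with `sort` available):

```shell
# Expected:Actual pairs from the "Last 10 Errors" list.
pairs="B49F2A68:B49B2A68 0C76F80A:0C72F80A 065AAB93:065EAB93 B6C37255:B6C77255
862CD37C:8628D37C B07B237C:B07F237C 1F0DEA6D:1F09EA6D 8B457EC5:8B417EC5
750E96B1:750A96B1 C95106B1:C95506B1"

masks=""
for p in $pairs; do
    exp=${p%%:*}     # text before the colon
    act=${p##*:}     # text after the colon
    mask=$(printf '%08X' "$(( 0x$exp ^ 0x$act ))")
    echo "$exp vs $act -> $mask"
    masks="$masks$mask
"
done

# Collapse to the distinct masks seen across all ten errors.
unique=$(printf '%s' "$masks" | sort -u)
echo "distinct masks: $unique"
```

All ten pairs differ by the same mask, 00040000 (bit 18), which falls inside the 00000000000E0000 error mask reported in the summary. The same bit failing at many different addresses is worth keeping in mind when deciding what to replace.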

Well, I went ahead and installed a new 2TB NVME SSD, and reinstalled the OS and apps on Tuesday.

Last night I was running my FreeFileSync backups and copying some images from Google Image Search on Chrome when Chrome suddenly crashed. No surprise there, it does that. However, on checking the FFS backup I saw that it had crashed while trying to write to its config file under .var/app (it’s a Flatpak) in my home directory. Worse, it reported “readonly file system”. That means a BTRFS file error.

So I ran MemTest86 overnight without testing the Row Hammer test and it found five errors almost immediately in pass 1.

So now I’m assuming that I do indeed have bad RAM or some other issue relative to the workings between the memory controller, the motherboard and the RAM.

I should have expected this since 128GB of RAM is likely not as well supported as lower amounts of RAM. I initially bought 64GB of RAM, but decided to expand it since I wanted to have the potential to run large numbers of virtual machines in a computer security home lab. So I ordered another set of the same Corsair Vengeance RAM - which probably means the timings between the two sets are off.

So now I’m thinking to cut the size of the RAM in half, if I can figure out what is considered the most compatible with the ASUS X570 Tuf Gaming WIFI board and 5950X CPU.

Well, I just did some research and RAM selection is a mess. No one agrees on anything - which brand, which speed, which anything is best. It’s obviously a crap shoot.

As I was typing this, Chrome crashed again and would not restart. Other programs could launch, however. I loaded up my Systemrescue USB and ran btrfs check on the NVME - yup, errors.

So now my system is completely unstable. Everything in the BIOS has been set on Auto since I installed the system last year. No OC, no nothing. So it looks like I’ll have to buy new RAM blind.

I’ll start by removing the second set of RAM sticks I installed when I upgraded from 64GB to 128GB, since the errors appear to occur in the upper banks. Hopefully that will stabilize the system until new RAM arrives.

And now I have to reinstall the whole OS all over again after I remove the RAM.

A total disaster.

Hi, you're doing some good work there (sorry about the problems), and your report is fairly extensive. I’m thinking about running a memtest on a machine here also. Anyway, I saw this today on the https://www.memtest.org/ page:

I only have some bits wrong. Any solution?

Yes. Both Linux and Windows offer a way to exclude small bad memory areas. An additional Memtest86+ feature is planned to easily generate the required patterns for Windows.

If you exclude some bits (I have never done this), can the bad memory bits on the DIMM keep spreading, though?

Thanks for your update.

Well, now I have another problem. First let me explain what I did today.

  1. I pulled the top two RAM sticks (A1 and B1) leaving A2 and B2 per my motherboard manual. Since the Memtest86 tests flagged the upper range of memory, I assume that was the problem.
  2. I ran a quick 2 passes of all the MemTest86 tests EXCEPT the Row Hammer test. Last time I ran the test it found 5 errors quickly before it even got to the Row Hammer. This time it found no errors on 2 passes. So I think the problem was confined to the upper two sticks of RAM. They were added just this past summer IIRC.

It occurs to me that the reason I was getting the crashes is from the browsers, Firefox and Chrome. Firefox was crashing tabs all the time so I switched to Chrome. Then Chrome started crashing and then I got the read only errors. We all know the browsers suck up memory like a sieve so I suspect they were the most likely culprits to go up into high memory and run into the bad blocks. The more tabs I had open, the more likely to hit a memory issue, while saving stuff into the BTRFS /home directly (like the cache) and the more likely to trip an error that put BTRFS into read only mode.

  3. I reinstalled the OS just now. My new problem is this: 3 of the 5 drives won’t mount. The other two mount fine.

During the install, the drives were marked not to be formatted and the partition layout was identical to the last time. Nothing was changed. They were left to be unmounted until I updated the /etc/fstab manually.

I checked the UUIDs on all of them from a list I printed out after the last hard drive I installed. The UUIDs all check.
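
Checking against a printed list can also be done mechanically. A minimal sketch, assuming the drives are referenced by `UUID=` in the fstab; `fstab_uuids` is a hypothetical helper name, and the sample entries below are made up for illustration, not taken from the actual fstab:

```shell
# List "UUID mountpoint" pairs from an fstab-format file.
# Comment lines and entries not written as UUID=... are skipped.
fstab_uuids() {
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1, $2 }' "$1"
}

# Hypothetical sample file standing in for /etc/fstab.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# made-up entries for illustration only
UUID=11111111-2222-3333-4444-555555555555  /data1  ext4  defaults,nofail  0  2
LABEL=scratch                              /data2  ext4  defaults,nofail  0  2
EOF

found=$(fstab_uuids "$sample")
rm -f "$sample"
echo "$found"
```

On the real system the function would be pointed at /etc/fstab, and each UUID it prints can then be looked up with `blkid -U <uuid>` or `findfs UUID=<uuid>`, which print the matching device or fail if no visible device carries that UUID.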

I ran Gparted from my SystemRescue USB and it checked the file system on one of the drives, which is 8TB. Other than “narrowing the extents” (whatever that means), there were no bad blocks or file errors. The other two drives are 16TB, and the fsck on that Gparted is too old and doesn’t support them, so it won’t check them.

Why would drives that mounted fine up until the new install and which have proper fstab entries with correct UUIDs not mount?

Perhaps check /etc/crypttab contents.