BTRFS critical - corrupt leaf, unexpected item end. Multiple tries at fixing.

My root partition is formatted with btrfs on my SSD, and recently it started giving the message in the title. After looking around, I mistakenly ran btrfs check --repair /dev/sda5 from an Ubuntu live USB, which broke the filesystem completely. So I reinstalled openSUSE Tumbleweed, making sure to reformat the root partition. For what it’s worth, GParted on the Ubuntu live USB says that the device descriptor is 2048 bytes, whilst Linux reports it as 512 bytes. Googling this error didn’t get me anywhere useful, but maybe it is relevant to the problem I’m having.

I reinstalled openSUSE and reformatted the / partition on my SSD 2 days ago, and it is still giving me these error messages:

Output of dmesg | grep BTRFS :

[    2.783614] BTRFS: device fsid fe9d7ed0-5935-4368-8d3f-78afa44986fe devid 1 transid 5586 /dev/sdb5
[    2.785905] BTRFS: device fsid 4fd0fb30-7b5e-48eb-ad9b-bd456cb53da1 devid 1 transid 635 /dev/sda5
[    2.807315] BTRFS info (device sda5): disk space caching is enabled
[    2.807316] BTRFS info (device sda5): has skinny extents
[    2.813008] BTRFS info (device sda5): enabling ssd optimizations
[    3.216370] BTRFS info (device sda5): disk space caching is enabled
[    3.532714] BTRFS info (device sdb5): disk space caching is enabled
[    3.532716] BTRFS info (device sdb5): has skinny extents
[    5.156064] BTRFS critical (device sda5): corrupt leaf: root=1 block=5811896320 slot=3, unexpected item end, have 729628220 expect 16109
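
To look at the block named in that last message without modifying anything, the fields can be pulled out of the log line and handed to a read-only inspection command. The following is a minimal sketch, not a repair procedure; the helper name corrupt_leaf_fields is made up for illustration.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "root block slot" for every btrfs "corrupt leaf" message found.
corrupt_leaf_fields() {
    sed -n 's/.*corrupt leaf: root=\([0-9]*\) block=\([0-9]*\) slot=\([0-9]*\).*/\1 \2 \3/p'
}

# Usage (reading the kernel log may need root on some systems):
#   dmesg | corrupt_leaf_fields
```

For the message above this prints 1 5811896320 3. root=1 is the btrfs root tree, and the block number can be passed to sudo btrfs inspect-internal dump-tree -b 5811896320 /dev/sda5, which only reads from the device.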

Output of sudo btrfs scrub start / :

scrub status for 4fd0fb30-7b5e-48eb-ad9b-bd456cb53da1
        scrub started at Wed Apr 25 14:03:52 2018 and finished after 00:00:21
        total bytes scrubbed: 10.41GiB with 0 errors

Output of sudo smartctl -a /dev/sda :

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.16.2-1-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     SanDisk SDSSDX120GG25
Serial Number:    131086402463
LU WWN Device Id: 5 001b44 990f0bf9f
Firmware Version: R211
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 25 14:05:26 2018 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0021) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   117   050    Pre-fail  Always       -       0/167098116
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   085   085   000    Old_age   Always       -       13454h+29m+12.720s
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2792
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       361
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       6
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   029   044   000    Old_age   Always       -       29 (Min/Max 9/44)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/167098116
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/167098116
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/167098116
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   099   099   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       14550
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       11632
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       11632
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       21378

SMART Error Log not supported

SMART Self-test Log not supported

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


So I’m not sure what exactly is going wrong. I turned off my overclocks for my CPU and RAM as well. Any advice on where to go from here would be helpful :slight_smile:

Running sudo journalctl -f, I’m getting a few messages regarding my hard drive. I’m not sure how relevant they are to the btrfs issue; I’d imagine they aren’t:


smartd[970]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 117
smartd[970]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 28
smartd[970]: Device: /dev/sda [SAT], 2096 Currently unreadable (pending) sectors (changed +8)
smartd[970]: Device: /dev/sda [SAT], 2096 Offline uncorrectable sectors (changed +8)
smartd[970]: Device: /dev/sda [SAT], Failed SMART usage Attribute: 184 End-to-End_Error.
smartd[970]: Device: /dev/sda [SAT], ATA error count increased from 322 to 328

After this, no more meaningful messages come out for a while (no output for a good half an hour).

Since this is an SSD, how have you set up TRIM operations?

And if you haven’t, then ironically, when you formatted your disk you probably aggravated a possible problem by making all the SSD traps unavailable for writing.

TSU

I do remember setting up TRIM on Windows if that helps :stuck_out_tongue:

But no, I have never manually set up TRIM on my Linux installations for my SSD. The partitions are mounted in my fstab with defaults 0 0, and some quick googling suggests I should mount them with defaults,discard. Can you confirm this?

And if this is the case, how should I fix it? Do I reinstall and reformat again, and edit my fstab file to mount all partitions on my SSD with TRIM enabled? I don’t recall there being an option for it in the installer itself.
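
Whether discard can do anything at all depends on the device advertising it, which can be checked before touching fstab. A minimal sketch, assuming the kernel exposes the limit under /sys/block/<dev>/queue/discard_max_bytes; the function name supports_discard and its optional second argument (an alternative sysfs root, only there to make the function testable) are hypothetical.

```shell
# Hypothetical helper: report whether a block device advertises discard (TRIM).
# $1 = device name without /dev/ (for example sda)
# $2 = sysfs root, defaults to /sys (assumption: only overridden for testing)
supports_discard() {
    dev=$1
    sysroot=${2:-/sys}
    f="$sysroot/block/$dev/queue/discard_max_bytes"
    # If the file is missing we cannot tell either way.
    [ -r "$f" ] || { echo unknown; return 0; }
    # A non-zero limit means the device accepts discard requests.
    if [ "$(cat "$f")" -gt 0 ]; then
        echo yes
    else
        echo no
    fi
}
```

supports_discard sda prints yes, no, or unknown. If it prints yes, a one-off manual TRIM of a mounted filesystem is sudo fstrim -v /, and many distributions ship an fstrim.timer unit for periodic trimming as an alternative to the discard mount option.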

Also, running sudo smartctl -a /dev/sdb on my hard drive says that End-to-End_Error is failing now. Again, I’m not sure how relevant this is to my SSD btrfs partition, but if I could knock out both problems at once that would be helpful. I do have a new hard drive lying around that I could clone over to, if it truly is irreparable.

Run an integrity test with GSmartControl and then post the result, so someone will know what to advise you.

The program is already in the repo.

https://gsmartcontrol.sourceforge.io/home/index.php/Screenshots
https://gsmartcontrol.sourceforge.io/home/index.php/

GSmartControl reports no errors after an extended self-test.

I set the SATA controller mode to AHCI in the BIOS, and then I modified the fstab file by adding noatime and discard.

I do not use btrfs, so I cannot blame it.

Wait, so why are you commenting on this?

Can you post the fstab line where you added those flags?

And yes, I have set AHCI mode in BIOS.

I simply listed what I did when installing on an SSD: enable AHCI in the BIOS and add noatime and discard in fstab.
Maybe the installation ISO is corrupted.

My fstab:

UUID=21f4840c-ff3d-48f0-a36a-5578ce6041c1 /                    ext4       acl,user_xattr,noatime,discard        1 1
UUID=4a79e528-7bbb-4357-ac42-c6b1b4796448 /archivio            ext4       defaults              1 2
UUID=2568bd9d-67b8-4cc3-9c61-6230b0b2d5a1 /boot                ext4       acl,user_xattr,noatime,discard        1 2
UUID=42b49c5f-a050-4506-b9b5-7b66fa9c071c /home                ext4       acl,user_xattr,noatime,discard        1 2
UUID=da51f0b3-b149-423f-a6a8-0556e79dc257 /tmp                 ext4       acl,user_xattr        1 2
UUID=3216c0a4-023c-4f56-8298-31f4e39c5ac3 /var                 ext4       acl,user_xattr        1 2

I don’t think the installation ISO would be problematic. And I’m not sure if those flags would be helpful for btrfs. Thank you for your input though.

Looks like failed SSD blocks to me. What happens if you destroy / re-write the entire drive with “dd if=/dev/zero of=/dev/… bs=64K” and reinstall? Does it reallocate the pending sectors?

I suppose that will be my next course of action if everything else fails. I have Windows dual booted on the SSD, which I’m happy to get rid of, but I’m not sure if this is the best option?

Also, I was under the impression that a larger block size is always better, so shouldn’t I run dd with bs=1M?
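
For a plain overwrite with zeros, bs only sets how much dd moves per read and write call; the bytes that end up on the target are the same. A small sketch on scratch files (nothing here touches a real disk; the function name compare_block_sizes is made up, and the K and M suffixes assume GNU dd):

```shell
# Hypothetical demo: write 1 MiB of zeros twice with different block sizes
# into temporary files and report whether the results are byte-identical.
compare_block_sizes() {
    tmpdir=$(mktemp -d) || return 1
    dd if=/dev/zero of="$tmpdir/a.img" bs=64K count=16 2>/dev/null  # 16 x 64 KiB
    dd if=/dev/zero of="$tmpdir/b.img" bs=1M count=1 2>/dev/null    # 1 x 1 MiB
    if cmp -s "$tmpdir/a.img" "$tmpdir/b.img"; then
        echo identical
    else
        echo different
    fi
    rm -rf "$tmpdir"
}
```

compare_block_sizes should print identical, since both files hold 1 MiB of zeros. On a real device the choice between 64K and 1M mostly affects throughput, not what gets written.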

My opinion:

The “dd” command will wipe your SSD (overwrite it with zeros).
Maybe you need to clear the memory cells. This link explains it better than I can:
https://wiki.archlinux.org/index.php/Solid_State_Drive/Memory_cell_clearing

But maybe it is better to wait for another opinion.

I will probably go over my SSD with dd, setting block size to 64K as previously recommended. Doing a bit of research, I don’t think clearing memory cells is going to help, as my SSD is old-ish now (I built this PC in 2013).

Failed pending sectors are remapped to spare space when they are written. You have a bunch though (2096). If you still have pending sectors after a complete write, then probably toss the drive in the trash (no more spare space left to remap failed blocks). After a complete write, do a complete read, then check the SMART state for any pending sectors.
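
The last step, checking SMART for remaining pending sectors, can be scripted. This is a sketch under assumptions: it expects the attribute table printed by smartctl -A on standard input and looks for attribute 197 (Current_Pending_Sector in smartmontools' naming), which the smartctl output posted earlier in the thread does not list, so it may print n/a on that drive. The function name pending_sectors is hypothetical.

```shell
# Hypothetical helper: print the raw value (10th column) of SMART
# attribute 197 from "smartctl -A" output on stdin, or "n/a" if the
# drive does not report that attribute.
pending_sectors() {
    awk '$1 == 197 { print $10; found = 1 }
         END { if (!found) print "n/a" }'
}

# Usage (/dev/sdX is a placeholder for the real device):
#   sudo smartctl -A /dev/sdX | pending_sectors
```

Running it once before the full write and once after the full read gives two numbers to compare; a count that stays above zero after the write is the case described above.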

Since you have an existing problem, TRIM manually to recover the SSD traps; you may not be able to simply wait for automated processes to work.
And yes, the following describes where the discard option can be set in your fstab (and elsewhere):

The ArchWiki for Solid State Drives
https://wiki.archlinux.org/index.php/Solid_State_Drive

The ArchWiki for BTRFS (more general questions and understanding)
https://wiki.archlinux.org/index.php/Btrfs

Since the information in the ArchWiki articles is continuously updated, I wouldn’t ever rely entirely on my memory; I highly recommend re-reading the articles with every install.

TSU-

When dealing with SSDs, everything about memory traps is always relevant.
Don’t dd your drive, or if you do, understand that unless you address clearing the traps, you’re only making your problem worse (like you might have done when you formatted).
Block size is irrelevant, and why would anyone set it to 64k?
Disk block sizes (logical blocks in the case of an SSD) affect other things like data storage efficiency and performance, and the optimal size is related to the size of your partitions and the typical sizes of your files. Those have nothing to do with what you are currently trying to fix, but could have some effect afterwards, when your system is running normally.
Note that specifying a block size when doing dd is different… You can make the writing go faster, but unless you address it, there will be a larger remainder which won’t be written.

TSU

Write entire drive. Read entire drive. Check if any pending remain. Trim won’t help failed allocated blocks.