Potentially bad SSD(s) troubleshooting

Summary:
My filesystem becomes read-only while gaming, but I can’t replicate the issue in any other context and I can’t identify any hardware cause.

Detail:
I have two BTRFS volumes on my install: one made up of fast(er) SSDs and one of slow(er) HDDs.

Label: none  uuid: d348388c-1503-4336-b353-ab2e5cda424f
        Total devices 5 FS bytes used 151.81GiB
        devid    1 size 232.38GiB used 0.00B path /dev/nvme0n1p3
        devid    2 size 1.77TiB used 95.03GiB path /dev/nvme1n1p2
        devid    3 size 931.51GiB used 0.00B path /dev/sda1
        devid    4 size 465.76GiB used 0.00B path /dev/sdc1
        devid    5 size 1.82TiB used 137.03GiB path /dev/sde1

Label: none  uuid: 5df05a5b-d0ce-4ff1-bf11-61a2e3f2e872
        Total devices 2 FS bytes used 3.86TiB
        devid    1 size 1.82TiB used 5.02GiB path /dev/sdb1
        devid    2 size 12.73TiB used 3.87TiB path /dev/sdd1

Games and / are both on the first volume, and /home is on the second. I also use the SSD volume for video editing and for storing raw files.
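
As a sanity check on which members of the SSD volume actually hold data, here is a rough Python sketch that reads the listing above and keeps only the devices whose "used" figure is non-zero. The helper name devices_with_allocation is made up for illustration, and I am assuming the per-device "used" column reflects space allocated on that device.

```python
import re

# Text copied from the `btrfs filesystem show` listing above (SSD volume only).
SAMPLE = """\
Label: none  uuid: d348388c-1503-4336-b353-ab2e5cda424f
        Total devices 5 FS bytes used 151.81GiB
        devid    1 size 232.38GiB used 0.00B path /dev/nvme0n1p3
        devid    2 size 1.77TiB used 95.03GiB path /dev/nvme1n1p2
        devid    3 size 931.51GiB used 0.00B path /dev/sda1
        devid    4 size 465.76GiB used 0.00B path /dev/sdc1
        devid    5 size 1.82TiB used 137.03GiB path /dev/sde1
"""

# One member-device line: devid, size, used, path.
DEVICE_LINE = re.compile(
    r"devid\s+(?P<devid>\d+)\s+size\s+(?P<size>\S+)"
    r"\s+used\s+(?P<used>\S+)\s+path\s+(?P<path>\S+)"
)


def devices_with_allocation(show_output):
    """Return the paths of member devices whose 'used' column is not zero."""
    paths = []
    for match in DEVICE_LINE.finditer(show_output):
        # Take the leading number and ignore the unit suffix (B, GiB, TiB, ...).
        number = float(re.match(r"[\d.]+", match.group("used")).group())
        if number > 0:
            paths.append(match.group("path"))
    return paths


print(devices_with_allocation(SAMPLE))
```

Going by this listing, only /dev/nvme1n1p2 and /dev/sde1 currently show any allocation, so those two may be the most relevant devices to look at first.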

For the last two weeks or so I have been getting an error in Windows games: the game stops being able to access the drive, the filesystem on / becomes read-only, and I am unable to do anything but hard restart. /home is still accessible at this time, and the game is usually able to limp along for a minute or so before it crashes.

I say “Windows games” because this does not happen in the native Linux games that I have tested. It doesn’t happen while video editing. There was no issue when I copied about 3 TB of video files to the drives and randomly scrubbed through them, but multiple Windows games in both Proton (via Steam) and Wine (via Lutris) will reliably produce this error.
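
To pin down the moment / flips to read-only, something like the following Python sketch could be run in a loop from a second terminal. It only parses mount-table text in the /proc/mounts format. The function name is hypothetical, the example lines are made up for illustration rather than taken from my machine, and I am assuming the forced read-only state shows up as an "ro" option there.

```python
def mount_is_read_only(mounts_text, mount_point):
    """Return True if the last mount-table entry for mount_point has the 'ro' option."""
    read_only = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        read_only = "ro" in fields[3].split(",")
    if read_only is None:
        raise ValueError(f"{mount_point} not found in mount table")
    return read_only


# Illustrative lines only (hypothetical, not copied from the affected system).
EXAMPLE = """\
/dev/example1 / btrfs rw,relatime,ssd 0 0
/dev/example2 /home btrfs rw,relatime 0 0
"""

print(mount_is_read_only(EXAMPLE, "/"))
```

On the real system the text would come from reading /proc/mounts instead of EXAMPLE.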

I ran Memtest86+ for 22 hours without any errors being reported.

Below is the output of smartctl for all SSDs:

trillian@nazara:/dev> sudo smartctl -a /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.8-1-default] (SUSE RPM)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      S3ESNX0JB41433D
Firmware Version:                   3B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            16,760,832 [16.7 MB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5b71b14a36
Local Time is:                      Wed Jan  8 00:05:22 2025 GMT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    27%
Data Units Read:                    89,670,094 [45.9 TB]
Data Units Written:                 112,967,270 [57.8 TB]
Host Read Commands:                 787,604,819
Host Write Commands:                1,378,171,547
Controller Busy Time:               6,476
Power Cycles:                       2,512
Power On Hours:                     11,143
Unsafe Shutdowns:                   228
Media and Data Integrity Errors:    0
Error Information Log Entries:      16,834
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               51 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0      16834     0  0x0010  0x4004      -            0     0     -  Invalid Field in Command
  1      16833     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  2      16832     0  0x000c  0x4004      -            0     0     -  Invalid Field in Command
  3      16831     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  4      16830     0  0x101d  0x4004      -            0     0     -  Invalid Field in Command
  5      16829     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  6      16828     0  0x000c  0x4004      -            0     0     -  Invalid Field in Command
  7      16827     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
  8      16826     0  0x0000  0x4004      -            0     0     -  Invalid Field in Command
  9      16825     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
 10      16824     0  0x0009  0x4004      -            0     0     -  Invalid Field in Command
 11      16823     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
 12      16822     0  0x000c  0x4004      -            0     0     -  Invalid Field in Command
 13      16821     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
 14      16820     0  0x0004  0x4004      -            0     0     -  Invalid Field in Command
 15      16819     0  0x0010  0x421a  0x028            0     0     -  Feature Identifier Not Saveable
... (48 entries not read)

Self-tests not supported

trillian@nazara:/dev> sudo smartctl -a /dev/nvme1n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.8-1-default] (SUSE RPM)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 PRO 2TB
Serial Number:                      S69ENF0R405296E
Firmware Version:                   2B2QGXA7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            174,215,770,112 [174 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 b411b2e5d6
Local Time is:                      Wed Jan  8 00:07:34 2025 GMT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    92%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    764,346,789 [391 TB]
Data Units Written:                 113,592,016 [58.1 TB]
Host Read Commands:                 2,598,820,730
Host Write Commands:                792,578,272
Controller Busy Time:               20,368
Power Cycles:                       1,322
Power On Hours:                     5,165
Unsafe Shutdowns:                   282
Media and Data Integrity Errors:    879
Error Information Log Entries:      879
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               49 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x002)
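
The overall result says PASSED for both NVMe drives, but the individual counters differ: nvme0n1 reports 0 media and data integrity errors, while nvme1n1 reports 879. A small Python sketch for pulling the numeric counters out of the health log text follows. parse_health is a made-up helper name, and the sample lines are copied from the nvme1n1 output above.

```python
import re

# Lines copied from the nvme1n1 SMART/Health section above.
NVME1_HEALTH = """\
Critical Warning:                   0x00
Available Spare:                    92%
Percentage Used:                    4%
Unsafe Shutdowns:                   282
Media and Data Integrity Errors:    879
Error Information Log Entries:      879
"""


def parse_health(text):
    """Map each 'Name:  value' line to an int when the value starts with decimal digits."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        match = re.match(r"[\d,]+", value)
        # Skip hex values such as 0x00; keep plain counters and percentages.
        if match and not value.startswith("0x"):
            fields[name.strip()] = int(match.group().replace(",", ""))
    return fields


health = parse_health(NVME1_HEALTH)
print(health["Media and Data Integrity Errors"])
```

Comparing this counter before and after a crash might show whether it grows when the filesystem goes read-only.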

trillian@nazara:/dev> sudo smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.8-1-default] (SUSE RPM)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 1TB
Serial Number:    S1D9NSAF415996D
LU WWN Device Id: 5 002538 8a03d3c66
Firmware Version: EXT0DB6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan  8 00:07:55 2025 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (15000) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 250) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       57433
 12 Power_Cycle_Count       0x0032   095   095   000    Old_age   Always       -       4182
177 Wear_Leveling_Count     0x0013   082   082   000    Pre-fail  Always       -       210
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   001   010    Pre-fail  Always   In_the_past 0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   068   029   000    Old_age   Always       -       32
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       5
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       785
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       199131988495

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         0         -
# 2  Short offline       Completed without error       00%         0         -
# 3  Short offline       Completed without error       00%        70         -
# 4  Extended offline    Completed without error       00%        70         -
# 5  Short offline       Completed without error       00%        47         -
# 6  Short offline       Completed without error       00%        22         -
# 7  Short offline       Completed without error       00%         4         -
# 8  Short offline       Completed without error       00%        39         -
# 9  Short offline       Completed without error       00%        39         -
#10  Short offline       Completed without error       00%         0         -
#11  Short offline       Completed without error       00%       126         -
#12  Short offline       Completed without error       00%       102         -
#13  Short offline       Completed without error       00%        78         -
#14  Short offline       Completed without error       00%        54         -
#15  Short offline       Completed without error       00%        30         -
#16  Short offline       Completed without error       00%         6         -
#17  Short offline       Completed without error       00%        14         -
#18  Short offline       Completed without error       00%         0         -
#19  Short offline       Completed without error       00%        11         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

trillian@nazara:/dev> sudo smartctl -a /dev/sdc
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.8-1-default] (SUSE RPM)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 500GB
Serial Number:    S1DHNSAF669326F
LU WWN Device Id: 5 002538 8a0503934
Firmware Version: EXT0DB6Q
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan  8 00:08:31 2025 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 6600) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 110) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   001   010    Pre-fail  Always   In_the_past 0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       55305
 12 Power_Cycle_Count       0x0032   096   096   000    Old_age   Always       -       3759
177 Wear_Leveling_Count     0x0013   094   094   000    Pre-fail  Always       -       62
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   001   010    Pre-fail  Always   In_the_past 0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   001   010    Pre-fail  Always   In_the_past 0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   026   000    Old_age   Always       -       30
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       341
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       53324060629

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         0         -
# 2  Short offline       Completed without error       00%         0         -
# 3  Short offline       Completed without error       00%        70         -
# 4  Extended offline    Completed without error       00%        69         -
# 5  Short offline       Completed without error       00%        47         -
# 6  Short offline       Completed without error       00%        22         -
# 7  Short offline       Completed without error       00%         4         -
# 8  Short offline       Completed without error       00%        39         -
# 9  Short offline       Completed without error       00%        39         -
#10  Short offline       Completed without error       00%         0         -
#11  Short offline       Completed without error       00%       126         -
#12  Short offline       Completed without error       00%       102         -
#13  Short offline       Completed without error       00%        78         -
#14  Short offline       Completed without error       00%        54         -
#15  Short offline       Completed without error       00%        30         -
#16  Short offline       Completed without error       00%         6         -
#17  Short offline       Completed without error       00%        14         -
#18  Short offline       Completed without error       00%         0         -
#19  Short offline       Completed without error       00%        11         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

trillian@nazara:/dev> sudo smartctl -a /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.8-1-default] (SUSE RPM)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 2TB
Serial Number:    S621NF0R816434T
LU WWN Device Id: 5 002538 f41823907
Firmware Version: SVT01B6Q
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jan  8 00:08:52 2025 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 160) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       12071
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1052
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       20
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       1
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       1
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   068   046   000    Old_age   Always       -       32
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   075   075   000    Old_age   Always       -       25226
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       80
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       59546445105

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Completed without error       00%     12065         -
# 2  Short offline       Completed without error       00%     12060         -
# 3  Extended offline    Completed without error       00%     12038         -
# 4  Short offline       Completed without error       00%     12014         -
# 5  Short offline       Completed without error       00%     11990         -
# 6  Short offline       Completed without error       00%     11966         -
# 7  Short offline       Completed without error       00%     11949         -
# 8  Short offline       Completed without error       00%     11949         -
# 9  Short offline       Completed without error       00%     11910         -
#10  Short offline       Completed without error       00%     11888         -
#11  Short offline       Completed without error       00%     11864         -
#12  Short offline       Completed without error       00%     11840         -
#13  Short offline       Completed without error       00%     11816         -
#14  Short offline       Completed without error       00%     11792         -
#15  Short offline       Completed without error       00%     11768         -
#16  Short offline       Completed without error       00%     11744         -
#17  Short offline       Completed without error       00%     11731         -
#18  Short offline       Completed without error       00%     11717         -
#19  Short offline       Completed without error       00%     11707         -
#20  Short offline       Completed without error       00%     11698         -
#21  Short offline       Completed without error       00%     11677         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  256        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
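
Since the overall PASSED result hides the individual raw values, here is a rough Python sketch that scans an attribute table and reports the non-zero raw values for a few attributes. The watch list is my own choice, the function name is hypothetical, and the rows are copied from the /dev/sde table above.

```python
# Rows copied from the /dev/sde attribute table above.
SDE_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       1
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       1
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   075   075   000    Old_age   Always       -       25226
"""

# Attributes to report on (an assumption for this example, not an official list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Runtime_Bad_Block",
    "Uncorrectable_Error_Cnt",
    "CRC_Error_Count",
}


def nonzero_watched(table_text):
    """Return {attribute name: raw value} for watched attributes with a raw value above zero."""
    result = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row has 10 columns and starts with the numeric attribute ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            result[name] = int(raw)
    return result


print(nonzero_watched(SDE_ROWS))
```

For /dev/sde this picks out CRC_Error_Count at 25226, against 5 on /dev/sda and 0 on /dev/sdc. As far as I understand it, that attribute counts errors on the SATA link rather than in the flash itself, so the cable or port may be worth checking.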

I’ll post the btrfs check output as a reply to this post due to character limits, but as far as I can tell that’s fine too.

I tested Baldur’s Gate 3 (a game which consistently reproduces this error on the SSD array) installed to the /home HDDs and it was fine (albeit painfully slow).

At this point I do believe that there is some issue with at least one of my SSDs, but I can’t determine which one, and as far as I can tell all of the tests say that they’re fine. At any rate, they do seem to be fine for every single use that is not Wine/Proton related. If there were a hardware issue, would it not be apparent in more contexts than this?

I’m at a loss as to where to go to further troubleshoot, so any advice would be appreciated.

Thanks

Below is the output of btrfs check:

trillian@nazara:/dev> sudo btrfs check --force nvme0n1p3
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on nvme0n1p3
UUID: d348388c-1503-4336-b353-ab2e5cda424f
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
Rescan hasn't been initialzied, a difference in qgroup accounting is expected
Counts for qgroup id: 0/257 are different
our:            referenced 10989285376 referenced compressed 10989285376
disk:           referenced 10998145024 referenced compressed 10998145024
diff:           referenced -8859648 referenced compressed -8859648
our:            exclusive 10989285376 exclusive compressed 10989285376
disk:           exclusive 10998145024 exclusive compressed 10998145024
diff:           exclusive -8859648 exclusive compressed -8859648
Counts for qgroup id: 0/260 are different
our:            referenced 12288000 referenced compressed 12288000
disk:           referenced 12292096 referenced compressed 12292096
diff:           referenced -4096 referenced compressed -4096
our:            exclusive 12288000 exclusive compressed 12288000
disk:           exclusive 12292096 exclusive compressed 12292096
diff:           exclusive -4096 exclusive compressed -4096
Counts for qgroup id: 0/262 are different
our:            referenced 105827209216 referenced compressed 105827209216
disk:           referenced 105827504128 referenced compressed 105827504128
diff:           referenced -294912 referenced compressed -294912
our:            exclusive 105823268864 exclusive compressed 105823268864
disk:           exclusive 105823563776 exclusive compressed 105823563776
diff:           exclusive -294912 exclusive compressed -294912
Counts for qgroup id: 0/1651 are different
our:            referenced 15659401216 referenced compressed 15659401216
disk:           referenced 15659401216 referenced compressed 15659401216
our:            exclusive 49152 exclusive compressed 49152
disk:           exclusive 16384 exclusive compressed 16384
diff:           exclusive 32768 exclusive compressed 32768
Counts for qgroup id: 1/0 are different
our:            referenced 32280817664 referenced compressed 32280817664
disk:           referenced 32269066240 referenced compressed 32269066240
diff:           referenced 11751424 referenced compressed 11751424
our:            exclusive 16621432832 exclusive compressed 16621432832
disk:           exclusive 16621334528 exclusive compressed 16621334528
diff:           exclusive 98304 exclusive compressed 98304
found 149414977536 bytes used, no error found
total csum bytes: 134963136
total tree bytes: 1907982336
total fs tree bytes: 1628717056
total extent tree bytes: 93388800
btree space waste bytes: 472119566
file data blocks allocated: 292656271360
 referenced 231270973440
trillian@nazara:/dev> sudo btrfs check --force nvme1n1p2
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on nvme1n1p2
UUID: d348388c-1503-4336-b353-ab2e5cda424f
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
Rescan hasn't been initialzied, a difference in qgroup accounting is expected
Counts for qgroup id: 0/257 are different
our:            referenced 10989301760 referenced compressed 10989301760
disk:           referenced 10998145024 referenced compressed 10998145024
diff:           referenced -8843264 referenced compressed -8843264
our:            exclusive 10989301760 exclusive compressed 10989301760
disk:           exclusive 10998145024 exclusive compressed 10998145024
diff:           exclusive -8843264 exclusive compressed -8843264
Counts for qgroup id: 0/260 are different
our:            referenced 12288000 referenced compressed 12288000
disk:           referenced 12292096 referenced compressed 12292096
diff:           referenced -4096 referenced compressed -4096
our:            exclusive 12288000 exclusive compressed 12288000
disk:           exclusive 12292096 exclusive compressed 12292096
diff:           exclusive -4096 exclusive compressed -4096
Counts for qgroup id: 0/262 are different
our:            referenced 105827278848 referenced compressed 105827278848
disk:           referenced 105827504128 referenced compressed 105827504128
diff:           referenced -225280 referenced compressed -225280
our:            exclusive 105823338496 exclusive compressed 105823338496
disk:           exclusive 105823563776 exclusive compressed 105823563776
diff:           exclusive -225280 exclusive compressed -225280
Counts for qgroup id: 0/1651 are different
our:            referenced 15659401216 referenced compressed 15659401216
disk:           referenced 15659401216 referenced compressed 15659401216
our:            exclusive 49152 exclusive compressed 49152
disk:           exclusive 16384 exclusive compressed 16384
diff:           exclusive 32768 exclusive compressed 32768
Counts for qgroup id: 1/0 are different
our:            referenced 32280817664 referenced compressed 32280817664
disk:           referenced 32269066240 referenced compressed 32269066240
diff:           referenced 11751424 referenced compressed 11751424
our:            exclusive 16621432832 exclusive compressed 16621432832
disk:           exclusive 16621334528 exclusive compressed 16621334528
diff:           exclusive 98304 exclusive compressed 98304
found 149415063552 bytes used, no error found
total csum bytes: 134963292
total tree bytes: 1907998720
total fs tree bytes: 1628733440
total extent tree bytes: 93388800
btree space waste bytes: 472130725
file data blocks allocated: 292655816704
 referenced 231271038976
trillian@nazara:/dev> sudo btrfs check --force sda1
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on sda1
UUID: d348388c-1503-4336-b353-ab2e5cda424f
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
Rescan hasn't been initialzied, a difference in qgroup accounting is expected
Counts for qgroup id: 0/257 are different
our:            referenced 10989305856 referenced compressed 10989305856
disk:           referenced 10998145024 referenced compressed 10998145024
diff:           referenced -8839168 referenced compressed -8839168
our:            exclusive 10989305856 exclusive compressed 10989305856
disk:           exclusive 10998145024 exclusive compressed 10998145024
diff:           exclusive -8839168 exclusive compressed -8839168
Counts for qgroup id: 0/260 are different
our:            referenced 12288000 referenced compressed 12288000
disk:           referenced 12292096 referenced compressed 12292096
diff:           referenced -4096 referenced compressed -4096
our:            exclusive 12288000 exclusive compressed 12288000
disk:           exclusive 12292096 exclusive compressed 12292096
diff:           exclusive -4096 exclusive compressed -4096
Counts for qgroup id: 0/262 are different
our:            referenced 105827397632 referenced compressed 105827397632
disk:           referenced 105827504128 referenced compressed 105827504128
diff:           referenced -106496 referenced compressed -106496
our:            exclusive 105823457280 exclusive compressed 105823457280
disk:           exclusive 105823563776 exclusive compressed 105823563776
diff:           exclusive -106496 exclusive compressed -106496
Counts for qgroup id: 0/1651 are different
our:            referenced 15659401216 referenced compressed 15659401216
disk:           referenced 15659401216 referenced compressed 15659401216
our:            exclusive 49152 exclusive compressed 49152
disk:           exclusive 16384 exclusive compressed 16384
diff:           exclusive 32768 exclusive compressed 32768
Counts for qgroup id: 1/0 are different
our:            referenced 32280817664 referenced compressed 32280817664
disk:           referenced 32269066240 referenced compressed 32269066240
diff:           referenced 11751424 referenced compressed 11751424
our:            exclusive 16621432832 exclusive compressed 16621432832
disk:           exclusive 16621334528 exclusive compressed 16621334528
diff:           exclusive 98304 exclusive compressed 98304
found 149415186432 bytes used, no error found
total csum bytes: 134963392
total tree bytes: 1908015104
total fs tree bytes: 1628749824
total extent tree bytes: 93388800
btree space waste bytes: 472144144
file data blocks allocated: 292655923200
 referenced 231271108608
trillian@nazara:/dev> sudo btrfs check --force sdc1
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on sdc1
UUID: d348388c-1503-4336-b353-ab2e5cda424f
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
Rescan hasn't been initialzied, a difference in qgroup accounting is expected
Counts for qgroup id: 0/257 are different
our:            referenced 10989305856 referenced compressed 10989305856
disk:           referenced 10998145024 referenced compressed 10998145024
diff:           referenced -8839168 referenced compressed -8839168
our:            exclusive 10989305856 exclusive compressed 10989305856
disk:           exclusive 10998145024 exclusive compressed 10998145024
diff:           exclusive -8839168 exclusive compressed -8839168
Counts for qgroup id: 0/260 are different
our:            referenced 12288000 referenced compressed 12288000
disk:           referenced 12292096 referenced compressed 12292096
diff:           referenced -4096 referenced compressed -4096
our:            exclusive 12288000 exclusive compressed 12288000
disk:           exclusive 12292096 exclusive compressed 12292096
diff:           exclusive -4096 exclusive compressed -4096
Counts for qgroup id: 0/262 are different
our:            referenced 105827762176 referenced compressed 105827762176
disk:           referenced 105827504128 referenced compressed 105827504128
diff:           referenced 258048 referenced compressed 258048
our:            exclusive 105823821824 exclusive compressed 105823821824
disk:           exclusive 105823563776 exclusive compressed 105823563776
diff:           exclusive 258048 exclusive compressed 258048
Counts for qgroup id: 0/1651 are different
our:            referenced 15659401216 referenced compressed 15659401216
disk:           referenced 15659401216 referenced compressed 15659401216
our:            exclusive 49152 exclusive compressed 49152
disk:           exclusive 16384 exclusive compressed 16384
diff:           exclusive 32768 exclusive compressed 32768
Counts for qgroup id: 1/0 are different
our:            referenced 32280817664 referenced compressed 32280817664
disk:           referenced 32269066240 referenced compressed 32269066240
diff:           referenced 11751424 referenced compressed 11751424
our:            exclusive 16621432832 exclusive compressed 16621432832
disk:           exclusive 16621334528 exclusive compressed 16621334528
diff:           exclusive 98304 exclusive compressed 98304
found 149415550976 bytes used, no error found
total csum bytes: 134963748
total tree bytes: 1908015104
total fs tree bytes: 1628749824
total extent tree bytes: 93388800
btree space waste bytes: 472141970
file data blocks allocated: 292656287744
 referenced 231271251968
trillian@nazara:/dev> sudo btrfs check --force sde1
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on sde1
UUID: d348388c-1503-4336-b353-ab2e5cda424f
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
Rescan hasn't been initialzied, a difference in qgroup accounting is expected
Counts for qgroup id: 0/257 are different
our:            referenced 10989305856 referenced compressed 10989305856
disk:           referenced 10998145024 referenced compressed 10998145024
diff:           referenced -8839168 referenced compressed -8839168
our:            exclusive 10989305856 exclusive compressed 10989305856
disk:           exclusive 10998145024 exclusive compressed 10998145024
diff:           exclusive -8839168 exclusive compressed -8839168
Counts for qgroup id: 0/260 are different
our:            referenced 12288000 referenced compressed 12288000
disk:           referenced 12292096 referenced compressed 12292096
diff:           referenced -4096 referenced compressed -4096
our:            exclusive 12288000 exclusive compressed 12288000
disk:           exclusive 12292096 exclusive compressed 12292096
diff:           exclusive -4096 exclusive compressed -4096
Counts for qgroup id: 0/262 are different
our:            referenced 105827635200 referenced compressed 105827635200
disk:           referenced 105827504128 referenced compressed 105827504128
diff:           referenced 131072 referenced compressed 131072
our:            exclusive 105823694848 exclusive compressed 105823694848
disk:           exclusive 105823563776 exclusive compressed 105823563776
diff:           exclusive 131072 exclusive compressed 131072
Counts for qgroup id: 0/1651 are different
our:            referenced 15659401216 referenced compressed 15659401216
disk:           referenced 15659401216 referenced compressed 15659401216
our:            exclusive 49152 exclusive compressed 49152
disk:           exclusive 16384 exclusive compressed 16384
diff:           exclusive 32768 exclusive compressed 32768
Counts for qgroup id: 1/0 are different
our:            referenced 32280817664 referenced compressed 32280817664
disk:           referenced 32269066240 referenced compressed 32269066240
diff:           referenced 11751424 referenced compressed 11751424
our:            exclusive 16621432832 exclusive compressed 16621432832
disk:           exclusive 16621334528 exclusive compressed 16621334528
diff:           exclusive 98304 exclusive compressed 98304
found 149415440384 bytes used, no error found
total csum bytes: 134963624
total tree bytes: 1908031488
total fs tree bytes: 1628749824
total extent tree bytes: 93405184
btree space waste bytes: 472156528
file data blocks allocated: 292656160768
 referenced 231271333888

@cakeisamadeupdrug Well, the 840 is blacklisted for trim (line 4136 of https://github.com/torvalds/linux/blob/master/drivers/ata/libata-core.c); as for the implications, I have no idea… the 980 should be fine…

You need to check the manufacturer’s website specs for what the attributes really mean.

Based on what? You did not show a single error message that would point at an SSD as the root cause. Actually, you did not show anything that would allow anyone to even start troubleshooting.

Show the full dmesg output after the filesystem goes read-only. Upload it to https://paste.opensuse.org/


Thank you for your help. I am going to keep trying to do as you ask, but it’s difficult to type anything when the terminal is filling up with this message.

ok with some tricky order of operations stuff I managed to keep bash, kate and the system in general functioning long enough to copy and paste dmesg

The system is saying that one of my nvme controllers is faulty

Buffer I/O error on dev sr0, logical block 0, async page read
[  117.970457] [   T4586] sr 3:0:0:0: [sr0] tag#4 unaligned transfer
[  117.970459] [   T4586] I/O error, dev sr0, sector 1 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  117.970463] [   T4586] Buffer I/O error on dev sr0, logical block 1, async page read
[  117.970468] [   T4586] sr 3:0:0:0: [sr0] tag#5 unaligned transfer
[  117.970471] [   T4586] I/O error, dev sr0, sector 2 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  117.970474] [   T4586] Buffer I/O error on dev sr0, logical block 2, async page read
[  117.970479] [   T4586] sr 3:0:0:0: [sr0] tag#6 unaligned transfer
[  117.970481] [   T4586] I/O error, dev sr0, sector 3 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  117.970485] [   T4586] Buffer I/O error on dev sr0, logical block 3, async page read
[  117.970490] [   T4586] sr 3:0:0:0: [sr0] tag#7 unaligned transfer
[  117.970492] [   T4586] I/O error, dev sr0, sector 4 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  117.970496] [   T4586] Buffer I/O error on dev sr0, logical block 4, async page read
[  117.970501] [   T4586] sr 3:0:0:0: [sr0] tag#8 unaligned transfer
[  117.970503] [   T4586] I/O error, dev sr0, sector 5 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  117.970506] [   T4586] Buffer I/O error on dev sr0, logical block 5, async page read
[  117.970512] [   T4586] sr 3:0:0:0: [sr0] tag#9 unaligned transfer
[  117.970514] [   T4586] I/O error, dev sr0, sector 6 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  117.970517] [   T4586] Buffer I/O error on dev sr0, logical block 6, async page read
[  117.970523] [   T4586] sr 3:0:0:0: [sr0] tag#10 unaligned transfer
[  117.970525] [   T4586] I/O error, dev sr0, sector 7 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  117.970528] [   T4586] Buffer I/O error on dev sr0, logical block 7, async page read
[  232.837388] [    T292] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[  232.837397] [    T292] nvme nvme0: Does your device have a faulty power saving mode enabled?
[  232.837399] [    T292] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[  232.864140] [    T292] nvme0n1: I/O Cmd(0x2) @ LBA 339313520, 104 blocks, I/O Error (sct 0x3 / sc 0x71)
[  232.864151] [    T292] I/O error, dev nvme0n1, sector 339313520 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  232.864162] [    T292] nvme0n1: I/O Cmd(0x2) @ LBA 356568776, 1024 blocks, I/O Error (sct 0x3 / sc 0x71)
[  232.864166] [    T292] I/O error, dev nvme0n1, sector 356568776 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
[  232.864173] [    T292] nvme0n1: I/O Cmd(0x2) @ LBA 356569800, 384 blocks, I/O Error (sct 0x3 / sc 0x71)
[  232.864176] [    T292] I/O error, dev nvme0n1, sector 356569800 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0
[  232.864182] [    T292] nvme0n1: I/O Cmd(0x2) @ LBA 356570208, 360 blocks, I/O Error (sct 0x3 / sc 0x71)
[  232.864185] [    T292] I/O error, dev nvme0n1, sector 356570208 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  232.864191] [    T292] nvme0n1: I/O Cmd(0x2) @ LBA 356570592, 152 blocks, I/O Error (sct 0x3 / sc 0x71)
[  232.864194] [    T292] I/O error, dev nvme0n1, sector 356570592 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  232.864201] [    T292] nvme0n1: I/O Cmd(0x2) @ LBA 356570840, 1024 blocks, I/O Error (sct 0x3 / sc 0x71)
[  232.864204] [    T292] I/O error, dev nvme0n1, sector 356570840 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
[  232.864209] [    T292] nvme0n1: I/O Cmd(0x2) @ LBA 356571864, 304 blocks, I/O Error (sct 0x3 / sc 0x71)
[  232.864212] [    T292] I/O error, dev nvme0n1, sector 356571864 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  232.881403] [    T830] nvme 0000:04:00.0: enabling device (0000 -> 0002)
[  232.881522] [    T830] nvme nvme0: Disabling device after reset failure: -19
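
For anyone reading the log above: the two numbers in (sct 0x3 / sc 0x71) are the NVMe status code type and the status code. Here is a minimal sketch of combining them into a single status value, assuming the usual layout of sct in bits 10:8 and sc in bits 7:0. nvme_status is a made-up helper name, not part of nvme-cli or any other tool.

```shell
#!/bin/sh
# nvme_status SCT SC
# Combine the status code type and status code from a kernel log line
# such as "I/O Error (sct 0x3 / sc 0x71)" into one hex value.
# Assumption: sct occupies bits 10:8 and sc occupies bits 7:0.
nvme_status() {
    printf '0x%x\n' $(( ($1 << 8) | $2 ))
}

# Example (values taken from the dmesg output in this thread):
# nvme_status 0x3 0x71
```

As far as I can tell from include/linux/nvme.h, 0x371 is NVME_SC_HOST_ABORTED_CMD, which would suggest the host gave up on the commands after the controller went down, rather than the drive reporting bad media. Treat that as my reading of the header, not a diagnosis.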

ngl I have not got a clue what Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" actually means (i.e. where or how to “try” it). Googling it reveals an awful lot of similar NVME issues since July 2024, however; I have no idea if this is related. I assumed rollback didn’t work because it was a physical problem with my drive (which it may well be). However, if these posts I’m seeing are indicative of some kernel issue with certain NVME controllers, I might just not have a kernel old enough to roll back to… I also have no idea why it’s absolutely fine until the second I launch a game in Wine.

@cakeisamadeupdrug Hi,

At boot (since your system goes read-only), when you see the GRUB menu, you can press the ‘e’ key to edit, then use the arrow keys to scroll down to the line starting with linux or linuxefi, press the End key, and add the boot options nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off. Then press F10 to boot, and it will temporarily add them… Not sure it will help though.

Suggest you wait for more feedback from @arvidjaar before doing anything.
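
Since options added from the GRUB edit screen only last for one boot, it can be worth confirming after booting that they actually took effect. Here is a minimal sketch that checks a kernel command line string for the three options; missing_params is a hypothetical helper name I made up for illustration.

```shell
#!/bin/sh
# missing_params CMDLINE PARAM...
# Print every PARAM that does not appear as a whole word in CMDLINE.
# Prints nothing when all of them are present.
missing_params() {
    cmdline=$1
    shift
    for p in "$@"; do
        case " $cmdline " in
            *" $p "*) ;;                 # found as a separate word
            *) printf '%s\n' "$p" ;;     # not on the command line
        esac
    done
}

# On the running system (the kernel exposes its command line in /proc/cmdline):
# missing_params "$(cat /proc/cmdline)" \
#     nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
```

If the commented command prints nothing, all three options are active for the current boot.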

If the nvme drive is failing, you can validate it by reading every sector with a dd command:

sudo dd if=/dev/nvme0n1 of=/dev/null status=progress

It will either stop and say which sector is bad, or complete with end of file.

Do not use bs= to make it go faster: it will then report which block is bad rather than which sector.

This also works with hard drives: use if=/dev/sda (or whichever drive you want to check).

It will slow your system down while it is reading every sector.
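
If you do use bs= for speed anyway, a block number can be converted back into a range of sectors by hand. Here is a rough sketch, assuming 512-byte sectors (which is also dd's default block size) and a bs that is a multiple of 512. block_to_sectors is a made-up helper name.

```shell
#!/bin/sh
# block_to_sectors BLOCK BS
# Print the first and last 512-byte sector covered by dd block number BLOCK
# (counting from 0) when dd was run with bs=BS bytes.
# Assumption: BS is a multiple of 512.
block_to_sectors() {
    block=$1
    bs=$2
    per_block=$(( bs / 512 ))            # sectors in one dd block
    first=$(( block * per_block ))
    last=$(( first + per_block - 1 ))
    printf '%s %s\n' "$first" "$last"
}

# Example: with bs=1M (1048576 bytes), block 10 spans 2048 sectors:
# block_to_sectors 10 1048576
```

The kernel log (dmesg) should still name the exact failing sector either way, so this is only a fallback.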

Yes, your NVMe device just disappeared.

My educated guess is that the game stresses the device by loading a large amount of data.

You could try the suggested kernel parameters. If that does not help, the easiest fix is to replace this NVMe device.

I will experiment later when I have a minute. In the meantime, is there an easy way to see which of the two NVME drives it is?

Edit: I just re-read, it said nvme0 lmao. I’ll see which one that is xD

ngl I’m kind of surprised the system is as usable as it is – albeit read only – if a massive chunk of the / directory is just gone.

According to your smartctl command…

Yeah thanks, when I asked I hadn’t noticed that it specified nvme0. At least it’s the older (and smaller) of my NVMEs
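
Another way to match a controller name like nvme0 to a physical drive, without going back through the smartctl output, is to read the model string the kernel exposes in sysfs. This is a sketch assuming the usual /sys/class/nvme/<controller>/model layout; nvme_models is a made-up function name, and the optional argument exists only so it can be pointed at a different directory.

```shell
#!/bin/sh
# nvme_models [ROOT]
# Print "controller: model" for every NVMe controller found under ROOT
# (default /sys/class/nvme). Controllers without a readable model file
# are skipped.
nvme_models() {
    root=${1:-/sys/class/nvme}
    for ctrl in "$root"/nvme*; do
        [ -r "$ctrl/model" ] || continue
        printf '%s: %s\n' "${ctrl##*/}" "$(cat "$ctrl/model")"
    done
}

# Usage on the affected machine:
# nvme_models
```

Note that a controller that has already dropped off the bus may no longer show up here, so it is best run right after boot.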

From the SMART output as given above:

Error Information Log Entries:      16,834

Wow!

But the other NVME SMART (Samsung SSD 980 PRO 2TB) output has:

Media and Data Integrity Errors:    879

Yeah tbh the system has been shutting down ungracefully from this, and before this I had a bad extension cable causing issues (which, who knows maybe that damaged the nvme controller) so I would expect some incongruities.

I’m looking at the man page for btrfs, particularly removing drives (which seems likely to be something I’m going to have to do), and because it’s part of what is essentially a JBOD array I’d just like to check how I go about doing it. It says btrfs device remove <device-or-devid> <mountpoint>. It’s the mountpoint that I’m not sure about. Would it be / because / is mounted to the volume as a whole, or would it be the partition that is listed in yast as the basis of the btrfs volume? (It says something like btrfs on nvme0n1p3 for all drives in the volume.)

So I worked out the syntax to remove the drive (btrfs device remove 1 /) and I really thought it had worked. I ran around a bit in game, stuff was loading as it should. Closed the game… read only. I checked dmesg, and my other nvme controller had “stopped working”.

idk, I’ve removed my other nvme drive from the filesystem to test with just SATA drives, but honestly it seems unlikely that both nvme controllers would stop working simultaneously while both drives’ actual NAND is fine.
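
For anyone else who needs to do the same removal: the devid that btrfs device remove expects can be read from the btrfs filesystem show listing. Here is a small sketch that pulls it out for a given device path; devid_for_path is a made-up helper name, and it assumes the listing format shown in the first post (devid N size ... path /dev/...).

```shell
#!/bin/sh
# devid_for_path DEVICE
# Read `btrfs filesystem show` output on stdin and print the devid of the
# line whose path matches DEVICE.
devid_for_path() {
    awk -v dev="$1" '$1 == "devid" && $NF == dev { print $2 }'
}

# Usage on a live system (not run here; the mountpoint is any path on the
# mounted volume, which was / in this thread):
# id=$(sudo btrfs filesystem show / | devid_for_path /dev/nvme0n1p3)
# sudo btrfs device remove "$id" /
```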

First, with the information given, the situation is quite unclear:
You talk about BtrFS and about Windows games: where on BtrFS are the Windows games, and how do you run them?
Also you say you have a disk and an SSD, making up two BtrFS volumes (as I understand it). However, from the output it seems that you have one BtrFS volume containing two SSDs (nvme0n1p3 and nvme1n1p2) and disks (/dev/sda1, /dev/sdc1, /dev/sde1), and the other has /dev/sdb and /dev/sdd.
As you give no mount information and no further detail, it’s all a wild guess, but the setup seems to be quite a mess to me.
Also, when showing log messages, it’s important to show the messages where it all begins.
“unaligned transfer” seems very wrong, but without further details given, I guess the setup is just a complete mess.
So my recommendation would be to copy all data as long as you can, and set up new filesystems in a clean way, preferably (if you run Windows, too) with each OS on different physical block devices.

I don’t wish to be rude but everything you ask is in the first post.

My Windows games are run via Proton if they are Steam games, and run via Wine in Lutris if they are not. In any case they are on the SSD volume. The point of specifying Windows games is that the issue that ended up being the nvme controller(s) going offline never occurred when the games in question were native Linux games. It only happened when I was using Wine or Proton.

You have misidentified /dev/sda, sdc, and sde, as HDDs. They are not. They are SATA SSDs. You’ll see them identified in the smartctl section.

My SSD volume is mounted to /, and my HDD volume is mounted to /home/trillian/Documents. (I simplified it to saying my entire /home directory was mounted to my HDD in my original post, but this was a bit of an oversimplification. It isn’t really relevant to the issue at hand.) My Steam directory is, I believe, /home/trillian/.steam and my Lutris directory is /home/trillian/Games. They are both on the SSD volume.

I have no idea why you have assumed my setup is “a complete mess”. Personally I find it a very elegant way of separating latency critical data and mass storage to their relevant drives without a lot of micro management and without requiring a fully automated tiering and caching system. It’s pretty rude to assert that someone’s setup is “a complete mess” from a position of complete ignorance, fwiw. I say “complete ignorance” because despite there being a lot of information here, you don’t seem to have looked at it before judging.

I don’t know why you think I should copy any data anywhere. I don’t even bother backing up the SSDs – it’s only games that are easy to reinstall on there, and in any case I haven’t had the issue since I removed both nvme drives from the SSD volume. I regularly back up data from my HDDs, but they have not been the subject of this thread.

I also don’t know why you’re assuming I’m running Windows. This is an opensuse forum. Are you perhaps assuming that I am for some reason installing games to a BTRFS partition and then running them from actual Windows? Because I would not have mentioned Proton or Wine if that’s what I was doing. I would probably just keep things simple and stick to an NTFS partition (or drive) if that were the case.

Personally I still think there is more weirdness happening here than the NVME controller failing when hit by a game hammering its I/O. It survived some hefty data transfers in my testing, and if you look at devid 1 size 232.38GiB used 0.00B path /dev/nvme0n1p, the drive that was reported as causing the issues didn’t even have any data on it. The game was not installed there. Why would the game be hammering this drive specifically? In any case, limiting the SSD volume to SATA drives has fixed the issues, and they’re fast enough.

Regarding this, yeah I copied what I could in the seconds I had before I lost bash and the system rebooted itself. I caught the start of the disk weirdness. I don’t know what “unaligned transfer” means but sr0 is my CD drive. A quick google shows a lot of people asking about this message, and not many answering it. I haven’t noticed any particular issue with actually using my CD drive so I’m content to ignore it. It doesn’t seem to be relevant to this.

Edit: further googling shows that this error is coming from Wine which makes sense given my activities at the time.

An unaligned transfer occurs when data is read from an address that is not evenly divisible by the number of bytes being read.

For example, reading 4 bytes of data from address 0x10004 is fine, but reading 4 bytes of data from address 0x10005 would be an unaligned memory access.

https://docs.kernel.org/core-api/unaligned-memory-access.html

a) The sr driver’s “unaligned transfer” message has nothing to do with memory access. It means that the request’s start position on the device was not a multiple of 512 bytes.
b) The sr driver is for SCSI CD-ROM and so has nothing to do with NVMe.
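
Both meanings of "unaligned" discussed above come down to the same arithmetic: a value is aligned when it is an exact multiple of some unit. Here is a tiny sketch of that check, using the memory addresses from the earlier example and a 512-byte unit for the device case. is_aligned is a made-up helper name, and this only illustrates the arithmetic, not what the kernel actually runs.

```shell
#!/bin/sh
# is_aligned VALUE UNIT
# Succeeds (exit status 0) when VALUE is an exact multiple of UNIT.
is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}

# Memory access example from the post above: a 4-byte read
is_aligned 0x10004 4 && echo "0x10004 is aligned for a 4-byte read"
is_aligned 0x10005 4 || echo "0x10005 is not aligned for a 4-byte read"

# Device example: byte offsets checked against a 512-byte unit
is_aligned 1024 512 && echo "offset 1024 is a multiple of 512"
is_aligned 700 512  || echo "offset 700 is not a multiple of 512"
```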