RAID1 array constantly degrading

For some weeks now my RAID1 array has been going into a degraded state roughly every second day. Both HDDs are affected - sometimes one, sometimes the other. Re-adding the affected HDD via e.g.

mdadm --re-add /dev/md0 /dev/sdb1

always works without problems. I can’t find any hint as to why this is happening. An extended SMART test does not show errors. Manually checking the RAID does not give me any clue (at least none that I can see). Please find the re-add sequence I use and some of the latest logs below. Any help is appreciated; thanks in advance.
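
For context, the full sequence around that re-add looks roughly like this on my box (a sketch with my device names; the short resync relies on the internal write-intent bitmap the array already has):

cat /proc/mdstat                   # shows which member dropped out, e.g. [_U]
mdadm --detail /dev/md0            # detailed array state
mdadm --re-add /dev/md0 /dev/sdb1  # re-add the dropped partition
watch cat /proc/mdstat             # recovery finishes quickly thanks to the bitmap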

journalctl | grep md0

Oct 29 08:36:59 localhost systemd[1]: Started Timer to wait for more drives before activating degraded array md0..
Oct 29 08:37:29 localhost systemd[1]: Starting Activate md array md0 even though degraded...
Oct 29 08:37:29 localhost kernel: md/raid1:md0: active with 1 out of 2 mirrors
Oct 29 08:37:29 localhost kernel: md0: detected capacity change from 0 to 3906756736
Oct 29 08:37:29 localhost systemd[1]: mdadm-last-resort@md0.service: Deactivated successfully.
Oct 29 08:37:29 localhost systemd[1]: Finished Activate md array md0 even though degraded.
Oct 29 08:37:30 localhost systemd[1]: mdadm-last-resort@md0.timer: Deactivated successfully.
Oct 29 08:37:30 localhost systemd[1]: Stopped Timer to wait for more drives before activating degraded array md0..
Oct 29 08:37:30 localhost systemd-fsck[750]: /dev/md0: clean, 993277/122093568 files, 252383210/488344592 blocks
Oct 29 08:37:30 localhost kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Quota mode: none.
Oct 30 01:00:05 d3417-server.fritz.box root[8341]: mdcheck start checking /dev/md0
Oct 30 01:00:05 d3417-server.fritz.box kernel: md: data-check of RAID array md0
Oct 30 01:00:05 d3417-server.fritz.box kernel: md: md0: data-check done.
Oct 30 01:02:05 d3417-server.fritz.box root[9623]: mdcheck finished checking /dev/md0
Oct 30 07:01:32 d3417-server.fritz.box kernel: md: recovery of RAID array md0
Oct 30 07:10:22 d3417-server.fritz.box kernel: md: md0: recovery done.
Oct 31 07:04:22 d3417-server.fritz.box macosx-prober[21860]: debug: /dev/md0 is not an HFS+ partition: exiting

mdadm --detail /dev/md0

/dev/md0:
           Version : 1.2
     Creation Time : Wed Apr 26 15:14:09 2017
        Raid Level : raid1
        Array Size : 1953378368 (1862.89 GiB 2000.26 GB)
     Used Dev Size : 1953378368 (1862.89 GiB 2000.26 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Oct 31 06:57:40 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : d3417-server:0
              UUID : 3dd9c7b0:7f330c86:56a22768:18efe755
            Events : 24134

    Number   Major   Minor   RaidDevice State
       3       8       17        0      active sync   /dev/sdb1
       2       8       33        1      active sync   /dev/sdc1

smartctl -a /dev/sdb

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.0.3-1-default] (SUSE RPM)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
...

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

...

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   174   173   021    Pre-fail  Always       -       4275
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       163
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3510
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       86
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       39
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3587
194 Temperature_Celsius     0x0022   119   104   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3505         -
# 2  Short offline       Completed without error       00%      3481         -
# 3  Short offline       Completed without error       00%      3480         -
# 4  Extended offline    Completed without error       00%      3462         -
# 5  Short offline       Completed without error       00%      3456         -
# 6  Short offline       Completed without error       00%      3432         -
# 7  Short offline       Completed without error       00%      3408         -
# 8  Short offline       Completed without error       00%      3384         -
# 9  Short offline       Completed without error       00%      3360         -
#10  Short offline       Completed without error       00%      3336         -
#11  Short offline       Completed without error       00%      3312         -
#12  Extended offline    Completed without error       00%      3294         -
#13  Short offline       Completed without error       00%      3288         -
#14  Short offline       Completed without error       00%      3264         -
#15  Short offline       Completed without error       00%      3240         -
#16  Short offline       Completed without error       00%      3216         -
#17  Short offline       Completed without error       00%      3192         -
#18  Short offline       Completed without error       00%      3168         -
#19  Short offline       Completed without error       00%      3149         -
#20  Extended offline    Interrupted (host reset)      10%      3126         -
#21  Short offline       Completed without error       00%      3120         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a /dev/sdc

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.0.3-1-default] (SUSE RPM)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
...

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

...

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   169   169   021    Pre-fail  Always       -       4533
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2840
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20782
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       102
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       6180
194 Temperature_Celsius     0x0022   118   111   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20777         -
# 2  Short offline       Completed without error       00%     20753         -
# 3  Short offline       Completed without error       00%     20752         -
# 4  Extended offline    Completed without error       00%     20734         -
# 5  Short offline       Completed without error       00%     20728         -
# 6  Short offline       Completed without error       00%     20704         -
# 7  Short offline       Completed without error       00%     20680         -
# 8  Short offline       Completed without error       00%     20657         -
# 9  Short offline       Completed without error       00%     20632         -
#10  Short offline       Completed without error       00%     20608         -
#11  Short offline       Completed without error       00%     20584         -
#12  Extended offline    Completed without error       00%     20566         -
#13  Short offline       Completed without error       00%     20560         -
#14  Short offline       Completed without error       00%     20536         -
#15  Short offline       Completed without error       00%     20512         -
#16  Short offline       Completed without error       00%     20488         -
#17  Short offline       Completed without error       00%     20464         -
#18  Short offline       Completed without error       00%     20441         -
#19  Short offline       Completed without error       00%     20421         -
#20  Extended offline    Interrupted (host reset)      10%     20398         -
#21  Short offline       Completed without error       00%     20393         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Nothing worrisome in dmesg or journal either?

You’re not using ancient, red SATA cables, are you?

Nothing worrisome in dmesg or journal either?

I provided the latest output of journalctl above. Do you mean something different?

Latest dmesg looks like this:

dmesg -k | grep md0

[   35.038425] md/raid1:md0: active with 1 out of 2 mirrors
[   35.069232] md0: detected capacity change from 0 to 3906756736
[   35.675090] EXT4-fs (md0): mounted filesystem with ordered data mode. Quota mode: none.

Please let me know if any more detailed log would help. I have to admit that my knowledge for troubleshooting this is limited in every respect.

You’re not using ancient, red SATA cables, are you?

Well, they are actually not red, and I can say that they are not the bulk cables that came with the HDDs. I can’t figure out the make because they were installed 5+ years ago and have run without any trouble (so far).

Will check dmesg as soon as degradation happens again:

dmesg -k | grep mdadm
dmesg -k | grep sdb
dmesg -k | grep sdc
dmesg -k | grep md0

Anything else worth looking for?

EDIT: Just saw that it is in a degraded state now. Will provide additional dmesg logs.

I can’t help but think that the output below is still missing the relevant information about >why< it is in a degraded state …

mdadm --detail /dev/md0

/dev/md0:
           Version : 1.2
     Creation Time : Wed Apr 26 15:14:09 2017
        Raid Level : raid1
        Array Size : 1953378368 (1862.89 GiB 2000.26 GB)
     Used Dev Size : 1953378368 (1862.89 GiB 2000.26 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Oct 31 07:33:46 2022
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : d3417-server:0
              UUID : 3dd9c7b0:7f330c86:56a22768:18efe755
            Events : 24136

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       2       8       33        1      active sync   /dev/sdc1

dmesg -k | grep md0

[   35.038425] md/raid1:md0: active with 1 out of 2 mirrors
[   35.069232] md0: detected capacity change from 0 to 3906756736
[   35.675090] EXT4-fs (md0): mounted filesystem with ordered data mode. Quota mode: none.

dmesg -k | grep sdb

[    0.851102] sd 1:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    0.851105] sd 1:0:0:0: [sdb] 4096-byte physical blocks
[    0.851118] sd 1:0:0:0: [sdb] Write Protect is off
[    0.851121] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    0.851141] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    0.851689] sd 1:0:0:0: [sdb] Preferred minimum I/O size 4096 bytes
[    0.915398]  sdb: sdb1
[    0.915556] sd 1:0:0:0: [sdb] Attached SCSI disk

dmesg -k | grep sdc

[    0.851253] sd 2:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    0.851259] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[    0.851271] sd 2:0:0:0: [sdc] Write Protect is off
[    0.851275] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[    0.851295] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    0.851323] sd 2:0:0:0: [sdc] Preferred minimum I/O size 4096 bytes
[    0.915264]  sdc: sdc1
[    0.915445] sd 2:0:0:0: [sdc] Attached SCSI disk

journalctl | grep md0 (since latest boot with degraded array)

Oct 31 07:33:15 localhost systemd[1]: Started Timer to wait for more drives before activating degraded array md0…
Oct 31 07:33:45 localhost systemd[1]: Starting Activate md array md0 even though degraded…
Oct 31 07:33:46 localhost kernel: md/raid1:md0: active with 1 out of 2 mirrors
Oct 31 07:33:46 localhost kernel: md0: detected capacity change from 0 to 3906756736
Oct 31 07:33:46 localhost systemd[1]: mdadm-last-resort@md0.service: Deactivated successfully.
Oct 31 07:33:46 localhost systemd[1]: Finished Activate md array md0 even though degraded.
Oct 31 07:33:46 localhost systemd[1]: mdadm-last-resort@md0.timer: Deactivated successfully.
Oct 31 07:33:46 localhost systemd[1]: Stopped Timer to wait for more drives before activating degraded array md0…
Oct 31 07:33:46 localhost systemd-fsck[753]: /dev/md0: clean, 1015079/122093568 files, 252412385/488344592 blocks
Oct 31 07:33:46 localhost kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Quota mode: none.
Oct 31 07:33:59 d3417-server.fritz.box macosx-prober[4803]: debug: /dev/md0 is not an HFS+ partition: exiting

So far you have limited your journal search to RAID- or disk-specific errors.

A search for errors in general like

journalctl -p 3

could show other errors which nevertheless might be related to your problem.

Regards

susejunky

Thanks for the suggestion. Please find below the latest two boot processes … there was at least one degradation in between. I know about these “Firmware Bug” messages; they are caused by an older server board and have been there ever since I switched to Tumbleweed. So I think they are not related to the RAID issue.

journalctl -p 3

-- Boot 98486b3aadfa445b980b1e5da4f001b5 --
Oct 29 08:36:56 localhost kernel: DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x00000000cd000000-0x00000000cf7fffff], contact BIOS vendor for fixes
Oct 29 08:36:56 localhost kernel: x86/cpu: SGX disabled by BIOS.
Oct 29 08:36:59 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 29 08:36:59 localhost kernel: ACPI Error: Aborting method \_TZ.TZ00._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 29 08:36:59 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 29 08:36:59 localhost kernel: ACPI Error: Aborting method \_TZ.TZ00._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 29 08:36:59 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 29 08:36:59 localhost kernel: ACPI Error: Aborting method \_TZ.TZ01._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 29 08:36:59 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 29 08:37:00 localhost kernel: ACPI Error: Aborting method \_TZ.TZ01._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 29 08:37:30 localhost /usr/sbin/irqbalance[779]: thermal: socket bind failed.
Oct 29 08:37:35 d3417-server.fritz.box smbd[2296]: [2022/10/29 08:37:35.311526,  0] ../../source3/smbd/server.c:1741(main)
Oct 29 08:37:35 d3417-server.fritz.box smbd[2296]:   smbd version 4.17.2-git.273.a55a83528b9SUSE-oS15.9-x86_64 started.
Oct 29 08:37:35 d3417-server.fritz.box smbd[2296]:   Copyright Andrew Tridgell and the Samba Team 1992-2022
Oct 29 16:11:56 d3417-server.fritz.box kernel: audit: backlog limit exceeded
Oct 29 16:11:56 d3417-server.fritz.box kernel: audit: backlog limit exceeded
Oct 29 16:11:56 d3417-server.fritz.box kernel: audit: backlog limit exceeded
-- Boot b43140aaf75c46da9e5288f678273ec6 --
Oct 31 07:33:12 localhost kernel: DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x00000000cd000000-0x00000000cf7fffff], contact BIOS vendor for fixes
Oct 31 07:33:12 localhost kernel: x86/cpu: SGX disabled by BIOS.
Oct 31 07:33:15 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 31 07:33:15 localhost kernel: ACPI Error: Aborting method \_TZ.TZ00._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 31 07:33:16 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 31 07:33:16 localhost kernel: ACPI Error: Aborting method \_TZ.TZ00._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 31 07:33:16 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 31 07:33:16 localhost kernel: ACPI Error: Aborting method \_TZ.TZ01._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 31 07:33:16 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPCB.HEC.ECAV], AE_NOT_FOUND (20220331/psargs-330)
Oct 31 07:33:16 localhost kernel: ACPI Error: Aborting method \_TZ.TZ01._TMP due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
Oct 31 07:33:47 localhost /usr/sbin/irqbalance[784]: thermal: socket bind failed.
Oct 31 07:33:52 d3417-server.fritz.box smbd[2589]: [2022/10/31 07:33:52.049579,  0] ../../source3/smbd/server.c:1741(main)
Oct 31 07:33:52 d3417-server.fritz.box smbd[2589]:   smbd version 4.17.2-git.273.a55a83528b9SUSE-oS15.9-x86_64 started.
Oct 31 07:33:52 d3417-server.fritz.box smbd[2589]:   Copyright Andrew Tridgell and the Samba Team 1992-2022
Oct 31 08:36:22 d3417-server.fritz.box sshd[11797]: pam_systemd(sshd:session): Failed to release session: Interrupted system call
Oct 31 08:52:42 d3417-server.fritz.box sshd[26324]: pam_systemd(sshd:session): Failed to release session: Interrupted system call

This looks like auditd is active. Does it log any “suspicious” events? Why is it running out of backlog buffers?

Sorry for being nonspecific, but I only know that auditd exists and have only a vague understanding of what it does.
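
That said, whether the audit subsystem is really dropping events should show up in its status counters, for example (a rough sketch; exact options may vary between auditd versions):

auditctl -s                          # kernel audit status, including backlog and lost counters
ausearch -ts today -i | tail -n 50   # recent audit records in human-readable form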

Regards

susejunky

This looks like auditd is active. Does it log any “suspicious” events? Why is it running out of backlog buffers?

I have to admit that I was not even aware of this daemon. I checked the log as well as I could for “raid”, “ata”, “md0”, “sdb”, “sdc”, but it came up with nothing related to the RAID. I can’t explain why it is complaining about backlog buffers. There are older logs present (audit.log.1…4) - maybe this is just a friendly reminder of that. All in all, there is plenty of disk space available.

Regarding Linux I am at a level of “dangerous half-knowledge”. Normally I am used to getting more information than I can digest (or even understand). I consider this situation pretty critical (a backup is all I can do at the moment). I am really wondering why Linux is so silent about the reason for the degradation. That said … again, dangerous half-knowledge on my side.

Would there be a better place to raise this issue? Any mdadm-related place?

Thanks again.

I’m not sure whether it is “a better place” but you could try the openSUSE support mailing list (support@lists.opensuse.org).

In case you are interested in auditd, here you can find the openSUSE Leap 15.4 documentation (which will probably apply to Tumbleweed as well).

Regards

susejunky

Thanks for the suggestion. I will give that a try, even if I am still a little hesitant because this might be a very basic or self-inflicted issue that I am just not able to figure out.

For the sake of completeness: this is the mail I get informing me about the degradation - also without any detailed information.

This is an automatically generated mail message from mdadm
running on d3417-server.fritz.box

A DegradedArray event had been detected on md device /dev/md/d3417-server:0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] 
md0 : active raid1 sdc1[2]
      1953378368 blocks super 1.2 [2/1] [_U]
      bitmap: 15/15 pages [60KB], 65536KB chunk

unused devices: <none>

Grepping purely for md0 or sata provides no context. A preceding or following line or lines could be pointing to an underlying problem.
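
For example, something along these lines shows the surrounding lines as well (a sketch; adjust the patterns and the amount of context as needed):

journalctl | grep -B 10 -A 10 -E 'md0|sdb|sdc|ata'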

Well, they are actually not red, and I can say that they are not the bulk cables that came with the HDDs. I can’t figure out the make because they were installed 5+ years ago and have run without any trouble (so far).
There was a particular shade of red dye used in many old cables that caused accelerated corrosion of the cable’s conductors, but corrosion and thus degradation does not depend on red dye. Does a cable change help?

Grepping purely for md0 or sata provides no context. A preceding or following line or lines could be pointing to an underlying problem.

OK, understood. I repeated everything via ‘grep -A 10’ but could not find anything suspicious. (Just realized that I ignored your hint regarding preceding lines.)

Does a cable change help?

To be honest, I have not tried that so far because I have no quality spare cables. But I see that this has to be sorted out first. Still, even defective cables should throw some kind of read/write error, right?

Will get some cables ASAP and report.

Just some additional information: this RAID only holds seldom-used bulk data and is idle most of the time; it is used daily just for an intermediate backup. It just occurred to me that the degradation never seems to happen while there is an actual workload - e.g. during the intermediate or regular backup.

The cables are “SATA III SilverStone CP07 180°”. I have now ordered “Lindy 33324” cables. They are red, but as far as I understand Lindy has a good reputation for cables (I might be wrong here).

You may consider showing the comprehensive output from the kernel SCSI subsystem:

Leap-15-4:~ # journalctl -b0 _KERNEL_SUBSYSTEM=scsi
Nov 01 08:15:36 Leap-15-4 kernel: scsi host0: ahci
Nov 01 08:15:36 Leap-15-4 kernel: scsi host1: ahci
Nov 01 08:15:36 Leap-15-4 kernel: scsi host2: ahci
Nov 01 08:15:36 Leap-15-4 kernel: scsi host3: ahci
Nov 01 08:15:36 Leap-15-4 kernel: scsi host4: ahci
Nov 01 08:15:36 Leap-15-4 kernel: scsi host5: ahci
Nov 01 08:15:36 Leap-15-4 kernel: scsi host6: ahci
Nov 01 08:15:36 Leap-15-4 kernel: scsi host7: ahci
Nov 01 08:15:37 Leap-15-4 kernel: scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 5
Nov 01 08:15:37 Leap-15-4 kernel: scsi 0:0:0:0: Attached scsi generic sg0 type 0
Nov 01 08:15:37 Leap-15-4 kernel: scsi 1:0:0:0: Direct-Access     ATA      Samsung SSD 850  3B6Q PQ: 0 ANSI: 5
Nov 01 08:15:37 Leap-15-4 kernel: scsi 1:0:0:0: Attached scsi generic sg1 type 0
Nov 01 08:15:38 Leap-15-4 kernel: scsi 4:0:0:0: Direct-Access     ATA      ST2000DM001-1CH1 CC29 PQ: 0 ANSI: 5
Nov 01 08:15:38 Leap-15-4 kernel: scsi 4:0:0:0: Attached scsi generic sg2 type 0
Nov 01 08:15:38 Leap-15-4 kernel: sd 4:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Nov 01 08:15:38 Leap-15-4 kernel: sd 4:0:0:0: [sdc] 4096-byte physical blocks
Nov 01 08:15:38 Leap-15-4 kernel: sd 1:0:0:0: [sdb] 976773168 512-byte logical blocks: (500 GB/466 GiB)
Nov 01 08:15:38 Leap-15-4 kernel: sd 4:0:0:0: [sdc] Write Protect is off
Nov 01 08:15:38 Leap-15-4 kernel: sd 1:0:0:0: [sdb] Write Protect is off
Nov 01 08:15:38 Leap-15-4 kernel: sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Nov 01 08:15:38 Leap-15-4 kernel: sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Nov 01 08:15:38 Leap-15-4 kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 01 08:15:38 Leap-15-4 kernel: sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 01 08:15:38 Leap-15-4 kernel: sd 0:0:0:0: [sda] 488397168 512-byte logical blocks: (250 GB/233 GiB)
Nov 01 08:15:38 Leap-15-4 kernel: sd 0:0:0:0: [sda] Write Protect is off
Nov 01 08:15:38 Leap-15-4 kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 01 08:15:38 Leap-15-4 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 01 08:15:38 Leap-15-4 kernel: sd 0:0:0:0: [sda] supports TCG Opal
Nov 01 08:15:38 Leap-15-4 kernel: sd 0:0:0:0: [sda] Attached SCSI disk
Nov 01 08:15:38 Leap-15-4 kernel: sd 1:0:0:0: [sdb] supports TCG Opal
Nov 01 08:15:38 Leap-15-4 kernel: sd 1:0:0:0: [sdb] Attached SCSI disk
Nov 01 08:15:38 Leap-15-4 kernel: sd 4:0:0:0: [sdc] Attached SCSI disk
Leap-15-4:~ #

Please find the log you suggested below. I even checked the previous boot processes … they all look the same.

journalctl -b -0 _KERNEL_SUBSYSTEM=scsi

Nov 01 07:06:19 localhost kernel: scsi host0: ahci
Nov 01 07:06:19 localhost kernel: scsi host1: ahci
Nov 01 07:06:19 localhost kernel: scsi host2: ahci
Nov 01 07:06:19 localhost kernel: scsi host3: ahci
Nov 01 07:06:19 localhost kernel: scsi host4: ahci
Nov 01 07:06:19 localhost kernel: scsi host5: ahci
Nov 01 07:06:19 localhost kernel: scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 870  1B6Q PQ: 0 ANSI: 5
Nov 01 07:06:19 localhost kernel: scsi 1:0:0:0: Direct-Access     ATA      WDC WD20EFRX-68E 0A82 PQ: 0 ANSI: 5
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/466 GiB)
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: [sda] Preferred minimum I/O size 512 bytes
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: [sdb] 4096-byte physical blocks
Nov 01 07:06:19 localhost kernel: scsi 2:0:0:0: Direct-Access     ATA      WDC WD20EFRX-68E 0A82 PQ: 0 ANSI: 5
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: [sdb] Write Protect is off
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: [sdb] Preferred minimum I/O size 4096 bytes
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: [sdc] 4096-byte physical blocks
Nov 01 07:06:19 localhost kernel: scsi 3:0:0:0: Direct-Access     ATA      CT500MX500SSD1   023  PQ: 0 ANSI: 5
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: [sdc] Write Protect is off
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: [sdc] Preferred minimum I/O size 4096 bytes
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] 976773168 512-byte logical blocks: (500 GB/466 GiB)
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] 4096-byte physical blocks
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] Write Protect is off
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] Preferred minimum I/O size 4096 bytes
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: [sda] supports TCG Opal
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: [sda] Attached SCSI disk
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] supports TCG Opal
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: [sdd] Attached SCSI disk
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: [sdb] Attached SCSI disk
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: [sdc] Attached SCSI disk
Nov 01 07:06:19 localhost kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
Nov 01 07:06:19 localhost kernel: sd 1:0:0:0: Attached scsi generic sg1 type 0
Nov 01 07:06:19 localhost kernel: sd 2:0:0:0: Attached scsi generic sg2 type 0
Nov 01 07:06:19 localhost kernel: sd 3:0:0:0: Attached scsi generic sg3 type 0


The above journal is what one expects for a sound system. You may perform a raw read of the drives (dd if=/dev/sdb of=/dev/null bs=4M) and monitor the output of “journalctl -f _KERNEL_SUBSYSTEM=scsi”.
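
In practice that means something like the following, with the journal followed in a second terminal while the reads run (a sketch; status=progress only prints progress and can be dropped):

# terminal 1: follow kernel SCSI messages
journalctl -f _KERNEL_SUBSYSTEM=scsi

# terminal 2: read each drive once from start to end, discarding the data
dd if=/dev/sdb of=/dev/null bs=4M status=progress
dd if=/dev/sdc of=/dev/null bs=4M status=progress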

Spurious errors are hard to track down: Weihnachtsbescherung | Karl Mistelberger. The drive that failed eight years ago still worked in 2022. Thorough cleaning of the connectors is always worthwhile.

Do the drives spin down when they are idle for a certain time span?

All my NAS drives are configured to never spin down, so I can’t tell what happens if the drives of a RAID don’t spin down/wake up synchronously.
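
If in doubt, the current power mode and the APM setting of a drive can be checked with hdparm, for example (just a pointer; whether APM is reported depends on the drive):

hdparm -C /dev/sdb /dev/sdc   # current power mode: active/idle or standby
hdparm -B /dev/sdb /dev/sdc   # APM level, if the drive supports APM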

Regards

susejunky

@karlmistelberger

OK. Will perform this test and reply. Thank you very much!

EDIT:

sdb and sdc are both part of the active RAID. Would dd if=/dev/sdb of=/dev/null bs=4M be advisable even while the RAID is active? Any other precautions?

@susejunky

Nope, I don’t let them spin down. I did that for some time but sometimes got fishy results, with one HDD spinning and one not. Also, the HDDs regularly spun up for no reason (well, no reason that I was able to figure out).

Raw read access increases stress on the drive. But it won’t be destructive.

I tried dd on both drives - but no log output at all.

Waiting for the new cables. Will report back after changing them.