Error: Currently unreadable (pending) sectors in a RAID

I found this error in the system log:


#cat /var/log/messages
......
2019-09-04T00:02:10.840374+02:00 aldebaran smartd[1050]: Device: /dev/sda [SAT], 5 Currently unreadable (pending) sectors
2019-09-04T00:02:10.881906+02:00 aldebaran smartd[1050]: Device: /dev/sda [SAT], 5 Offline uncorrectable sectors

......


I have run smartctl and …


aldebaran:/home/fernando # smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.13-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD10EZEX-00RKKA0
Serial Number:    WD-WCC1S5715299
LU WWN Device Id: 5 0014ee 208fa863f
Firmware Version: 80.00A80
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Sep  3 23:14:03 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 119) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (10740) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 124) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x30b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       801
  3 Spin_Up_Time            0x0027   175   172   021    Pre-fail  Always       -       2241
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       143
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48822
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       141
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       96
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       46
194 Temperature_Celsius     0x0022   097   076   000    Old_age   Always       -       46
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       70%     48822         1953520936
# 2  Short offline       Completed: read failure       70%     48803         1953520936
# 3  Short offline       Completed: read failure       70%     48779         1953520936
# 4  Extended offline    Completed: read failure       10%     48755         1953520936
# 5  Short offline       Completed: read failure       70%     48731         1953520936
# 6  Short offline       Completed: read failure       70%     48707         1953520936
# 7  Short offline       Completed: read failure       10%     48683         1953520936
# 8  Extended offline    Completed: read failure       10%     48086         1953520936
# 9  Extended offline    Completed: read failure       10%     47414         1953520936
#10  Extended offline    Completed: read failure       10%     46576         1953520936
#11  Extended offline    Completed: read failure       10%     45905         1953520936
#12  Extended offline    Completed: read failure       10%     45234         1953520936
#13  Extended offline    Completed: read failure       10%     44395         1953520936
#14  Extended offline    Completed: read failure       10%     43727         1953520936
#15  Extended offline    Completed: read failure       10%     43053         1953520936
#16  Extended offline    Completed: read failure       10%     42214         1953520936
#17  Extended offline    Completed: read failure       10%     41543         1953520936
#18  Extended offline    Completed: read failure       10%     40871         1953520936
#19  Extended offline    Completed: read failure       10%     40031         1953520936
#20  Extended offline    Completed: read failure       10%     39360         1953520936
#21  Extended offline    Completed: read failure       10%     38521         1953520936

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
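
The attributes in this output that speak most directly to bad sectors are 5, 196, 197 and 198. As a minimal sketch, assuming the attribute table layout shown above (raw value in the tenth column), they can be pulled out with awk. The function name `smart_sector_summary` and the sample variable are made up for illustration; on a live system one would pipe `smartctl -A /dev/sda` into the function instead.

```shell
# Print NAME=RAW_VALUE for the SMART attributes related to bad sectors.
# Expects `smartctl -A`-style attribute lines on stdin.
smart_sector_summary() {
    # NF >= 10 skips shorter lines that merely start with one of these numbers.
    awk 'NF >= 10 && $1 ~ /^(5|196|197|198)$/ { printf "%s=%s\n", $2, $10 }'
}

# Sample lines copied from the sda output above.
sda_sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5'

printf '%s\n' "$sda_sample" | smart_sector_summary
```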



It seems there is an error at sector 1953520936 of sda.
This is inside sda3, almost at its end:


aldebaran:/home/fernando # fdisk -l               
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: WDC WD10EZEX-00R
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x301cad44

Device     Boot     Start        End    Sectors   Size Id Type
/dev/sda1            2048   33556479   33554432    16G 82 Linux swap / Solaris
/dev/sda2  *     33556480  243271679  209715200   100G fd Linux raid autodetect
/dev/sda3       243271680 1953525167 1710253488 815.5G fd Linux raid autodetect


Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: WDC WD10EZEX-00K
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x301cad50

Device     Boot     Start        End    Sectors   Size Id Type
/dev/sdb1            2048   33556479   33554432    16G 82 Linux swap / Solaris
/dev/sdb2  *     33556480  243271679  209715200   100G fd Linux raid autodetect
/dev/sdb3       243271680 1953525167 1710253488 815.5G fd Linux raid autodetect
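
That the failing LBA sits inside sda3, close to its end, can be checked with plain shell arithmetic on the numbers from the self-test log and the fdisk table above. This is only a sketch; the variable names are illustrative, and the 512-byte sector size is the logical sector size fdisk reports.

```shell
bad_lba=1953520936      # LBA_of_first_error from the self-test log
part_start=243271680    # first sector of /dev/sda3
part_end=1953525167     # last sector of /dev/sda3
sector_bytes=512        # logical sector size reported by fdisk

offset_in_part=$((bad_lba - part_start))
sectors_to_end=$((part_end - bad_lba))
kib_to_end=$((sectors_to_end * sector_bytes / 1024))

if [ "$bad_lba" -ge "$part_start" ] && [ "$bad_lba" -le "$part_end" ]; then
    echo "LBA $bad_lba is inside sda3, at partition sector $offset_in_part"
    echo "$sectors_to_end sectors (about $kib_to_end KiB) before the last sector of sda3"
else
    echo "LBA $bad_lba is outside sda3"
fi
```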


sda3 is part of a RAID 1 array, /dev/md1:


aldebaran:/home/fernando # mdadm --detail /dev/md1
/dev/md1:
           Version : 1.0
     Creation Time : Thu Jan 16 20:42:01 2014
        Raid Level : raid1
        Array Size : 855126592 (815.51 GiB 875.65 GB)
     Used Dev Size : 855126592 (815.51 GiB 875.65 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Tue Sep  3 23:49:35 2019
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : sysresccd:1
              UUID : efd16cf8:9a06b080:76cffa9f:5cd051d0
            Events : 13938543

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       19        1      active sync   /dev/sdb3
aldebaran:/home/fernando # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1] 
md1 : active raid1 sdb3[1]
      855126592 blocks super 1.0 [2/1] [_U]



And I can see the raid is not using sda3 anymore.
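
The degraded state shows up in the `[2/1] [_U]` field of /proc/mdstat: two members expected, one present, and the underscore marks the missing slot. A small sketch for spotting that automatically, assuming the mdstat layout shown above (the function name `degraded_arrays` is made up for illustration):

```shell
# Print the name of every md array whose member-status field (e.g. [_U])
# contains an underscore. Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the array being described
        name != "" && /\[[U_]+\]/ {            # the line carrying the status field
            if ($NF ~ /_/) print name
            name = ""
        }
    '
}

# Sample copied from the output above; on a live system use:
#   degraded_arrays < /proc/mdstat
mdstat_sample='Personalities : [raid6] [raid5] [raid4] [raid1]
md1 : active raid1 sdb3[1]
      855126592 blocks super 1.0 [2/1] [_U]'

printf '%s\n' "$mdstat_sample" | degraded_arrays
```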

So, the question is what now?

  1. Replace the disk sda and rebuild the array

  2. Fix the damaged disk with something like this, and then rebuild the array

  3. Just try to rebuild the array without any previous work

  4. Something else?
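
For reference, options 2 and 3 both end with the same step: adding sda3 back to md1. With no write-intent bitmap on the array, that should trigger a full resync that rewrites the partition, which gives the drive a chance to clear or reallocate the pending sectors. The sketch below only prints the commands instead of running them; the function name is hypothetical, the device names are the ones from this thread, and it assumes the disk has been judged worth keeping.

```shell
# Print (do not execute) the commands for re-adding a member to a degraded array.
# Arguments: array device, member partition, whole disk; defaults match this thread.
print_readd_plan() {
    array=${1:-/dev/md1}
    member=${2:-/dev/sda3}
    disk=${3:-/dev/sda}

    # Add the partition back; md then resyncs it from the remaining member.
    echo "mdadm --manage $array --add $member"
    # Watch the recovery progress.
    echo "cat /proc/mdstat"
    # Afterwards, re-check attributes 5, 197 and 198 on the disk.
    echo "smartctl -A $disk"
}

print_readd_plan
```

The printed commands need root, and the resync reads all of sdb3, so a current backup beforehand is prudent.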

best regards

Seeing that many hours, and errors like these, on a WD Blue, I would waste no time replacing both drives. Blue is bottom of the line and not recommended for RAID. I don't recall the specifics of why plain desktop drives are recommended against for RAID, but I have switched my RAIDs to better (Hitachi) drives since learning that.

At some point I would do a long test to be sure about that far end of the drive, but probably not until after WD Blue replacement.

Yesterday I ran a long test on both of them.
This is the output for sda:


aldebaran:/home/fernando # smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.13-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD10EZEX-00RKKA0
Serial Number:    WD-WCC1S5715299
LU WWN Device Id: 5 0014ee 208fa863f
Firmware Version: 80.00A80
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Sep  4 09:24:45 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (10740) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 124) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x30b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       801
  3 Spin_Up_Time            0x0027   175   172   021    Pre-fail  Always       -       2241
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       143
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48832
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       141
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       96
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       46
194 Temperature_Celsius     0x0022   102   076   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     48826         1953520936
# 2  Short offline       Completed: read failure       70%     48822         1953520936
# 3  Short offline       Completed: read failure       70%     48803         1953520936
# 4  Short offline       Completed: read failure       70%     48779         1953520936
# 5  Extended offline    Completed: read failure       10%     48755         1953520936
# 6  Short offline       Completed: read failure       70%     48731         1953520936
# 7  Short offline       Completed: read failure       70%     48707         1953520936
# 8  Short offline       Completed: read failure       10%     48683         1953520936
# 9  Extended offline    Completed: read failure       10%     48086         1953520936
#10  Extended offline    Completed: read failure       10%     47414         1953520936
#11  Extended offline    Completed: read failure       10%     46576         1953520936
#12  Extended offline    Completed: read failure       10%     45905         1953520936
#13  Extended offline    Completed: read failure       10%     45234         1953520936
#14  Extended offline    Completed: read failure       10%     44395         1953520936
#15  Extended offline    Completed: read failure       10%     43727         1953520936
#16  Extended offline    Completed: read failure       10%     43053         1953520936
#17  Extended offline    Completed: read failure       10%     42214         1953520936
#18  Extended offline    Completed: read failure       10%     41543         1953520936
#19  Extended offline    Completed: read failure       10%     40871         1953520936
#20  Extended offline    Completed: read failure       10%     40031         1953520936
#21  Extended offline    Completed: read failure       10%     39360         1953520936

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



And this is the output for sdb:


aldebaran:/home/fernando # smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.13-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD10EZEX-00KUWA0
Serial Number:    WD-WCC1S5888309
LU WWN Device Id: 5 0014ee 25e597293
Firmware Version: 15.01H15
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Sep  4 09:26:39 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (10800) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 118) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x30b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   174   173   021    Pre-fail  Always       -       2291
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       141
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       5
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48829
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       95
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       45
194 Temperature_Celsius     0x0022   103   077   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     48823         -
# 2  Extended offline    Completed without error       00%     48822         -
# 3  Short offline       Completed without error       00%     48819         -
# 4  Short offline       Completed without error       00%     48800         -
# 5  Short offline       Completed without error       00%     48776         -
# 6  Extended offline    Completed without error       00%     48752         -
# 7  Short offline       Completed without error       00%     48728         -
# 8  Short offline       Completed without error       00%     48704         -
# 9  Short offline       Completed without error       00%     48680         -
#10  Extended offline    Completed without error       00%     48083         -
#11  Extended offline    Completed without error       00%     47411         -
#12  Extended offline    Completed without error       00%     46572         -
#13  Extended offline    Interrupted (host reset)      10%     45903         -
#14  Extended offline    Completed without error       00%     45231         -
#15  Extended offline    Completed without error       00%     44392         -
#16  Extended offline    Completed without error       00%     43721         -
#17  Extended offline    Completed without error       00%     43050         -
#18  Extended offline    Completed without error       00%     42211         -
#19  Extended offline    Completed without error       00%     41540         -
#20  Extended offline    Completed without error       00%     40867         -
#21  Extended offline    Completed without error       00%     40028         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



So, you suggest replacing them both although sdb passed the test?

best regards

You have 5.5 years of power-on hours on those bottom-of-the-line desktop drives, which are recommended against in RAID environments. sdb does show a non-zero Reallocated_Event_Count (2, with Reallocated_Sector_Ct at 5), while both counters are still zero on sda. Whether those two events are recent or not I don't know, but the fact that the count isn't zero might amount to a recommendation for administrative action sooner rather than later.
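
The 5.5-year figure follows from the Power_On_Hours raw values in the outputs above. A quick sketch of the conversion, assuming an average year of 365.25 days (the function name is illustrative):

```shell
# Convert a Power_On_Hours raw value to years, using 365.25-day years.
hours_to_years() {
    # LC_ALL=C keeps the decimal separator a dot regardless of system locale.
    LC_ALL=C awk -v h="$1" 'BEGIN { printf "%.2f\n", h / (24 * 365.25) }'
}

hours_to_years 48832   # sda
hours_to_years 48829   # sdb
```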