Degraded Array

The system is sending me this email:

A DegradedArray event had been detected on md device /dev/md/Volume1_0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md126 : active raid1 sda[0]
1953511424 blocks super external:/md127/0 [2/1] [U_]

md127 : inactive sda[0](S)
3160 blocks super external:imsm

unused devices: <none>

I assume this means that one of the disks in the RAID is failing. If this is correct, I would love some help in knowing how to proceed.
Background: this is a new system (about 4 months in service), set up with four disks: two SSDs with Windows and OpenSuse 15.1 respectively (the operating systems), and two hard drives for the /home directory, set up with mdadm RAID1. (Patches on the OS are up to date.) From what I read in the mdadm man page, this is where the issue lies. It sounds like I need to “fail” the defective disk, physically remove it, get the warranty replacement, then reinstall the disk and rebuild the RAID.
If this is all correct, then my first question is: how do I know which physical disk is the bad one? What would the exact command look like to “fail” the disk? Is there anything else I need to do before removing the disk from the computer?
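
From the man page, I gather the sequence would look something like the following; /dev/sdb is purely a placeholder for whichever disk turns out to be bad, and since this is an IMSM container, I believe the disk is removed from and re-added to the container (/dev/md127) rather than the volume itself:

# Match the serial numbers shown here against the labels printed on the drives
ls -l /dev/disk/by-id/ | grep -v part
smartctl -i /dev/sdb                  # also prints the serial number

# Mark the member faulty in the volume, then pull it out of the container
mdadm --fail /dev/md126 /dev/sdb
mdadm --remove /dev/md127 /dev/sdb

# After swapping the physical disk, add the replacement to the container;
# the rebuild should start automatically
mdadm --add /dev/md127 /dev/sdb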

Here is the output of “ll /dev/md*”:
brw-rw---- 1 root disk 9, 126 Apr 16 18:43 /dev/md126
brw-rw---- 1 root disk 259, 9 Apr 16 18:43 /dev/md126p1
brw-rw---- 1 root disk 259, 10 Apr 16 18:43 /dev/md126p2
brw-rw---- 1 root disk 9, 127 Apr 16 18:43 /dev/md127

/dev/md:
total 0
lrwxrwxrwx 1 root root 8 Apr 16 18:43 imsm0 -> ../md127
lrwxrwxrwx 1 root root 8 Apr 16 18:43 Volume1_0 -> ../md126
lrwxrwxrwx 1 root root 10 Apr 16 18:43 Volume1_0p1 -> ../md126p1
lrwxrwxrwx 1 root root 10 Apr 16 18:43 Volume1_0p2 -> ../md126p2

Here is the output of “mdadm --detail /dev/md126”:
/dev/md126:
         Container : /dev/md/imsm0, member 0
        Raid Level : raid1
        Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
     Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
      Raid Devices : 2
     Total Devices : 1

             State : active, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              UUID : 24021f90:94ce1c7b:2e66a0b1:371dad4c
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       -       0        0        1      removed

Probably there is something else you need: please let me know. FWIW, I am a long-time OpenSuse end user, not afraid of the CLI, but I maintain only my own machines, so I don’t do a lot of under-the-hood work. I have very little experience with software RAID (my previous machine used hardware RAID). Help is appreciated!

(Fr) David Ousley

New 2TB HDDs most likely use SMR, which makes them a poor fit for RAID; at the very least you will need extra settings to deal with them (see the sketch after the links below).
More at
https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
https://blocksandfiles.com/2020/04/15/seagate-2-4-and-8tb-barracuda-and-desktop-hdd-smr/
https://blocksandfiles.com/2020/04/16/toshiba-desktop-disk-drives-undocumented-shingle-magnetic-recording/
https://blocksandfiles.com/2020/04/15/shingled-drives-have-non-shingled-zones-for-caching-writes/
https://blocksandfiles.com/2020/04/20/western-digital-smr-drives-statement/

https://www.tomshardware.com/news/wd-fesses-up-some-red-hdds-use-slow-smr-tech
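
One of the “extra settings” usually meant here is the drive’s error-recovery timeout (SCT ERC, also called TLER), which desktop drives often ship with disabled. A sketch, assuming smartmontools is installed and the drive supports the feature (on many drives the setting does not survive a power cycle):

smartctl -l scterc /dev/sdb          # query the current read/write recovery timeouts
smartctl -l scterc,70,70 /dev/sdb    # limit both to 7.0 seconds, a common RAID value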

> New 2TB HDDs most likely use SMR, which makes them a poor fit for RAID.

The DegradedArray messages did not appear until after several months. If this were the issue, wouldn’t I have gotten the errors from the beginning of the RAID setup?

Thanks.

Your post is unreadable. It is impossible to tell which part is the e-mail you mention, which parts are computer output, and which are your own comments; it is all mixed up. There is a code tag for showing computer output, and a quote tag for visually marking external text.

The only clear thing is that your RAID array lost one member disk. Did you check whether all disks are physically present and detected by the kernel? Are there any errors in the kernel log?
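
For example (assuming a systemd journal; adjust the patterns to your device names):

journalctl -k | grep -iE 'ata[0-9]+|sd[ab]|md'    # kernel messages touching the disks and md
dmesg | grep -iE 'error|fail|timeout'             # quick scan for I/O errors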

Apologies for the lack of clarity.

The kernel log shows nothing. The disks are present, and worked for several months before the error messages began.

Hopefully this will be clearer:

**sjbserver:/proc #** tail mdstat
Personalities : [raid1]  
md126 : active raid1 sda[0]
      1953511424 blocks super external:/md127/0 [2/1] [U_]
       
md127 : inactive sda[0](S)
      3160 blocks super external:imsm
        
unused devices: <none>


Daily mail from root says

"A DegradedArray event had been detected on md device /dev/md/Volume1_0."

This is what I’m asking about. These messages began several months after the machine began its use. Two 2TB hard drives were set up with software RAID1 for /home partition. The OS (Opensuse 15.1) is on an SSD, not part of the RAID.
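
As I understand it, these mails come from the mdadm monitor. A quick check of where they originate and where they are sent, assuming the stock openSUSE setup (the service name is my assumption):

grep -i mailaddr /etc/mdadm.conf      # the MAILADDR line controls the recipient
systemctl status mdmonitor.service    # the monitor that raises DegradedArray events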

**sjbserver:~ #** mdadm --detail /dev/md126
/dev/md126:
         Container : /dev/md/imsm0, member 0
        Raid Level : raid1
        Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
     Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
      Raid Devices : 2
     Total Devices : 1

             State : active, degraded  
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync


              UUID : 24021f90:94ce1c7b:2e66a0b1:371dad4c
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       -       0        0        1      removed
**sjbserver:~ #** 

**sjbserver:~ #** ll /dev/md*
brw-rw---- 1 root disk   9, 126 Apr 16 18:43 **/dev/md126**
brw-rw---- 1 root disk 259,   9 Apr 16 18:43 **/dev/md126p1**
brw-rw---- 1 root disk 259,  10 Apr 16 18:43 **/dev/md126p2**
brw-rw---- 1 root disk   9, 127 Apr 16 18:43 **/dev/md127**

/dev/md:
total 0
lrwxrwxrwx 1 root root  8 Apr 16 18:43 imsm0 -> **../md127**
lrwxrwxrwx 1 root root  8 Apr 16 18:43 Volume1_0 -> **../md126**
lrwxrwxrwx 1 root root 10 Apr 16 18:43 Volume1_0p1 -> **../md126p1**
lrwxrwxrwx 1 root root 10 Apr 16 18:43 Volume1_0p2 -> **../md126p2**
**sjbserver:~ #** 

**sjbserver:~ #** mdadm --detail /dev/md127
/dev/md127:
           Version : imsm
        Raid Level : container
     Total Devices : 1

   Working Devices : 1


              UUID : 5720a588:186d1d2c:088a160e:4464a97f
     Member Arrays : /dev/md/Volume1_0

    Number   Major   Minor   RaidDevice

       -       8        0        -        /dev/sda
**sjbserver:~ #** 


You did not show any evidence that this is true.

**sjbserver:~ #** mdadm --detail /dev/md127
/dev/md127:
           Version : imsm
        Raid Level : container
     Total Devices : 1

   Working Devices : 1

              UUID : 5720a588:186d1d2c:088a160e:4464a97f
     Member Arrays : /dev/md/Volume1_0

    Number   Major   Minor   RaidDevice

       -       8        0        -        /dev/sda
**sjbserver:~ #**

This is Intel fake RAID on your motherboard, and it clearly states that it has only one member disk.

Output of

mdadm --detail-platform
mdadm --examine --verbose /dev/sda

may give some more hints. If you really have other disks, repeat “mdadm --examine” for each of them.
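
For example, to run it over both members in one go (assuming they are /dev/sda and /dev/sdb):

for d in /dev/sda /dev/sdb; do
    mdadm --examine --verbose "$d"
done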

I sometimes see this with cheap SATA-to-USB adapters or SATA cables used for building a NAS with RAID 1, or with an unstable power supply for the HDDs/SSDs.

How about some more details on the hardware you are using? If the 2TB HDDs are really only a few months old, the chance of a hardware issue with the HDDs is rather low.

Do you get the /dev/sdb member of the RAID back after a reboot?

What is the output of

smartctl -a /dev/sdb

at this point?

Drives are Seagate BarraCuda ST2000DM008.

After a second reboot, and after making sure the cables were seated securely in the hard drives, the array appears to be rebuilding:

**sjbserver:~ #** mdadm --detail /dev/md126
/dev/md126:
         Container : /dev/md/imsm0, member 0
        Raid Level : raid1
        Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
     Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
      Raid Devices : 2
     Total Devices : 2

             State : active, degraded, recovering  
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : resync

    Rebuild Status : 4% complete


              UUID : 24021f90:94ce1c7b:2e66a0b1:371dad4c
    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      spare rebuilding   /dev/sdb
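
To keep an eye on the rebuild, something like this should work (watch simply re-runs the command):

watch -n 30 cat /proc/mdstat     # refresh rebuild progress every 30 seconds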


Here is the other output requested.

**sjbserver:/home/dao/dao1 #** smartctl -a /dev/sdb    
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.44-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST2000DM008-2FR102
Serial Number:    ZFL0TXMZ
LU WWN Device Id: 5 000c50 0c28d8924
Firmware Version: 0001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Apr 21 09:43:44 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever  
                                        been run.
Total time to complete Offline  
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine  
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 199) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   057   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       42
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       172652011
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2794 (152 215 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       42
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   065   040    Old_age   Always       -       33 (Min/Max 29/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       65
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4220
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   064   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1702 (135 117 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       6725924447
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       11924063004

SMART Error Log Version: 1
ATA Error Count: 3
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 1811 hours (75 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 73 04 82 87 80 e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 00   8d+15:00:24.745  FLUSH CACHE EXT
  61 00 68 70 32 12 4d 00   8d+15:00:24.744  WRITE FPDMA QUEUED
  61 00 01 ff ff ff 4f 00   8d+15:00:20.155  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   8d+15:00:20.154  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   8d+15:00:20.154  WRITE FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 749 hours (31 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 00 64 e0 08  Error: UNC at LBA = 0x08e06400 = 148923392

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 80 73 e0 48 00      14:39:44.190  READ FPDMA QUEUED
  60 00 80 00 73 e0 48 00      14:39:44.190  READ FPDMA QUEUED
  60 00 80 00 6d e0 48 00      14:39:44.178  READ FPDMA QUEUED
  60 00 80 80 6d e0 48 00      14:39:44.178  READ FPDMA QUEUED
  60 00 80 00 6e e0 48 00      14:39:44.178  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 749 hours (31 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 f8 63 e0 08  Error: UNC at LBA = 0x08e063f8 = 148923384

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 00 72 e0 48 00      14:39:43.897  READ FPDMA QUEUED
  60 00 80 80 71 e0 48 00      14:39:43.897  READ FPDMA QUEUED
  60 00 80 00 71 e0 48 00      14:39:43.897  READ FPDMA QUEUED
  60 00 80 80 70 e0 48 00      14:39:43.897  READ FPDMA QUEUED
  60 00 80 00 70 e0 48 00      14:39:43.897  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1804         -
# 2  Short offline       Completed without error       00%      1780         -
# 3  Short offline       Completed without error       00%      1756         -
# 4  Short offline       Completed without error       00%      1732         -  
# 5  Short offline       Completed without error       00%      1709         -
# 6  Short offline       Completed without error       00%      1685         -
# 7  Short offline       Completed without error       00%      1661         -
# 8  Short offline       Completed without error       00%      1637         -
# 9  Short offline       Completed without error       00%      1613         -
#10  Short offline       Completed without error       00%      1589         -
#11  Extended offline    Completed without error       00%      1569         -
#12  Short offline       Completed without error       00%      1541         -
#13  Short offline       Completed without error       00%      1517         -
#14  Short offline       Completed without error       00%      1493         -
#15  Short offline       Completed without error       00%      1469         -
#16  Short offline       Completed without error       00%      1445         -
#17  Short offline       Completed without error       00%      1421         -
#18  Short offline       Completed without error       00%      1397         -
#19  Short offline       Completed without error       00%      1381         -
#20  Short offline       Completed without error       00%      1375         -
#21  Short offline       Completed without error       00%      1351         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

**sjbserver:~ #** 
**sjbserver:~ #** mdadm --detail-platform
       Platform : Intel(R) Rapid Storage Technology
        Version : 17.2.0.3790
    RAID Levels : raid0 raid1 raid10 raid5
    Chunk Sizes : 4k 8k 16k 32k 64k 128k
    2TB volumes : supported
      2TB disks : supported
      Max Disks : 15
    Max Volumes : 2 per array, 4 per controller
 I/O Controller : /sys/devices/pci0000:00/0000:00:17.0 (SATA)
          Port3 : /dev/sdb (ZFL0TXMZ)
          Port0 : - non-disk device (HL-DT-ST DVDRAM GH24NSC0) -
          Port2 : /dev/sda (ZFL0QW0T)
          Port1 : - no device attached -

**sjbserver:~ #**
**sjbserver:~ #** mdadm --examine --verbose /dev/sda
/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : ef2bd7c5
         Family : ef2bd7c5
     Generation : 000dc8c9
     Attributes : All supported
           UUID : 5720a588:186d1d2c:088a160e:4464a97f
       Checksum : 69e50442 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk00 Serial : ZFL0QW0T
          State : active
             Id : 00000002
    Usable Size : 3907022848 (1863.01 GiB 2000.40 GB)

[Volume1]:
           UUID : 24021f90:94ce1c7b:2e66a0b1:371dad4c
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [U_]
    Failed disk : 1
      This Slot : 0
    Sector Size : 512
     Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
   Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261808
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 397552 (512)
    Dirty State : dirty
     RWH Policy : off

  Disk01 Serial : ZFL0TXMZ
          State : active
             Id : 00000003
    Usable Size : 3907022848 (1863.01 GiB 2000.40 GB)
**sjbserver:~ #** 

… I would start over with fresh, high-quality SATA cables and see if it happens again. Have a look at the SMART output from time to time to see whether the errors come up again; a way to automate that check is sketched after the link below.

https://forums.opensuse.org/showthread.php/509021-SSD-throws-error-quot-WRITE-FPDMA-QUEUED-quot-during-boot
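
One way to automate those SMART checks, assuming smartmontools’ smartd is in use (directives per smartd.conf(5); the schedule is just an example):

# /etc/smartd.conf: monitor both drives, short self-test daily at 02:00,
# long self-test every Saturday at 03:00, mail problems to root
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

Then enable the daemon with “systemctl enable --now smartd”.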

many thanks, raspu!