System incredibly slow after new install with a failing RAID mirror?

So I’ve been fighting constant issues with my Leap 15 system for the past couple of weeks, ever since triggering an upgrade from 42.3. I ended up wiping the system drives and doing a fresh install of Leap 15 after it would no longer boot, but that wasn’t without issues either (problems during the install included not being able to designate my /home partition, and various processes failing to start during boot).

For a brand new install, it takes approximately 12 to 15 minutes to boot, and there are no errors that I can see… it’s just REALLY slow going through the bootup sequence, and it seems to list a lot of actions twice (which doesn’t seem right to me). I’ve been combing through various things on the system to try to figure out why, and I came across the SMART report for my /dev/sdb drive, which is in RAID 1 via mdadm with sda. These drives hold three partitions: /, /boot, and /home. Here’s the current smartctl output for /dev/sdb:

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp150.12.16-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD10EALX-089BA0
Serial Number:    WD-WCATR6102480
LU WWN Device Id: 5 0014ee 20595c862
Firmware Version: 15.01H15
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Oct  3 21:35:11 2018 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x85)    Offline data collection activity
                    was aborted by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  73)    The previous self-test completed having
                    a test element that failed and the test
                    element that failed is not known.
Total time to complete Offline 
data collection:         (17280) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 200) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3037)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   173   021    Pre-fail  Always       -       3850
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       141
  5 Reallocated_Sector_Ct   0x0033   082   082   140    Pre-fail  Always   FAILING_NOW 940
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   019   019   000    Old_age   Always       -       59290
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       100
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       634
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       233
198 Offline_Uncorrectable   0x0030   197   197   000    Old_age   Offline      -       558
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   079   079   000    Old_age   Offline      -       24341

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: unknown failure    90%     59290         -
# 2  Short offline       Completed: unknown failure    90%     59290         -
# 3  Short offline       Completed: read failure       90%     59272         1146730494
# 4  Short offline       Completed: read failure       90%     59129         1022531021
# 5  Short offline       Completed: read failure       90%     59106         1434479156
# 6  Short offline       Completed: read failure       90%     59082         1434479156
# 7  Short offline       Completed: read failure       90%     59058         1434479158
# 8  Short offline       Completed: read failure       10%     59034         1434479184
# 9  Short offline       Completed: read failure       70%     59010         1202331753
#10  Short offline       Completed: read failure       90%     58986         1434479158
#11  Short offline       Completed: read failure       90%     58962         1434479158
#12  Conveyance offline  Completed without error       00%     33309         -
#13  Short offline       Completed without error       00%     33198         -
#14  Extended offline    Completed without error       00%     33121         -
#15  Extended offline    Completed without error       00%     12363         -
#16  Extended offline    Completed without error       00%      8134         -
#17  Short offline       Completed without error       00%      7246         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The drive is failing slowly. But what I can’t wrap my head around is that this is a mirror of sda, which is perfectly healthy. Would all my issues be due to just the one failing drive?

What other elements should I check to make sure I don’t have some other issue with the OS install? The I/O wait percentage in top does seem to spike from time to time… so maybe I’m chasing something that isn’t there? I don’t have much experience with mdadm and drive failures, if this does turn out to be the sole issue.
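Is it worth digging through something like the following as well, or am I off in the weeds? (These are just the checks I’m thinking of; device and unit names are placeholders for my setup.)

systemd-analyze blame | head -20    # which units are eating the 12-15 minute boot
iostat -x 5                         # per-disk utilization/await (sysstat package); sdb pegged near 100% would point at the disk
journalctl -b -p err                # any errors logged during the current boot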

Thanks in advance for any insight you can offer.

Fail is fail. When hardware goes, things break. A failing drive can cause serious slowdowns, because the system will try and retry to successfully read or write the bad media. Most spinning-rust drives will tick away for a while trying to read the bad data. Solution: replace the drive.
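If you want to confirm that is what’s happening, the retries usually show up in the kernel log. Something along these lines should surface them (exact messages vary by drive and kernel):

journalctl -k | grep -iE 'ata[0-9]|i/o error|blk_update'    # link resets, failed reads, retries
dmesg -T | grep -iE 'ata[0-9]|i/o error'                    # same idea, straight from the kernel ring buffer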

Fair enough! Thanks for the reply. I’ll go and buy new drives tonight after work. I’m still a bit miffed as to why the OS took a dive and needed a reinstall after this occurred… I thought the whole point of running the system drive in RAID 1 was to maintain data integrity and uptime. :sarcastic: The system should also have emailed me if there was any precursor to this issue… so I’ll be diving into that as I do the rebuild.
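For anyone following along, these are the monitoring pieces I plan to double-check on the rebuilt system; the mail address is just a placeholder, and the exact service names may differ by distro:

# /etc/smartd.conf - have smartd mail when attributes or self-tests fail
DEVICESCAN -a -m admin@example.com -M test

# /etc/mdadm.conf - have the md monitor mail when an array degrades
MAILADDR admin@example.com

# make sure both monitors are actually enabled and running
systemctl enable --now smartd mdmonitor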

Well, that all assumes the hardware works right, does it not?

RAID does not substitute for backup.

Agreed, it is not a substitute for backup; I have offsite backups running, and it also looks like the local data is intact (a nice bonus). However, to me RAID 1 also means staying operational in the event of a single HDD failure. I had a single failure, but the system still didn’t work (it was having issues and then would not boot)… so to me that points to a problem with how the RAID was being handled, or some other configuration issue. I’m just at a loss at the moment as to why that is and where the problem could be.
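For what it’s worth, the array state checks I know of are roughly these (assuming md0/md1/md2 are the /, /boot and /home mirrors here; device names are placeholders for my setup):

cat /proc/mdstat                        # [UU] = both members active, [U_] = degraded
mdadm --detail /dev/md0                 # state, failed/removed members, resync progress
mdadm --examine /dev/sda1 /dev/sdb1     # per-member superblock view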

But it has not totally failed; it is just very slow. If you remove the drive from the array and substitute another, it should rebuild the array.
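Roughly like this, assuming the mirrors are md0/md1/md2 and the failing disk’s members are sdb1/sdb2/sdb3; substitute the real array and partition names from your setup:

# mark the failing members faulty and pull them out of each array
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm /dev/md2 --fail /dev/sdb3 --remove /dev/sdb3

# after swapping in the new disk, copy the partition layout from the healthy disk
# (this is the MBR form; for GPT disks, sgdisk can replicate the table instead)
sfdisk -d /dev/sda | sfdisk /dev/sdb

# add the new partitions back and let the mirrors resync
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2
mdadm /dev/md2 --add /dev/sdb3

cat /proc/mdstat    # watch the rebuild progress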

Think you missed this part of my original post: :wink:

The originally installed system fell over… so I’m trying to backtrack through what I possibly did wrong for it to just die, if a single HDD failure is the culprit. The new install is functional, but incredibly slow…