SMART HDD Airflow temperature error: should I change the drive?

fperal · October 26, 2020, 9:34am

I have a home server running 24/7. Last week it crashed after a reboot following a kernel upgrade. It was a disk error (something about a SATA interface error, if I remember well). After some testing, my conclusion was that it was a bad cable contact, so I changed the SATA cable and it worked OK, but just in case I am monitoring the HD.

I had SMART enabled. I have three disks: two of them are in a RAID 1 with the system and the important data; the third, which is the one I’m testing, does not hold very important data, but I would still prefer not to lose it.

I have run smartmontools: I did a short test and then a long test.

**aldebaran:~ #** smartctl -a /dev/sdc       
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.71-default] (SUSE RPM) 
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org 

=== START OF INFORMATION SECTION === 
Model Family:     Seagate Barracuda 3.5 
Device Model:     ST1000DM010-2EP102 
Serial Number:    ZN101YAD 
LU WWN Device Id: 5 000c50 0b35bd807 
Firmware Version: CC43 
User Capacity:    1,000,204,886,016 bytes [1.00 TB] 
Sector Sizes:     512 bytes logical, 4096 bytes physical 
Rotation Rate:    7200 rpm 
Form Factor:      3.5 inches 
Device is:        In smartctl database [for details use: -P show] 
ATA Version is:   ATA8-ACS T13/1699-D revision 4 
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s) 
Local Time is:    Mon Oct 26 09:16:04 2020 CET 
SMART support is: Available - device has SMART capability. 
SMART support is: Enabled 

=== START OF READ SMART DATA SECTION === 
SMART overall-health self-assessment test result: PASSED 
See vendor-specific Attribute list for marginal Attributes. 

General SMART Values: 
Offline data collection status:  (0x00) Offline data collection activity 
                                        was never started. 
                                        Auto Offline Data Collection: Disabled. 
Self-test execution status:      (   0) The previous self-test routine completed 
                                        without error or no self-test has ever  
                                        been run. 
Total time to complete Offline  
data collection:                (    0) seconds. 
Offline data collection 
capabilities:                    (0x73) SMART execute Offline immediate. 
                                        Auto Offline data collection on/off support. 
                                        Suspend Offline collection upon new 
                                        command. 
                                        No Offline surface scan supported. 
                                        Self-test supported. 
                                        Conveyance Self-test supported. 
                                        Selective Self-test supported. 
SMART capabilities:            (0x0003) Saves SMART data before entering 
                                        power-saving mode. 
                                        Supports SMART auto save timer. 
Error logging capability:        (0x01) Error logging supported. 
                                        General Purpose Logging supported. 
Short self-test routine  
recommended polling time:        (   1) minutes. 
Extended self-test routine 
recommended polling time:        ( 104) minutes. 
Conveyance self-test routine 
recommended polling time:        (   2) minutes. 
SCT capabilities:              (0x1085) SCT Status supported. 

SMART Attributes Data Structure revision number: 10 
Vendor Specific SMART Attributes with Thresholds: 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 
  1 Raw_Read_Error_Rate     0x000f   079   063   006    Pre-fail  Always       -       93372496 
  3 Spin_Up_Time            0x0003   098   097   000    Pre-fail  Always       -       0 
  4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9614 
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0 
  7 Seek_Error_Rate         0x000f   080   060   045    Pre-fail  Always       -       103990924 
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       12220 
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0 
 12 Power_Cycle_Count       0x0032   091   091   020    Old_age   Always       -       9615 
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0 
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0 
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0 
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0 
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0 
190 Airflow_Temperature_Cel 0x0022   076   038   040    Old_age   Always   In_the_past 24 (0 14 34 23 0) 
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       10037 
194 Temperature_Celsius     0x0022   024   014   000    Old_age   Always       -       24 (0 14 0 0 0) 
195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       93372496 
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0 
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0 
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0 
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       11888h+29m+42.832s 
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4676414096 
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       11594611688 

SMART Error Log Version: 1 
No Errors Logged 

SMART Self-test log structure revision number 1 
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error 
# 1  Short offline       Completed without error       00%     12213         - 
# 2  Extended offline    Completed without error       00%     12209         - 
# 3  Short offline       Completed without error       00%     12207         - 
# 4  Short offline       Completed without error       00%     12189         - 
# 5  Short offline       Completed without error       00%     12164         - 
# 6  Short offline       Completed without error       00%     12140         - 
# 7  Short offline       Completed without error       00%     12116         - 
# 8  Short offline       Completed without error       00%     12092         - 
# 9  Short offline       Completed without error       00%     12068         - 
#10  Short offline       Completed without error       00%     12044         - 
#11  Short offline       Completed without error       00%     12020         - 
#12  Short offline       Completed without error       00%     11996         - 
#13  Short offline       Completed without error       00%     11972         - 
#14  Short offline       Completed without error       00%     11948         - 
#15  Short offline       Completed without error       00%     11924         - 
#16  Short offline       Completed without error       00%     11900         - 
#17  Short offline       Completed without error       00%     11876         - 
#18  Short offline       Completed without error       00%     11862         - 
#19  Short offline       Interrupted (host reset)      00%     11840         - 
#20  Short offline       Interrupted (host reset)      00%     11819         - 
#21  Short offline       Completed without error       00%     11806         - 

SMART Selective self-test log data structure revision number 1 
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS 
    1        0        0  Not_testing 
    2        0        0  Not_testing 
    3        0        0  Not_testing 
    4        0        0  Not_testing 
    5        0        0  Not_testing 
Selective self-test flags (0x0): 
  After scanning selected spans, do NOT read-scan remainder of disk. 
If Selective self-test is pending on power-up, resume after 0 minute delay. 

**aldebaran:~ #** smartctl --health /dev/sdc       
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.71-default] (SUSE RPM) 
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org 

=== START OF READ SMART DATA SECTION === 
SMART overall-health self-assessment test result: PASSED 
Please note the following marginal Attributes: 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 
190 Airflow_Temperature_Cel 0x0022   075   038   040    Old_age   Always   In_the_past 25 (0 14 34 23 0) 

**aldebaran:~ #**

I see the Airflow_Temperature_Cel error.
As it is an Old_age attribute, I understand, as it is said here, that it is not a critical error.

As I understand this, the disk has been at 76 ºC at some time in the past, which is not very good, but when?
I don’t understand the last column, that “24 (0 14 34 23 0)” reported by smartctl -a /dev/sdc
… which is reported differently by smartctl --health /dev/sdc: “25 (0 14 34 23 0)”

Is 24 or 25 the day of the month of the error?

Should I change the disk?

Best regards

brunomcl · October 26, 2020, 11:38pm

Assuming it is not an actual airflow obstruction on this specific drive, I would not take the chance; I would change it. 1 TB drives are not that expensive nowadays.

Maybe you could also open your box and check the temperature with a Mark-I thermometer (i.e., your fingers :P)

I have a few boxes with SSDs and HDDs, and their temperatures are usually in the 30-40 °C range and have never gone above 50 °C so far.

malcolmlewis · October 26, 2020, 11:57pm

Hi
@fperal, you might want to check the manufacturer spec for the actual attribute; the normalized value and the raw value may not be literal numbers, especially for those read and seek errors. That said, I would look at replacement… Also, did you check the power supply?
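
To make the point about normalized versus raw values concrete, here is a minimal Python sketch that splits one row of the attribute table posted above into its columns. The names `SmartAttribute`, `parse_attribute_row` and `assumed_celsius` are made up for this illustration. The conversion `100 - VALUE` for attribute 190 is an assumption: it matches the two readings in this thread (VALUE 076 with raw 24, VALUE 075 with raw 25), but it is not confirmed by any manufacturer spec quoted here.

```python
from typing import NamedTuple, Optional


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int        # normalized VALUE column
    worst: int        # normalized WORST column
    thresh: int       # normalized THRESH column
    when_failed: str  # "-", "In_the_past" or "FAILING_NOW"
    raw: str          # RAW_VALUE column, kept as text


def parse_attribute_row(line: str) -> Optional[SmartAttribute]:
    """Parse one row of the smartctl attribute table, or return None."""
    # RAW_VALUE may contain spaces, so split at most 9 times.
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        return None  # header line or unrelated output
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        when_failed=fields[8],
        raw=fields[9].strip(),
    )


def assumed_celsius(normalized: int) -> int:
    """ASSUMPTION: for attribute 190 on this drive, normalized = 100 - temperature."""
    return 100 - normalized


# Row copied from the smartctl -a output in the first post.
row = ("190 Airflow_Temperature_Cel 0x0022   076   038   040    "
       "Old_age   Always   In_the_past 24 (0 14 34 23 0)")
attr = parse_attribute_row(row)
print("current:", assumed_celsius(attr.value), "C")
print("worst:  ", assumed_celsius(attr.worst), "C")
print("limit:  ", assumed_celsius(attr.thresh), "C")
```

If that assumption holds, VALUE 076 is a current reading of 24 °C rather than 76 °C, WORST 038 corresponds to 62 °C, and THRESH 040 to a 60 °C limit, which would explain the In_the_past flag. The 24 and 25 in the raw column would then simply be the temperature at the time of each run, not a date.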

tsu2 · October 27, 2020, 7:31am

First,
I’d caution you not to rely too much on your RAID configuration and disk monitoring…
Each is a good step to take in an overall management strategy, but each only protects against or warns about a few very specific types of incidents. Many more problems can happen than these measures touch on. If you truly value your data, nothing can substitute for a good backup strategy based on good practices.

As for your Airflow temp error…
I’d differ from others who have posted.
I guess I’m just used to seeing problems reported by S.M.A.R.T that after full analysis haven’t amounted to anything significant (knock on wood, of course).
Any errors you see are blinking, red warning lights to inspect and evaluate, and to watch closely so as to understand if it’s part of a worsening trend or merely an incident in a moment of time that is anomalous and isolated.

The thing to understand about disk data failures is that they don’t happen as a bolt of lightning from nowhere, or at least I haven’t seen it.
S.M.A.R.T can alert you early that something has happened, but if it’s not a problem which had been developing for some time already, it would still take many other incidents to warrant concern.
The fastest I’ve personally witnessed a disk go bad to where I was frightened I could lose data was over a 2 week period. S.M.A.R.T alerted me, and I monitored the disk closely as it continued to fail but I still had enough time to obtain a replacement disk and replace the disk in its array before the disk had to be removed.
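
Watching for a worsening trend can be sketched roughly as follows, assuming you save the text output of `smartctl -A` (or `-a`) from time to time and compare two snapshots. The function names and the choice of watched attribute IDs are my own assumptions for illustration, not an official list, and the second snapshot in the example is hypothetical data, not taken from this drive.

```python
from typing import Dict, Tuple

# ASSUMPTION: a hand-picked set of counters where a rising raw value
# is usually worth a closer look.
WATCHED_IDS = {
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def raw_counters(smartctl_output: str) -> Dict[int, int]:
    """Extract the leading integer of RAW_VALUE for each watched attribute."""
    counters: Dict[int, int] = {}
    for line in smartctl_output.splitlines():
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # not an attribute row
        attr_id = int(fields[0])
        first_token = fields[9].split()[0]
        if attr_id in WATCHED_IDS and first_token.isdigit():
            counters[attr_id] = int(first_token)
    return counters


def worsening(old: Dict[int, int], new: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """Return {id: (old_raw, new_raw)} for counters that increased."""
    return {
        attr_id: (old[attr_id], new[attr_id])
        for attr_id in old
        if attr_id in new and new[attr_id] > old[attr_id]
    }


# First snapshot: rows from the output posted in this thread.
snapshot_old = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
# Second snapshot: HYPOTHETICAL values, only to show what a change looks like.
snapshot_new = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

for attr_id, (before, after) in worsening(
    raw_counters(snapshot_old), raw_counters(snapshot_new)
).items():
    print(f"{WATCHED_IDS[attr_id]}: {before} -> {after}")
```

A single changed counter is a reason to look more closely, not a verdict; what matters is whether the numbers keep climbing between snapshots.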

IMO,
TSU

karlmistelberger · October 27, 2020, 8:00am

fperal:

I have a home server running 24/7. Last week it crashed after a reboot following a kernel upgrade. It was a disk error (something about a SATA interface error, if I remember well). After some testing, my conclusion was that it was a bad cable contact, so I changed the SATA cable and it worked OK, but just in case I am monitoring the HD.

I had SMART enabled. I have three disks: two of them are in a RAID 1 with the system and the important data; the third, which is the one I’m testing, does not hold very important data, but I would still prefer not to lose it.

I have run smartmontools: I did a short test and then a long test.

I see the Airflow_Temperature_Cel error.
As it is an Old_age attribute, I understand, as it is said here (Unraid Docs), that it is not a critical error.

As I understand this, the disk has been at 76 ºC at some time in the past, which is not very good, but when?
I don’t understand the last column, that “24 (0 14 34 23 0)” reported by smartctl -a /dev/sdc
… which is reported differently by smartctl --health /dev/sdc: “25 (0 14 34 23 0)”

Is 24 or 25 the day of the month of the error?

Should I change the disk?

Replace the HDD with an SSD. I started using SSDs in 2014 and gradually replaced my HDDs as SSDs became cheaper over the years. I keep the HDDs installed and switched to standby. They are good enough for backup:

erlangen:~ # inxi -D 
Drives:    Local Storage: total: 6.38 TiB used: 1.52 TiB (23.9%)  
           ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 950 PRO 512GB size: 476.94 GiB  
           ID-2: /dev/sda vendor: Western Digital model: WD40EZRX-22SPEB0 size: 3.64 TiB  
           ID-3: /dev/sdb vendor: Crucial model: CT2000BX500SSD1 size: 1.82 TiB  
           ID-4: /dev/sdc vendor: Samsung model: SSD 850 EVO 500GB size: 465.76 GiB  
erlangen:~ #

Svyatko · October 27, 2020, 2:23pm

karlmistelberger:

Replace the HDD with an SSD. I started using SSDs in 2014 and gradually replaced my HDDs as SSDs became cheaper over the years. I keep the HDDs installed and switched to standby. They are good enough for backup:
erlangen:~ # inxi -D 
Drives:    Local Storage: total: 6.38 TiB used: 1.52 TiB (23.9%)  
           ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 950 PRO 512GB size: 476.94 GiB  
           ID-2: /dev/sda vendor: Western Digital model: WD40EZRX-22SPEB0 size: 3.64 TiB  
           ID-3: /dev/sdb vendor: Crucial model: CT2000BX500SSD1 size: 1.82 TiB  
           ID-4: /dev/sdc vendor: Samsung model: SSD 850 EVO 500GB size: 465.76 GiB  
erlangen:~ #

The Samsung 980 PRO has effectively become an EVO: it uses TLC (3 bits per cell), not MLC (2 bits per cell).
Torrents kill SSDs.
SSDs die suddenly.

tsu2 · October 27, 2020, 9:17pm

First, I’d like to say how nice your Samsung 980 is…
But I’d also be curious why you think torrents might be a problem for it.
I’m running on a Gen1 Micron NVMe; it’s quite a bit slower than your stick (still no slouch) and hasn’t had any problem with anything I’ve run on it.

TSU