Thread: SMART hdd Airflow temperature error, should I change the drive?

  1. #1

    I have a home server running 24/7. Last week it crashed on reboot after a kernel upgrade. It was a disk error (something about a SATA interface error, if I remember correctly). After some testing I concluded that it was a bad cable contact, so I changed the SATA cable and it worked fine, but just in case I am monitoring the HDD.

    I had SMART enabled. I have three disks: two of them are in a RAID 1 with the system and the important data; the third, which is the one I'm testing, does not hold very important data, but I would still prefer not to lose it.

    I ran smartctl (from smartmontools): first a short test, then a long test.

    Code:
    aldebaran:~ # smartctl -a /dev/sdc       
    smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.71-default] (SUSE RPM) 
    Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org 
    
    === START OF INFORMATION SECTION === 
    Model Family:     Seagate Barracuda 3.5 
    Device Model:     ST1000DM010-2EP102 
    Serial Number:    ZN101YAD 
    LU WWN Device Id: 5 000c50 0b35bd807 
    Firmware Version: CC43 
    User Capacity:    1,000,204,886,016 bytes [1.00 TB] 
    Sector Sizes:     512 bytes logical, 4096 bytes physical 
    Rotation Rate:    7200 rpm 
    Form Factor:      3.5 inches 
    Device is:        In smartctl database [for details use: -P show] 
    ATA Version is:   ATA8-ACS T13/1699-D revision 4 
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s) 
    Local Time is:    Mon Oct 26 09:16:04 2020 CET 
    SMART support is: Available - device has SMART capability. 
    SMART support is: Enabled 
    
    === START OF READ SMART DATA SECTION === 
    SMART overall-health self-assessment test result: PASSED 
    See vendor-specific Attribute list for marginal Attributes. 
    
    General SMART Values: 
    Offline data collection status:  (0x00) Offline data collection activity 
                                            was never started. 
                                            Auto Offline Data Collection: Disabled. 
    Self-test execution status:      (   0) The previous self-test routine completed 
                                            without error or no self-test has ever  
                                            been run. 
    Total time to complete Offline  
    data collection:                (    0) seconds. 
    Offline data collection 
    capabilities:                    (0x73) SMART execute Offline immediate. 
                                            Auto Offline data collection on/off support. 
                                            Suspend Offline collection upon new 
                                            command. 
                                            No Offline surface scan supported. 
                                            Self-test supported. 
                                            Conveyance Self-test supported. 
                                            Selective Self-test supported. 
    SMART capabilities:            (0x0003) Saves SMART data before entering 
                                            power-saving mode. 
                                            Supports SMART auto save timer. 
    Error logging capability:        (0x01) Error logging supported. 
                                            General Purpose Logging supported. 
    Short self-test routine  
    recommended polling time:        (   1) minutes. 
    Extended self-test routine 
    recommended polling time:        ( 104) minutes. 
    Conveyance self-test routine 
    recommended polling time:        (   2) minutes. 
    SCT capabilities:              (0x1085) SCT Status supported. 
    
    SMART Attributes Data Structure revision number: 10 
    Vendor Specific SMART Attributes with Thresholds: 
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 
      1 Raw_Read_Error_Rate     0x000f   079   063   006    Pre-fail  Always       -       93372496 
      3 Spin_Up_Time            0x0003   098   097   000    Pre-fail  Always       -       0 
      4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9614 
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0 
      7 Seek_Error_Rate         0x000f   080   060   045    Pre-fail  Always       -       103990924 
      9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       12220 
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0 
     12 Power_Cycle_Count       0x0032   091   091   020    Old_age   Always       -       9615 
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0 
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0 
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0 
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0 
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0 
    190 Airflow_Temperature_Cel 0x0022   076   038   040    Old_age   Always   In_the_past 24 (0 14 34 23 0) 
    193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       10037 
    194 Temperature_Celsius     0x0022   024   014   000    Old_age   Always       -       24 (0 14 0 0 0) 
    195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       93372496 
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0 
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0 
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0 
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       11888h+29m+42.832s 
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4676414096 
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       11594611688 
    
    SMART Error Log Version: 1 
    No Errors Logged 
    
    SMART Self-test log structure revision number 1 
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error 
    # 1  Short offline       Completed without error       00%     12213         - 
    # 2  Extended offline    Completed without error       00%     12209         - 
    # 3  Short offline       Completed without error       00%     12207         - 
    # 4  Short offline       Completed without error       00%     12189         - 
    # 5  Short offline       Completed without error       00%     12164         - 
    # 6  Short offline       Completed without error       00%     12140         - 
    # 7  Short offline       Completed without error       00%     12116         - 
    # 8  Short offline       Completed without error       00%     12092         - 
    # 9  Short offline       Completed without error       00%     12068         - 
    #10  Short offline       Completed without error       00%     12044         - 
    #11  Short offline       Completed without error       00%     12020         - 
    #12  Short offline       Completed without error       00%     11996         - 
    #13  Short offline       Completed without error       00%     11972         - 
    #14  Short offline       Completed without error       00%     11948         - 
    #15  Short offline       Completed without error       00%     11924         - 
    #16  Short offline       Completed without error       00%     11900         - 
    #17  Short offline       Completed without error       00%     11876         - 
    #18  Short offline       Completed without error       00%     11862         - 
    #19  Short offline       Interrupted (host reset)      00%     11840         - 
    #20  Short offline       Interrupted (host reset)      00%     11819         - 
    #21  Short offline       Completed without error       00%     11806         - 
    
    SMART Selective self-test log data structure revision number 1 
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS 
        1        0        0  Not_testing 
        2        0        0  Not_testing 
        3        0        0  Not_testing 
        4        0        0  Not_testing 
        5        0        0  Not_testing 
    Selective self-test flags (0x0): 
      After scanning selected spans, do NOT read-scan remainder of disk. 
    If Selective self-test is pending on power-up, resume after 0 minute delay. 
    
    aldebaran:~ # smartctl --health /dev/sdc       
    smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.71-default] (SUSE RPM) 
    Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org 
    
    === START OF READ SMART DATA SECTION === 
    SMART overall-health self-assessment test result: PASSED 
    Please note the following marginal Attributes: 
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 
    190 Airflow_Temperature_Cel 0x0022   075   038   040    Old_age   Always   In_the_past 25 (0 14 34 23 0) 
    
    aldebaran:~ #
    

    I see the Airflow_Temperature_Cel error.
    As it is an Old_age error, I understand (as it is said here) that it is not a critical error.

    As I understand it, the disk was at 76 ºC at some time in the past, which is not very good, but when?
    I don't understand the last column, the "24 (0 14 34 23 0)" reported by smartctl -a /dev/sdc,
    which is reported differently by
    smartctl --health /dev/sdc: "25 (0 14 34 23 0)".

    Is 24 or 25 the day of the month of the error?


    Should I change the disk?


    best regards



  2. #2
    Join Date: Aug 2008 | Location: Brazil | Posts: 3,066

    Assuming it is not an actual airflow obstruction affecting this specific drive, I would not take the chance; I would replace it. 1 TB drives are not that expensive nowadays.

    Maybe you could also open your box and check the temperature with a Mark-I thermometer (i.e., your fingers).

    I have a few boxes with SSDs and HDDs; their temperatures are usually in the 30-40 °C range and have never gone above 50 °C so far.

  3. #3
    Join Date: Jun 2008 | Location: Podunk | Posts: 29,810 | Blog Entries: 15

    Hi
    @fperal, you might want to check the manufacturer's spec for the actual attribute; the value and raw value may not be literal numbers, especially for the read and seek error rates. That said, I would look at replacement... Also, did you check the power supply?
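
    To make the point about values versus raw values concrete, here is a minimal Python sketch that splits the attribute 190 row posted above and decodes it. It rests on an assumption that nobody in this thread has confirmed against the manufacturer's spec: that for Airflow_Temperature_Cel the normalized columns are 100 minus the temperature in °C. The posted numbers at least fit that reading (VALUE 076 with raw 24, and VALUE 075 with raw 25 a moment later). The function name decode_airflow_row and the dictionary keys are made up for this example.

```python
# One row of the attribute table, copied from the smartctl -a output in post #1.
LINE = ("190 Airflow_Temperature_Cel 0x0022   076   038   040    "
        "Old_age   Always   In_the_past 24 (0 14 34 23 0)")


def decode_airflow_row(line):
    """Split one smartctl attribute row and decode attribute 190.

    ASSUMPTION (not confirmed in the thread): VALUE, WORST and THRESH
    are stored as 100 minus the temperature in degrees Celsius.
    """
    # Nine whitespace-separated columns, then the free-form RAW_VALUE text.
    fields = line.split(None, 9)
    (attr_id, name, flag, value, worst, thresh,
     kind, updated, when_failed, raw) = fields
    return {
        "name": name,
        "current_c": 100 - int(value),      # temperature now
        "hottest_c": 100 - int(worst),      # hottest ever recorded
        "limit_c": 100 - int(thresh),       # limit implied by THRESH
        "raw_current_c": int(raw.split()[0]),  # first number of RAW_VALUE
        "when_failed": when_failed,
    }


info = decode_airflow_row(LINE)
print(info)
```

    Under that assumption, the row reads as 24 °C now, a hottest recorded temperature of 62 °C, and a limit of 60 °C, which would explain the In_the_past flag. The leading 24 or 25 in the raw column would then simply be the current temperature at the time of each command, not a date. What the numbers in parentheses mean is not decoded here.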
    Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
    SUSE SLE, openSUSE Leap/Tumbleweed (x86_64) | GNOME DE

  4. #4
    Join Date: Jun 2008 | Location: San Diego, Ca, USA | Posts: 12,868 | Blog Entries: 2

    First,
    I'd caution you not to rely too much on your RAID configuration and disk monitoring...
    Each is a good step to take in an overall management strategy, but each only protects against or warns about a few very specific types of incidents. Many more problems can happen than these measures cover. If you truly value your data, nothing can substitute for a good backup strategy based on good practices.

    As for your Airflow temp error...
    I'd differ from the others who have posted.
    I guess I'm just used to seeing problems reported by S.M.A.R.T that, after full analysis, haven't amounted to anything significant (knock on wood, of course).
    Any errors you see are blinking red warning lights: inspect and evaluate them, and watch closely to understand whether they are part of a worsening trend or merely an isolated, anomalous incident.

    The thing to understand about disk data failures is that they don't happen as a bolt of lightning from nowhere, or at least I haven't seen it.
    S.M.A.R.T can alert you early that something has happened, but if it's not a problem that had already been developing for some time, it would still take many other incidents to warrant concern.
    The fastest I've personally witnessed a disk go bad, to the point where I was frightened I could lose data, was over a two-week period. S.M.A.R.T alerted me, and I monitored the disk closely as it continued to fail, but I still had enough time to obtain a replacement and swap it into the array before the failing disk had to be removed.
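
    One way to watch for a worsening trend rather than a single incident is to save the smartctl attribute table at intervals and compare snapshots. The Python sketch below does that for a handful of raw counters. The choice of counters in WATCHED is an assumption about what is worth tracking, the function names are made up for this example, and the second snapshot (NEW) is invented purely for illustration; it is not data from this thread.

```python
def parse_attributes(report):
    """Return {attribute_name: (normalized_value, raw_text)} from smartctl attribute rows."""
    table = {}
    for line in report.splitlines():
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID and have 10 columns.
        if len(fields) == 10 and fields[0].isdigit() and fields[3].isdigit():
            table[fields[1]] = (int(fields[3]), fields[9].strip())
    return table


# ASSUMPTION: these raw counters are the ones worth watching for growth.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "Reported_Uncorrect")


def worsening(old_report, new_report, watched=WATCHED):
    """List (name, before, after) for watched counters that grew between snapshots."""
    old, new = parse_attributes(old_report), parse_attributes(new_report)
    grown = []
    for name in watched:
        if name in old and name in new:
            before = int(old[name][1].split()[0])
            after = int(new[name][1].split()[0])
            if after > before:
                grown.append((name, before, after))
    return grown


# Rows copied from the smartctl -a output in post #1.
OLD = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# INVENTED later snapshot, for illustration only.
NEW = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

print(worsening(OLD, NEW))
```

    With the invented snapshot, only Current_Pending_Sector is reported, having gone from 0 to 8. Comparing a snapshot with itself returns an empty list, which is the "isolated incident" case.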

    IMO,
    TSU
    Beginner Wiki Quickstart - https://en.opensuse.org/User:Tsu2/Quickstart_Wiki
    Solved a problem recently? Create a wiki page for future personal reference!
    Learn something new?
    Attended a computing event?
    Post and Share!

  5. #5
    Join Date: Jan 2014 | Location: Erlangen | Posts: 2,037 | Blog Entries: 1

    Quote Originally Posted by fperal View Post
    should I change the disk?
    Replace the HDD with an SSD. I started using SSDs in 2014 and gradually replaced my HDDs as SSDs became cheaper over the years. I keep the HDDs installed and switched to standby. They are good enough for backup:
    Code:
    erlangen:~ # inxi -D 
    Drives:    Local Storage: total: 6.38 TiB used: 1.52 TiB (23.9%)  
               ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 950 PRO 512GB size: 476.94 GiB  
               ID-2: /dev/sda vendor: Western Digital model: WD40EZRX-22SPEB0 size: 3.64 TiB  
               ID-3: /dev/sdb vendor: Crucial model: CT2000BX500SSD1 size: 1.82 TiB  
               ID-4: /dev/sdc vendor: Samsung model: SSD 850 EVO 500GB size: 465.76 GiB  
    erlangen:~ #
    AMD Athlon 4850e (2009), openSUSE 13.1, KDE 4, Intel i3-4130 (2014), i7-6700K (2016), i5-8250U (2018), AMD Ryzen 5 3400G (2020), openSUSE Tumbleweed, KDE Plasma 5

  6. #6

    Quote Originally Posted by karlmistelberger View Post
    Replace the HDD with an SSD. I started using SSDs in 2014 and gradually replaced my HDDs as SSDs became cheaper over the years. I keep the HDDs installed and switched to standby. They are good enough for backup.
    The Samsung 980 PRO has effectively become an EVO: it uses TLC (3 bits per cell), not MLC (2 bits per cell).
    Torrents kill SSDs.
    SSDs die suddenly.

  7. #7

    Quote Originally Posted by Svyatko View Post
    The Samsung 980 PRO has effectively become an EVO: it uses TLC (3 bits per cell), not MLC (2 bits per cell).
    Torrents kill SSDs.
    SSDs die suddenly.
    First, I'd like to say how nice your Samsung 980 is...
    But I'd also be curious why you think torrents might be a problem for it.
    I'm running on a Gen1 Micron NVMe drive. It's quite a bit slower than your stick (still no slouch), and it hasn't had any problem with anything I've run on it.

    TSU
