I have a home server running 24/7. Last week it crashed on reboot after a kernel upgrade. It was a disk error (something about a SATA interface error, if I remember correctly). After some testing I concluded that it was a bad cable contact, so I replaced the SATA cable and it has worked fine since, but just in case I am monitoring the hard disk.
I had SMART enabled. I have three disks: two of them are in a RAID 1 holding the system and the important data; the third, which is the one I'm testing, holds less important data, but I would still prefer not to lose it.
I ran smartmontools: first a short test and then a long test.
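For reference, the tests can be started with commands along these lines. This is only a sketch: the wrapper names `start_selftest` and `show_selftest_log` are made up for illustration, while `-t short`, `-t long` and `-l selftest` are standard smartctl options. It assumes the disk is /dev/sdc and that the commands are run as root.

```shell
# Hypothetical wrappers around smartctl; only the smartctl options are standard.
start_selftest() {
    kind=$1   # "short" or "long"
    disk=$2   # e.g. /dev/sdc
    smartctl -t "$kind" "$disk"
}

show_selftest_log() {
    # Prints the self-test log (the "Num Test_Description Status ..." table).
    smartctl -l selftest "$1"
}

# Usage, as root (the long test needs roughly 104 minutes on this disk):
#   start_selftest short /dev/sdc
#   start_selftest long /dev/sdc
#   show_selftest_log /dev/sdc
```

The tests run inside the drive in the background, so the log only shows the result once the recommended polling time has passed.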
**aldebaran:~ #** smartctl -a /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.71-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 3.5
Device Model: ST1000DM010-2EP102
Serial Number: ZN101YAD
LU WWN Device Id: 5 000c50 0b35bd807
Firmware Version: CC43
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Mon Oct 26 09:16:04 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 104) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 079 063 006 Pre-fail Always - 93372496
3 Spin_Up_Time 0x0003 098 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 091 091 020 Old_age Always - 9614
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 045 Pre-fail Always - 103990924
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 12220
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 091 091 020 Old_age Always - 9615
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 076 038 040 Old_age Always In_the_past 24 (0 14 34 23 0)
193 Load_Cycle_Count 0x0032 095 095 000 Old_age Always - 10037
194 Temperature_Celsius 0x0022 024 014 000 Old_age Always - 24 (0 14 0 0 0)
195 Hardware_ECC_Recovered 0x001a 004 001 000 Old_age Always - 93372496
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 11888h+29m+42.832s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 4676414096
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 11594611688
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 12213 -
# 2 Extended offline Completed without error 00% 12209 -
# 3 Short offline Completed without error 00% 12207 -
# 4 Short offline Completed without error 00% 12189 -
# 5 Short offline Completed without error 00% 12164 -
# 6 Short offline Completed without error 00% 12140 -
# 7 Short offline Completed without error 00% 12116 -
# 8 Short offline Completed without error 00% 12092 -
# 9 Short offline Completed without error 00% 12068 -
#10 Short offline Completed without error 00% 12044 -
#11 Short offline Completed without error 00% 12020 -
#12 Short offline Completed without error 00% 11996 -
#13 Short offline Completed without error 00% 11972 -
#14 Short offline Completed without error 00% 11948 -
#15 Short offline Completed without error 00% 11924 -
#16 Short offline Completed without error 00% 11900 -
#17 Short offline Completed without error 00% 11876 -
#18 Short offline Completed without error 00% 11862 -
#19 Short offline Interrupted (host reset) 00% 11840 -
#20 Short offline Interrupted (host reset) 00% 11819 -
#21 Short offline Completed without error 00% 11806 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
**aldebaran:~ #** smartctl --health /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.71-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 075 038 040 Old_age Always In_the_past 25 (0 14 34 23 0)
**aldebaran:~ #**
I see that Airflow_Temperature_Cel is flagged as a marginal attribute (WHEN_FAILED: In_the_past).
Since it is an Old_age attribute, I understand, as it is said here, that it is not a critical error.
As I understand it, the disk reached 76 ºC at some point in the past, which is not good, but when?
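To cross-check my reading, here is a small sketch of an alternative interpretation. It assumes (I am not certain of this) the convention often described for attribute 190 on Seagate drives, where the normalized VALUE, WORST and THRESH columns are 100 minus a temperature in ºC. The attribute line is copied from the `smartctl -a` output above; the helper name `airflow_temps` is my own.

```shell
# Attribute 190 line copied verbatim from the smartctl -a output above.
line='190 Airflow_Temperature_Cel 0x0022 076 038 040 Old_age Always In_the_past 24 (0 14 34 23 0)'

# ASSUMPTION: normalized columns encode "100 - temperature in Celsius".
# airflow_temps is a hypothetical helper, not part of smartmontools.
airflow_temps() {
    printf '%s\n' "$1" | awk '{
        # $4 = VALUE, $5 = WORST, $6 = THRESH
        printf "current=%d worst=%d limit=%d\n", 100 - $4, 100 - $5, 100 - $6
    }'
}

airflow_temps "$line"
# prints: current=24 worst=62 limit=60
```

If that assumption holds, the 076 would mean 24 ºC now rather than 76 ºC, and the In_the_past flag would come from a worst case of 62 ºC against a limit of 60 ºC. I would appreciate confirmation that this is the right way to read it.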
I don’t understand the last column, the “24 (0 14 34 23 0)” reported by smartctl -a /dev/sdc, which smartctl --health /dev/sdc reports differently as “25 (0 14 34 23 0)”.
Is 24 or 25 the day of the month of the error?
Should I replace the disk?
Best regards