HDD going bad?

Hi,
I just started getting this message a few days ago.

It comes with a yellow warning sign, from “Sorry - KDialog”: “**Your hard disk drive is failing!** S.M.A.R.T. message: Device: /dev/sda.1”

openSUSE 11.0, Linux 2.6.25.16-0.1-default x86_64

Things seem to work except for this message at boot up.
How does this work? What do I need to do?

Thanks,
Ron:’(

What, no error log or details?

Self-Monitoring, Analysis, and Reporting Technology - Wikipedia, the free encyclopedia.

I don’t know where to find an error log. In the Wikipedia article I didn’t see anything more than what S.M.A.R.T. is, and that’s all above my level.

Thanks,
Ron

You may need to act quickly. Read the article referenced above to understand what SMART is about. In openSUSE, a background service monitors the SMART status of all your SMART-enabled drives. The message indicates that SMART has detected an impending hardware problem.

You can use the smartctl program to see the current SMART status (the on-going self-monitoring the drive does) and also to test the drive. Take a look at the man page for all the options, but this will get you to what you need most:

smartctl -a /dev/sda

smartctl -t long /dev/sda

Before you take another breath, back up your data.

Did that. Thanks, I wonder if this drive is really going to croak. It’s not very old.

The smartctl command I gave you above will tell you what SMART is reporting, which comes from the firmware in the drive. A very small but predictable percentage of drives fail early in life (infant mortality).

Sorry for not posting the output.

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE family
Device Model: WDC WD3200JB-00KFA0
Serial Number: WD-WCAMR4003329
Firmware Version: 08.05J08
User Capacity: 320,072,933,376 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Oct 12 21:35:30 2008 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (9600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 116) minutes.
Conveyance self-test routine
recommended polling time: ( 6) minutes.
SCT capabilities: (0x001f) SCT Status supported.
SCT Feature Control supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   211   176   021    Pre-fail  Always       -       4408
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       539
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       4
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8424
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       539
194 Temperature_Celsius     0x0022   108   102   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       47
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 426 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.

Error 426 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:40.700 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:40.680 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:40.675 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:40.675 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:40.675 READ NATIVE MAX ADDRESS EXT

Error 425 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:38.675 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:38.660 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:38.650 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:38.650 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:38.650 READ NATIVE MAX ADDRESS EXT

Error 424 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:36.650 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:36.635 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:36.625 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:36.625 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:36.625 READ NATIVE MAX ADDRESS EXT

Error 423 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:34.465 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:34.445 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:34.440 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:34.440 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:34.440 READ NATIVE MAX ADDRESS EXT

Error 422 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:32.435 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:32.420 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:32.410 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:32.410 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:32.410 READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%       8423         335203672
# 2  Extended offline    Completed: read failure        90%       8418         335203672

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):

This is the second command. I don’t understand either of them.

Thanks,
Ron

danorske@linux-9bda:~> su
Password:
linux-9bda:/home/danorske # smartctl -t long /dev/sda
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2008/07/25 10:43:23 $)

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: “Execute SMART Extended self-test routine immediately in off-line mode”.
Drive command “Execute SMART Extended self-test routine immediately in off-line mode” successful.
Testing has begun.
Please wait 116 minutes for test to complete.
Test will complete after Sun Oct 12 23:37:52 2008

Use smartctl -X to abort test.
linux-9bda:/home/danorske #
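For anyone else puzzled by the attribute table in the "smartctl -a" output, here is a rough Python sketch of one way to read it. It is not part of smartmontools. The rows are copied from the output above, and the choice of which raw counters to watch (the hypothetical WATCHED tuple) is an editorial assumption. It illustrates why the overall result can still say PASSED: no normalized VALUE has dropped to its THRESH, even though the raw counts already show reallocated and pending sectors.

```python
# Rows copied from the "smartctl -a /dev/sda" output posted in this thread.
ATTRIBUTE_ROWS = """\
  1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
  5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 4
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 47
"""

# Assumption: raw counters commonly read as direct evidence of bad sectors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(text):
    """Return {name: fields} for every attribute row in the pasted table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has 10 whitespace-separated columns and starts with the ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": int(fields[9]),
        }
    return attrs


def below_threshold(attrs):
    """Names whose normalized VALUE is at or below a non-zero THRESH."""
    return [n for n, a in attrs.items()
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]


def nonzero_watched(attrs):
    """Watched raw counters that are above zero."""
    return {n: attrs[n]["raw"] for n in WATCHED
            if n in attrs and attrs[n]["raw"] > 0}


attrs = parse_attributes(ATTRIBUTE_ROWS)
print("at or below threshold:", below_threshold(attrs))
print("non-zero raw counters:", nonzero_watched(attrs))
```

On this drive the first list is empty while the second shows 4 reallocated sectors and 1 pending sector, which fits a drive that reports PASSED but is already remapping bad sectors.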

As stated, back up your data ASAP and get a new drive.

Geoff

Thanks to All that have replied. I’ll get a new drive.
Regards,
Ron

The output of “smartctl -a” shows you the errors being thrown by the SMART firmware in the drive. It appears these errors are occurring when the drive is first spun up.

Try doing this now

smartctl -H

SMART will give you a “bottom line” assessment of failed or passed. If it says failed then this means the drive has already failed or is predicted to fail within 24 hours. If this says passed, then do this


smartctl -t -l selftest

This is the SMART test I recommended above, but its output went to the SMART log rather than to the terminal. (That is an “l” as in “lake”.) The -l option will force the log to be written to the terminal, so just leave the window open and running.

The Caviar series drive you have is an excellent drive (I have a half-dozen). But any drive can fail. Generally when they do, they are either very new or very old. Contact Western Digital or the computer manufacturer with the SMART information.

On Mon, 2008-10-13 at 01:56 +0000, danorske wrote:
> swerdna;1882491 Wrote:
> > Before you take another breath, back up your data.
>
> Did that. Thanks, I wonder if this drive is really going to croak. It’s
> not very old.
>
>

From the other messages in the thread, the drive is bad, but if it’s
not that old, it may be under warranty and so you can get a new one
without paying for the drive.

linux-9bda:/home/danorske # smartctl -H /dev/sda
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2008/07/25 10:43:23 $)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

linux-9bda:/home/danorske #

Then I ran the self-test and got this.

linux-9bda:/home/danorske # smartctl -t -l selftest
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2008/07/25 10:43:23 $)

=======> INVALID ARGUMENT TO -t: -l
=======> VALID ARGUMENTS ARE: offline, short, long, conveyance, select,M-N, pending,N, afterselect,[on|off], scttempint,N[,p] <=======

Use smartctl -h to get a usage summary

linux-9bda:/home/danorske #

Hi
You need to specify the device, e.g. /dev/sda, and give each switch its
argument, e.g.:


kermit-opensuse:~ # smartctl -t short -l error /dev/sda
smartctl 5.39 2008-05-08 21:56 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen,
http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in
off-line mode".
Drive command "Execute SMART Short self-test routine immediately in
off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Oct 13 20:12:20 2008

Use smartctl -X to abort test.


Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 11.0 x86 Kernel 2.6.25.16-0.1-default
up 20:22, 2 users, load average: 0.09, 0.04, 0.03
GPU GeForce 6600 TE/6200 TE - Driver Version: 177.80

OK,
Here is the output of that command.

linux-9bda:/home/danorske # smartctl -t short -l error /dev/sda
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 426 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It “wraps” after 49.710 days.

Error 426 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:40.700 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:40.680 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:40.675 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:40.675 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:40.675 READ NATIVE MAX ADDRESS EXT

Error 425 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:38.675 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:38.660 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:38.650 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:38.650 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:38.650 READ NATIVE MAX ADDRESS EXT

Error 424 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:36.650 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:36.635 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:36.625 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:36.625 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:36.625 READ NATIVE MAX ADDRESS EXT

Error 423 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:34.465 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:34.445 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:34.440 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:34.440 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:34.440 READ NATIVE MAX ADDRESS EXT

Error 422 occurred at disk power-on lifetime: 8417 hours (350 days + 17 hours)
When the command that caused the error occurred, the device was active or idle

After command completion occurred, registers were:
ER ST SC SN CL CH DH


40 51 01 58 cd fa e0 Error: UNC 1 sectors at LBA = 0x00facd58 = 16436568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name


25 00 08 51 cd fa 13 58 00:34:32.435 READ DMA EXT
27 00 00 00 00 00 00 58 00:34:32.420 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 58 00:34:32.410 IDENTIFY DEVICE
ef 03 45 00 00 00 00 58 00:34:32.410 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 58 00:34:32.410 READ NATIVE MAX ADDRESS EXT

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line
Drive command "Execute SMART Short self-test routine immediately in off-line mod
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Oct 13 20:41:31 2008

Use smartctl -X to abort test.
linux-9bda:/home/danorske #
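A note on reading the register line in each error entry above ("40 51 01 58 cd fa e0"): the LBA that smartctl prints can be rebuilt from the SN, CL, CH and DH bytes. The small Python sketch below shows that arithmetic for a 28-bit address. The function name is made up for illustration, and it assumes the usual ATA register layout, where bit 6 of the Error register is UNC (uncorrectable data) and the low four bits of DH hold the top of the address.

```python
def decode_error_registers(line):
    """Decode an 'ER ST SC SN CL CH DH' line (hex bytes) from the error log."""
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in line.split()[:7])
    # 28-bit LBA: low 4 bits of DH, then CH, CL, SN from high to low byte.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "lba": lba,
        "uncorrectable": bool(er & 0x40),  # UNC bit in the Error register
        "sector_count": sc,
    }


info = decode_error_registers("40 51 01 58 cd fa e0")
print(info["lba"], hex(info["lba"]), info["uncorrectable"])
```

This gives 16436568 (0xfacd58), matching the "UNC 1 sectors at LBA = 0x00facd58 = 16436568" text in all five logged errors, so the same sector failed to read on each retry.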

Then I tried this. Does this mean anything?

linux-9bda:/home/danorske # smartctl -t short -l selftest /dev/sda
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2008/07/25 10:43:23 $)

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure        90%       8436         335203672
# 2  Short offline       Completed: read failure        90%       8436         335203672
# 3  Extended offline    Completed: read failure        90%       8424         335203672
# 4  Extended offline    Completed: read failure        90%       8423         335203672
# 5  Extended offline    Completed: read failure        90%       8418         335203672

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: “Execute SMART Short self-test routine immediately in off-line mode”.
Drive command “Execute SMART Short self-test routine immediately in off-line mode” successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Oct 13 20:48:08 2008

Use smartctl -X to abort test.
linux-9bda:/home/danorske #
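To put the self-test log above in perspective, here is a rough Python sketch that checks whether every failed test stopped at the same LBA and roughly where on the disk that is. The entries are copied from the log. The 512-byte sector size is an assumption (the output does not state it), and the capacity is the "User Capacity" line from the earlier "smartctl -a" output.

```python
# (test type, lifetime hours, LBA_of_first_error) copied from the self-test log.
SELF_TESTS = [
    ("Short offline", 8436, 335203672),
    ("Short offline", 8436, 335203672),
    ("Extended offline", 8424, 335203672),
    ("Extended offline", 8423, 335203672),
    ("Extended offline", 8418, 335203672),
]

SECTOR_BYTES = 512                 # assumption: not stated in the output
CAPACITY_BYTES = 320_072_933_376   # "User Capacity" from smartctl -a


def locate(lba):
    """Return (byte offset, fraction of the disk) for a given LBA."""
    offset = lba * SECTOR_BYTES
    return offset, offset / CAPACITY_BYTES


failing_lbas = {lba for _, _, lba in SELF_TESTS}
for lba in sorted(failing_lbas):
    offset, fraction = locate(lba)
    print(f"LBA {lba}: byte offset {offset:,} ({fraction:.1%} into the disk)")
```

All five tests stop at the same LBA, a little over halfway into the disk. The ATA error log reports a different LBA (16436568), so the damage may not be limited to one spot.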

Sorry about leaving off the device name in the above commands - my bad.

The drive is failing. SMART is predicting a high probability that a failure will occur shortly. Looks like the drive has about 350 days of power-on time, so it might just be inside the warranty.
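The 350-day figure comes from the power-on hours in the SMART output, which count time the drive has been powered rather than calendar age, so the drive is at least that old. A tiny Python sketch of the conversion (the function name is just for illustration):

```python
def power_on_age(hours):
    """Split a SMART power-on hour count into whole days and leftover hours."""
    return divmod(hours, 24)


# 8417 is the lifetime hour stamp on the logged errors,
# 8424 is the Power_On_Hours raw value from the attribute table.
for hours in (8417, 8424):
    days, rest = power_on_age(hours)
    print(f"{hours} hours = {days} days + {rest} hours")
```

The first line reproduces the "350 days + 17 hours" that smartctl prints next to each error.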

Hi

That to me would indicate it’s a particular area of the drive that has
gone faulty.

Is there a manufacturer’s diagnostic tool you can run?

Here is a further output from mine, no errors…


kermit-opensuse:~ # smartctl -l selftest /dev/sda
smartctl 5.39 2008-05-08 21:56 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen,
http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17938         -


Cheers Malcolm °¿° (Linux Counter #276890)