Hard disk shut down

Hi!

I have a computer running 24/7.
It has two hard disks: one IDE (sdb) and one SATA (sda).
Recently sda has stopped three times (it has just one partition: sda1 = /home).

Here is the log from the most recent stop:


Mar 20 03:14:14 aldebaran kernel: [113147.280104] sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 20 03:14:14 aldebaran kernel: [113147.280111] sd 0:0:0:0: [sda]  Sense Key : Aborted Command [current] [descriptor]
Mar 20 03:14:14 aldebaran kernel: [113147.280138] sd 0:0:0:0: [sda]  Add. Sense: No additional sense information
Mar 20 03:14:14 aldebaran kernel: [113147.280145] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 04 91 ab df 00 00 08 00
Mar 20 03:14:14 aldebaran kernel: [113147.280157] end_request: I/O error, dev sda, sector 76655583
Mar 20 03:14:14 aldebaran kernel: [113147.280168] Buffer I/O error on device sda1, logical block 9581940
Mar 20 03:14:14 aldebaran kernel: [113147.280172] lost page write due to I/O error on sda1
Mar 20 03:14:14 aldebaran kernel: [113147.280205] sd 0:0:0:0: [sda] killing request
Mar 20 03:14:14 aldebaran kernel: [113147.280251] sd 0:0:0:0: [sda] Unhandled error code
Mar 20 03:14:14 aldebaran kernel: [113147.280255] sd 0:0:0:0: [sda]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 20 03:14:14 aldebaran kernel: [113147.280260] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 00 3f 00 00 08 00
Mar 20 03:14:14 aldebaran kernel: [113147.280270] end_request: I/O error, dev sda, sector 63
Mar 20 03:14:14 aldebaran kernel: [113147.280274] Buffer I/O error on device sda1, logical block 0
Mar 20 03:14:14 aldebaran kernel: [113147.280278] lost page write due to I/O error on sda1
Mar 20 03:14:14 aldebaran kernel: [113147.280314] JBD: Detected IO errors while flushing file data on sda1
Mar 20 03:14:14 aldebaran kernel: [113147.280415] Aborting journal on device sda1.
Mar 20 03:14:14 aldebaran kernel: [113147.280438] JBD: I/O error detected when updating journal superblock for sda1.
Mar 20 03:14:14 aldebaran kernel: [113147.750223] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Mar 20 03:14:14 aldebaran kernel: [113147.750321] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 20 03:14:14 aldebaran kernel: [113147.750328] sd 0:0:0:0: [sda] Stopping disk
Mar 20 03:14:14 aldebaran kernel: [113147.750347] sd 0:0:0:0: [sda] START_STOP FAILED
Mar 20 03:14:14 aldebaran kernel: [113147.750352] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK



Rebooting the computer solves the problem (the disk starts again).
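
As a cross-check on the log above, the CDB lines can be decoded by hand to confirm which sectors the failed writes were aimed at. Below is a minimal Python sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The helper names `decode_rw10` and `fs_block` are made up for illustration. The partition start (sector 63) and the 8-sectors-per-block figure are inferred from the log itself, where sector 63 is reported as logical block 0 of sda1, so treat them as assumptions.

```python
def decode_rw10(cdb_hex):
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    opcode = "Write(10)" if cdb[0] == 0x2A else "Read(10)"
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return opcode, lba, blocks


def fs_block(sector, part_start=63, sectors_per_block=8):
    """Map a disk sector to a filesystem block number inside the partition.

    part_start and sectors_per_block are assumptions inferred from the log
    (sector 63 == logical block 0, 4096-byte blocks on 512-byte sectors).
    """
    return (sector - part_start) // sectors_per_block


if __name__ == "__main__":
    for cdb in ("2a 00 04 91 ab df 00 00 08 00",
                "2a 00 00 00 00 3f 00 00 08 00"):
        op, lba, count = decode_rw10(cdb)
        print(op, "sector", lba, "count", count, "fs block", fs_block(lba))
```

The decoded sectors (76655583 and 63) and filesystem blocks (9581940 and 0) agree with the `end_request` and `Buffer I/O error` lines, so the log is at least internally consistent: the second failed write was to the very first block of sda1.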

The disk has SMART enabled and I’ve tested it, but I can’t see any problem. I’ve just run a short test and then a long test, and this is the result:



aldebaran:/home/fernando # smartctl -a /dev/sda
smartctl 6.0 2012-10-10 r3643 [i686-linux-3.4.28-2.20-default] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12
Device Model:     ST3250318AS
Serial Number:    9VM83K0Y
LU WWN Device Id: 5 000c50 01a3b4444
Firmware Version: CC38
User Capacity:    250.059.350.016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Mar 20 21:11:43 2013 CET

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/213891en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  592) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  42) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       135810410
  3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1200
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       18037715
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       10843
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       604
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295098372
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   048   045    Old_age   Always       -       30 (Min/Max 29/35)
194 Temperature_Celsius     0x0022   030   052   000    Old_age   Always       -       30 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   037   029   000    Old_age   Always       -       135810410
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   179   000    Old_age   Always       -       465
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       133818296053016
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3170061813
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1403077182

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     10843         -
# 2  Short offline       Completed without error       00%     10842         -
# 3  Extended offline    Completed without error       00%      7795         -
# 4  Short offline       Completed without error       00%      7794         -
# 5  Short offline       Completed without error       00%      1346         -
# 6  Short offline       Completed without error       00%      1345         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
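
With a table this long it is easy to overlook the one or two counters that are not zero. The following is a rough Python sketch that scans pasted `smartctl -a` text and reports nonzero raw values for a handful of attributes. The choice of attribute IDs in `WATCHED` and the function name `flag_attributes` are my own assumptions for illustration, not anything smartctl defines. Raw values are vendor-specific, so this is a hint and not a verdict.

```python
# Attribute IDs worth a second look when their raw value is nonzero.
# This selection is an assumption for illustration, not an official list.
WATCHED = {5, 197, 198, 199}


def flag_attributes(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   179   000    Old_age   Always       -       465
"""
    print(flag_attributes(sample))
```

On the rows above, only `UDMA_CRC_Error_Count` (raw value 465) is flagged. That counter records transfer errors on the link between drive and controller, so it tends to point at the cable, connector, or port more than at the platters. It is a lifetime total, so it does not say when the errors happened.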

Maybe a controller error?

Regards

On Wed, 20 Mar 2013 20:56:02 +0000, fperal wrote:

> may be a controller error?

That’s what it looks like to me - looks a lot like a hardware problem.

Jim


Jim Henderson
openSUSE Forums Administrator

fperal wrote:
> may be a controller error?

It could be, or a cable problem, or even an intermittent fault in the
disk itself. There is also a slim possibility of a PSU problem.

I would try the following steps, in order, until one fixes the problem:

(1) confirm that your PSU is chunky enough for whatever you’ve got
plugged in

(2) confirm that the disk firmware and any controller firmware are up to date

(3) unplug and replug the disk cable (and the controller, if it is not on
the motherboard)

(4) replace the disk cable

(5) buy a cheap SATA adapter card and use it instead of the existing port

(6) replace the disk

(7) panic :-( and replace the PSU and/or motherboard!

Hi,

I think I’m having the same trouble here. It is a Dell Latitude E6410 with a 128 GB Crucial M4 SSD. Everything worked fine for about a year and a half, and now suddenly this:


Aug 17 17:28:38 jano kernel: ata1.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x6 frozen
Aug 17 17:28:38 jano kernel: ata1.00: failed command: READ FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [135B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1.00: failed command: READ FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [136B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1.00: failed command: READ FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [135B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1.00: failed command: READ FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [136B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [136B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [136B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [136B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Aug 17 17:28:38 jano kernel: [137B blob data]
Aug 17 17:28:38 jano kernel: ata1.00: status: { DRDY }
Aug 17 17:28:38 jano kernel: ata1: hard resetting link
Aug 17 17:28:43 jano kernel: ata1: link is slow to respond, please be patient (ready=0)
Aug 17 17:28:48 jano kernel: ata1: COMRESET failed (errno=-16)
Aug 17 17:28:48 jano kernel: ata1: hard resetting link
Aug 17 17:28:53 jano kernel: ata1: link is slow to respond, please be patient (ready=0)
Aug 17 17:28:58 jano kernel: ata1: COMRESET failed (errno=-16)
Aug 17 17:28:58 jano kernel: ata1: hard resetting link
Aug 17 17:29:04 jano kernel: ata1: link is slow to respond, please be patient (ready=0)
Aug 17 17:29:33 jano kernel: ata1: COMRESET failed (errno=-16)
Aug 17 17:29:33 jano kernel: ata1: limiting SATA link speed to 1.5 Gbps
Aug 17 17:29:33 jano kernel: ata1: hard resetting link
Aug 17 17:29:38 jano kernel: ata1: COMRESET failed (errno=-16)
Aug 17 17:29:38 jano kernel: ata1: reset failed, giving up
Aug 17 17:29:38 jano kernel: ata1.00: disabled
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1.00: device reported invalid CHS sector 0
Aug 17 17:29:38 jano kernel: ata1: EH complete
Aug 17 17:29:38 jano kernel: sd 0:0:0:0: [sda] Unhandled error code
Aug 17 17:29:38 jano kernel: sd 0:0:0:0: [sda]  
Aug 17 17:29:38 jano kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 17 17:29:38 jano kernel: sd 0:0:0:0: [sda] CDB: 
Aug 17 17:29:38 jano kernel: Write(10): 2a 00 04 c4 7f 70 00 00 a8 00
Aug 17 17:29:38 jano kernel: end_request: I/O error, dev sda, sector 79986544


# uname -a
Linux jano 3.7.10-1.16-desktop #1 SMP PREEMPT Fri May 31 20:21:23 UTC 2013 (97c14ba) x86_64 x86_64 x86_64 GNU/Linux


# cat /etc/SuSE-release
openSUSE 12.3 (x86_64)
VERSION = 12.3
CODENAME = Dartmouth

Could someone please suggest how I can verify whether it is my SSD or the controller that is failing? Is it possible that it is a kernel bug? I think this started appearing after the kernel was updated.

Interestingly, after a reset everything works fine for several minutes.
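
One detail that can be read straight off the first line of that log: as far as I understand libata's messages, `SAct` is a bitmask of the NCQ tags that were in flight when the port froze. The small Python sketch below decodes it (the function name `active_ncq_tags` is invented for illustration).

```python
def active_ncq_tags(sact):
    """Return the NCQ tag numbers whose bits are set in a SAct bitmask."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


if __name__ == "__main__":
    tags = active_ncq_tags(0xFF)
    print(len(tags), "commands in flight, tags", tags)
```

For `SAct 0xff` this gives eight tags (0 to 7), which matches the eight `failed command` entries in the log (four READ FPDMA QUEUED and four WRITE FPDMA QUEUED). In other words, every queued command timed out at once, and none of them is singled out as the culprit.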

I noticed this in the smartctl output:

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/213891e

(smartctl is such a wonderful tool.)

Plugging your drive’s serial number and model into the Seagate firmware page referenced there, I see that CC49 firmware is available, which would upgrade you from CC38.
That Seagate page gives no information on what the newer firmware addresses, but you might be able to find release notes somewhere.