Testing hard disk. Failure or not?

Hi!

I have an openSUSE server running 24/7.
Sometimes it unmounts the filesystems on sdb.
The sdb disk is a SATA one connected to the motherboard through a PCI card (Delock 70156 PCI -> SATA + eSATA + IDE) because the motherboard is old and does not have SATA.

When the disk is unmounted it “disappears”. Executing fdisk -l reports no sdb disk.

But rebooting the computer mounts the disk again.

These are some errors in /var/log/messages:




Nov 11 20:41:47 aldebaran kernel: sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 11 20:42:18 aldebaran kernel: sd 2:0:0:0: [sdb] 488397168 512-byte hardware sectors: (250 GB/232 GiB)
Nov 11 20:42:49 aldebaran kernel: sd 2:0:0:0: [sdb] Write Protect is off
Nov 11 20:42:49 aldebaran kernel: sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00

.........


Nov 11 21:03:03 aldebaran smartd[3311]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 42 to 41 
Nov 11 23:03:02 aldebaran smartd[3311]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79 
Nov 11 23:03:02 aldebaran smartd[3311]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 22 to 21 

.....

Nov 11 23:43:37 aldebaran kernel: sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 11 23:44:49 aldebaran kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Nov 11 23:44:49 aldebaran kernel: sd 2:0:0:0: [sdb] Sense Key : Aborted Command [current] [descriptor]
Nov 11 23:44:49 aldebaran kernel: sd 2:0:0:0: [sdb] Add. Sense: No additional sense information
Nov 11 23:44:49 aldebaran kernel: end_request: I/O error, dev sdb, sector 16515215
Nov 11 23:44:49 aldebaran kernel: end_request: I/O error, dev sdb, sector 243930255
Nov 11 23:44:49 aldebaran kernel: Buffer I/O error on device sdb1, logical block 30491274
Nov 11 23:44:49 aldebaran kernel: lost page write due to I/O error on sdb1
Nov 11 23:44:49 aldebaran kernel: Aborting journal on device sdb1.
Nov 11 23:44:49 aldebaran kernel: JBD: Detected IO errors while flushing file data on sdb1
Nov 11 23:44:49 aldebaran kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Nov 11 23:44:49 aldebaran kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=516231, block=2064394
Nov 11 23:44:49 aldebaran kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=516231, block=2064394
Nov 11 23:44:49 aldebaran kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=509145, block=2031695
Nov 11 23:44:49 aldebaran kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=516228, block=2064394

........


Nov 11 23:44:50 aldebaran kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=509140, block=2031695
Nov 11 23:44:50 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #778270 offset 0
Nov 11 23:44:50 aldebaran kernel: sd 2:0:0:0: [sdb] Synchronizing SCSI cache
Nov 11 23:44:50 aldebaran kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Nov 11 23:44:51 aldebaran kernel: sd 2:0:0:0: [sdb] Stopping disk
Nov 11 23:44:51 aldebaran kernel: sd 2:0:0:0: [sdb] START_STOP FAILED
Nov 11 23:44:51 aldebaran kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Nov 11 23:44:54 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #12255659 offset 0
Nov 11 23:46:09 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2384311 offset 0
Nov 11 23:47:09 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #1507333 offset 0
Nov 11 23:47:09 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #1507333 offset 0

........

Nov 11 23:48:10 aldebaran kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=14491649, block=57966594
Nov 11 23:48:10 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #13656176 offset 0
Nov 11 23:48:10 aldebaran kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=14491649, block=57966594
Nov 11 23:48:53 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2384325 offset 0
Nov 11 23:48:53 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2392094 offset 0
Nov 11 23:48:53 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2392094 offset 0
Nov 11 23:48:53 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2384296 offset 0
Nov 11 23:48:53 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2392065 offset 0
Nov 11 23:48:53 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2384311 offset 0
Nov 11 23:52:36 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2392094 offset 0
Nov 11 23:53:53 aldebaran kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #2384325 offset 0
Nov 12 00:03:03 aldebaran smartd[3311]: Device: /dev/sdb [SAT], open() failed: No such device 


SMART is enabled on the disk, but I can’t see any errors with it:


aldebaran:/etc/cron.daily # smartctl -a /dev/sdb
smartctl 5.39 2008-10-24 22:33 [i686-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3250318AS
Serial Number:    9VM83K0Y
Firmware Version: CC38
User Capacity:    250,059,350,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Nov 12 11:45:38 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining.
Total time to complete Offline 
data collection:                 ( 592) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  42) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       93972657
  3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       700
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       13041630
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7795
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       348
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   048   045    Old_age   Always       -       29 (Lifetime Min/Max 18/29)
194 Temperature_Celsius     0x0022   029   052   000    Old_age   Always       -       29 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   040   030   000    Old_age   Always       -       93972657
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   179   000    Old_age   Always       -       465
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       104732777521361
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2835927611
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       4151439007

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 60%      7795         -
# 2  Short offline       Completed without error       00%      7794         -
# 3  Short offline       Completed without error       00%      1346         -
# 4  Short offline       Completed without error       00%      1345         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Maybe it is just a PCI controller error?

How can I check it?

regards

On 11/12/2012 11:56 AM, fperal wrote:
> I have an openSUSE server running 24/7.
> Sometimes it unmounts the filesystems on sdb.

what version of openSUSE?
please show us the output of

cat /etc/issue


dd

Could be either one of them. If this is a production server, I’d think about replacement. Did you already perform a filesystem check on /dev/sdb1?

fperal wrote:
> Hi!
>
> I have an openSUSE server running 24/7.
> Sometimes it unmounts the filesystems on sdb.

How often?

> The sdb disk is a SATA one connected to the motherboard through a PCI card
> (Delock 70156 PCI -> SATA + eSATA + IDE) because the motherboard is old
> and does not have SATA.
>
> When the disk is unmounted it “disappears”. Executing fdisk -l reports
> no sdb disk.

What other disks are there on the system? Does sdb perhaps reappear as
sdc or sdd or whatever? Not that it will help a great deal :(

> But rebooting the computer mounts the disk again.
>
> These are some errors in /var/log/messages:

> Maybe it is just a PCI controller error?

That seems like a good possibility. I’m not an expert but the errors
don’t look like ata errors so I’d think it was most likely the
controller, or the PCI bus or a power supply issue.
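
For what it's worth, the two failed sectors in the log line up with the ext3 block numbers reported right after them, and they sit far apart on the disk, which may fit a device dropping off the bus better than one localized bad spot. A rough Python sketch of the arithmetic follows. It assumes sdb1 starts at sector 63 and that the filesystem uses 4096-byte blocks, neither of which is stated in this thread (`fdisk -lu /dev/sdb` and `tune2fs -l /dev/sdb1` would confirm). The function names are made up for illustration.

```python
# Assumptions (not stated in the thread): 512-byte sectors, 4096-byte
# ext3 blocks, and a partition that starts at sector 63.
SECTOR_SIZE = 512
BLOCK_SIZE = 4096
PARTITION_START = 63  # assumed; verify with `fdisk -lu /dev/sdb`


def sector_to_fs_block(disk_sector, part_start=PARTITION_START):
    """Map an absolute disk sector to a filesystem block on the partition."""
    sectors_per_block = BLOCK_SIZE // SECTOR_SIZE
    return (disk_sector - part_start) // sectors_per_block


def sector_to_gib(disk_sector):
    """Byte offset of a sector from the start of the disk, in GiB."""
    return disk_sector * SECTOR_SIZE / 2**30


# Sector numbers taken from the two "end_request: I/O error" lines.
failed_sectors = [16515215, 243930255]

for sector in failed_sectors:
    print(sector, sector_to_fs_block(sector), round(sector_to_gib(sector), 1))
```

Under those assumptions the first sector maps to block 2064394 (the inode block ext3 could not read) and the second to logical block 30491274 (the one in the "Buffer I/O error" line).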

> How can I check it?

All I can think of is to change the hardware. Swap the controller if
possible, or try the disk in another machine.

Also as dd asked, what version OS?

Welcome to openSUSE 11.1 - Kernel \r (\l).

Oooow, that’s old. Support for 11.1 ceased a long time ago.

The server is a “home server”. I have a lot of documentation on it, with periodic backups.

I had run fsck /dev/sdb1 with no errors, but I’ve just forced a recheck with fsck -f and it reported some errors: during pass 1 some inode i_size and some inode i_block errors, and during pass 5 a block bitmap error and a free block count error. I’ve fixed them all.

> How often?

Maybe 2-3 times a month.

>> The sdb disk is a SATA one connected to the motherboard through a PCI card
>> (Delock 70156 PCI -> SATA + eSATA + IDE) because the motherboard is old
>> and does not have SATA.
>>
>> When the disk is unmounted it “disappears”. Executing fdisk -l reports
>> no sdb disk.
>
> What other disks are there on the system? Does sdb perhaps reappear as
> sdc or sdd or whatever? Not that it will help a great deal :(

There is also sda, which is an IDE disk connected to the motherboard; it reports no errors.

sdb does not reappear.

>> But rebooting the computer mounts the disk again.
>>
>> These are some errors in /var/log/messages:
>
>> Maybe it is just a PCI controller error?
>
> That seems like a good possibility. I’m not an expert but the errors
> don’t look like ata errors so I’d think it was most likely the
> controller, or the PCI bus or a power supply issue.
>
>> How can I check it?
>
> All I can think of is to change the hardware. Swap the controller if
> possible, or try the disk in another machine.
>
> Also as dd asked, what version OS?

11.1

regards

Knurpht wrote:
> Could be either one of them.

Either one of what?

fperal wrote:
> dd;2503259 Wrote:
>> On 11/12/2012 11:56 AM, fperal wrote:
>>> I have an openSUSE server running 24/7.
>>> Sometimes it unmounts the filesystems on sdb.
>> what version of openSUSE?
>> please show us the output of
>>
>> cat /etc/issue
>>
>> –
>> dd
>
>
> Welcome to openSUSE 11.1 - Kernel \r (\l).

There have been a lot of improvements to SATA support in the kernel
since then. So although I still think that a controller hardware failure
is most likely, it’s possible you’re hitting some bug in its driver.

So as well as changing the hardware, I’d suggest upgrading the software :)

On 2012-11-12 11:56, fperal wrote:
>
> Hi!
>
> I have an openSUSE server running 24/7.

And 11.1, which is also old…

I would at least try a live CD of a newer version, 11.4 at least (which is
also old, but does not use systemd).

> When the disk is unmounted it “disappears”. Executing fdisk -l reports
> no sdb disk.
>
> But rebooting the computer mounts the disk again.

And what if you disconnect and reconnect the cable? Or power the disk off and on?

> These are some errors in /var/log/messages:

You have I/O errors. Nasty. I would start by suspecting the cabling.
The disk disappears because of those errors, I think.

> SMART is enabled on the disk, but I can’t see any errors with it:

But you have not completed any long test.

>
>
> Code:
> --------------------
> SMART Self-test log structure revision number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Extended offline Self-test routine in progress 60% 7795 -
> # 2 Short offline Completed without error 00% 7794 -
> # 3 Short offline Completed without error 00% 1346 -
> # 4 Short offline Completed without error 00% 1345 -
> --------------------

You might try the external test in SeaTools (Seagate) or an equivalent
tool. It says that it tests the cable.
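
On the cabling question, the smartctl output posted above may already carry a hint: attribute 199 (UDMA_CRC_Error_Count) has a raw value of 465, and that counter normally tracks transfer errors on the link between drive and controller rather than media defects. Below is a minimal Python sketch that picks the non-zero counters out of pasted smartctl attribute lines. The watch list and the function name are choices made for this example, not anything smartctl defines.

```python
# Attribute lines copied from the smartctl -a output earlier in the thread.
SMART_LINES = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   179   000    Old_age   Always       -       465
"""

# Hypothetical watch list: attributes whose raw value is usually a plain
# event count, so any non-zero value deserves a second look.
WATCHED = {5, 187, 197, 198, 199}


def nonzero_counters(text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    hits = {}
    for line in text.splitlines():
        fields = line.split()
        # A table row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


print(nonzero_counters(SMART_LINES))
```

With the lines above, only UDMA_CRC_Error_Count is reported. Whether that count is still growing is the interesting part, so it is worth comparing it again after the next dropout.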

>
>
>
> Maybe it is just a PCI controller error?

Maybe.

> How can I check it?

By testing a different card (different make).


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” (Minas Tirith))

On 2012-11-12 13:36, Dave Howorth wrote:
> So as well as changing the hardware,
> I’d suggest upgrading the software :)

Indeed. That’s why I suggested trying with a live CD before doing the
upgrade. :)


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” (Minas Tirith))

Carlos E. R. wrote:
> On 2012-11-12 11:56, fperal wrote:
>> But rebooting the computer mounts the disk again.
>
> And what if you disconnect and reconnect the cable? Or power the disk off and on?

It’s an internal disk, IIUC.

>> These are some errors in /var/log/messages:
>
> You have I/O errors. Nasty. I would start suspecting cabling.
> The disk disappears because of those errors, I think.

It’s a PCI-card controller connected to a SATA disk.
If the error were in the SATA cable or the disk power cable, the errors
would be logged by the ata subsystem, and these are not.

On 2012-11-12 14:47, Dave Howorth wrote:
> It’s a PCI-card controller connected to a SATA disk.
> If the error were in the SATA cable or the disk power cable, the errors
> would be logged by the ata subsystem and these are not.

Mmm. Maybe in the rest of the logs.
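
To check the rest of the logs, something like the following Python sketch could count kernel messages per subsystem. The regular expressions are assumptions about how libata, sd, block-layer and ext3 messages are prefixed (modelled on the lines quoted in this thread, plus the usual `ataN:` / `ataN.NN:` form), so adjust them to what the log actually contains.

```python
import re
from collections import Counter

# Assumed message prefixes; the first matching pattern wins.
PATTERNS = {
    "ata": re.compile(r"kernel: ata\d+(?:\.\d+)?:"),
    "scsi_disk": re.compile(r"kernel: sd \d+:\d+:\d+:\d+:"),
    "block": re.compile(r"kernel: (?:end_request|Buffer I/O error)"),
    "ext3": re.compile(r"kernel: (?:EXT3-fs|JBD)"),
}


def classify(lines):
    """Count log lines per subsystem; unmatched lines are ignored."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts


# A few lines copied from the log excerpt in the first post.
SAMPLE = [
    "Nov 11 23:44:49 aldebaran kernel: sd 2:0:0:0: [sdb] Sense Key : Aborted Command [current] [descriptor]",
    "Nov 11 23:44:49 aldebaran kernel: end_request: I/O error, dev sdb, sector 16515215",
    "Nov 11 23:44:49 aldebaran kernel: Buffer I/O error on device sdb1, logical block 30491274",
    "Nov 11 23:44:49 aldebaran kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal",
]

print(dict(classify(SAMPLE)))
```

To scan the real file, pass an open file object instead of `SAMPLE`, for example `classify(open("/var/log/messages"))`. No `ata` lines at all in the whole log would be in line with Dave's reading; `ata` lines around 23:44 would point back at the cable or the disk.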


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” (Minas Tirith))