A hard disk may be failing

Hello. This is my first post here. I have a Compaq nc6000 laptop on which I decided to install a Unix-like OS. After ruling out many distros unsuitable for a laptop, I came down to openSUSE. After the install, it says “A hard disk may be failing” blah blah blah, and if I run the SMART test it comes back with the same error. So I deleted all my ext4/swap partitions and created a FAT16 partition with fdisk, which I formatted. Using scandisk, Norton Disk Doctor (all from DOS) and the BIOS tool for checking the surface (which doesn’t even need the HDD to be partitioned), it came out that my HDD is perfectly sane. No bads. Even when I formatted, it didn’t report any bad sectors. (Right before SUSE I had XP, and I didn’t experience any data loss.) What is the problem?

It’s possible to get a false positive, but quite rare.
A hard disk doesn’t have to show bad sectors to be on the way out.

If you want to take your chances, sure, go ahead – but don’t be surprised if it fails down the track. And if you continue to use the hdd, make very sure to back up any important data.
You could also get the manufacturer’s diagnostic tools from their website and run those over the hard drive.

On 2010-12-07 05:36, editheraven wrote:

> What is the
> problem?

Not all problems are surface errors. The smartctl program would have
printed a full report if you asked.
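
For example (a minimal sketch, assuming the disk is /dev/sda and the
smartmontools package is installed):

# print the full SMART report: identity, overall health, vendor
# attributes, and the drive's internal error log
smartctl -a /dev/sda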

Or, you can download the test program from the HD manufacturer (typically a
boot floppy/cd image you burn yourself) and run the test.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

Trust smartctl - really.

Here is an article I wrote describing how to use smartctl and some key attributes to look for:
SMART Drive Diagnostics - Lyceum

If you have any - even 1 - pending or reallocated sector, this alone can be enough to cause you all manner of trouble.

Also, try testing it with hdparm -Tt /dev/sda and check the speed.
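
Roughly like this (the device name is an assumption; adjust it to your disk):

# -T times reads from the kernel's cache (no disk I/O at all);
# -t times sequential reads from the device itself
hdparm -Tt /dev/sda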

Cheers,
Lews Therin

Yes, but the only problem found by that program was that I had too many bad sectors. The others were good.

One experience of ignoring warnings can be enough: a customer lost 400GB of music (a DJ …) after following the advice to ‘ignore all the moaning from smartd’ instead of backing up and replacing the disk (my advice). A couple of minutes before this happened, there were ‘absolutely no issues on XP, except for some desktop items missing’. This example shows the risk you’re taking. Read Lews Therin’s article.
If I were you, I’d immediately get to work copying the data off the disk.

> Yes, but the only problem found by that program was that I had too many bad sectors. The others were good.

Very funny, made my day, excellent sense of humour

No. I mean every other program that scanned for bad sectors found none. Only the monitoring program under SUSE tells me that. That was my point.

On 2010-12-08 15:06, editheraven wrote:
>
> No. I mean every other program that scanned for bad sectors found none.
> Only the monitoring program under SUSE tells me that. That was my point.

That’s normal and to be expected.

The HD has some space reserved in advance to remap bad sectors; this is
done by the HD, not the operating system. When the OS checks again, it sees
no bad sectors, because they have been remapped. When you try to access
those, the disk head jumps to the factory reserved sectors and reads there.
It is transparent.

On remapping, I understand that the HD tries to copy the data, but the data
may already be damaged, in which case the remapped content is garbage.

You can only see that remapping and other errors by looking at the SMART
data. Tools like those you mentioned won’t show any of it. It is important
that you pay attention and use the tools we told you about.
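
For instance (assuming /dev/sda; these attribute numbers are the standard ones):

# list the vendor attributes; the interesting ones here are
# 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector and
# 198 Offline_Uncorrectable
smartctl -A /dev/sda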

A few bad sectors are to be expected⁽¹⁾; it depends on the cause. However,
if they continue growing, replace the disk.

(1) I have a disk (or two) around that developed some bad sectors a few
months after I bought them. I have since used that disk for around 20,000
hours more, without any further problem.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

Another way to view / monitor your smart attributes:

zypper install gnome-disk-utility

Then run palimpsest; there is a button to “view smart attributes”, and you can also test the drive performance. This is really no different from the smartctl commands referenced earlier - just a GUI for them.

As others have suggested it is just a matter of learning how to interpret the smart attributes, and knowing when “bad” is really “bad”. This is far better than relying on any utility to tell you this, as you will know yourself what the real deal is.

Lews Therin

I used to argue with disk utilities that claimed bad sectors, etc., especially when I could produce contradictory evidence by running other utilities. Then I happened to get seated next to an engineer from one of the premier disk manufacturers at some conference or other. Over lunch he gave me a detailed explanation of many different ways in which the surface of a disk can break down. The stuff he told me was incredibly detailed, and my brain’s aged too much since then for me to recall much more than the bottom line, which was: “if you’re seeing any more errors than you did the last time you checked, it’s not a question of IF the drive will fail, it’s WHEN.” When it does die it may go gradually, even over a period of months or years, or it can happen without warning in an instant.

Ever since then, whenever I see bad sectors, etc., I pull the drive and either discard it or put it in a non-critical sandbox PC.

Hey Lews, I just installed the utility for a look see – it’s great (but scary).

Lews, here is the Smart warning for my main laptop drive, should I be concerned?

http://www.swerdna.net.au/forumpics/smart.png

On 12/08/2010 08:06 PM, swerdna wrote:
>
> Lews, here is the Smart warning for my main laptop drive, should I be
> concerned?
>
> [image: http://www.swerdna.net.au/forumpics/smart.png]

I’m not Lews, but I would be worried only if the count of 2 starts increasing.
Those 2 are probably original as it is very difficult to make platters without
some defects. How old is that drive, and how many head load/unload cycles has it
undergone?

What Robin said is correct. Disks have an onboard processor that does some smart things (no pun intended). They can conceal a certain number of bad blocks; in fact some may already be remapped out of the factory, as lwfinger points out. So a check from the OS shows nothing bad, but asking the disk directly with smartctl gives you all the gory details. And yes, sometimes you get warnings in the form of rising bad-block counts, but you can also get a catastrophic failure with no warning at all. As always, back up anything of value to limit your loss.

It can be confusing, having to deal with smartness at various levels in the chain and knowing which level you are communicating with.
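
Asking the disk directly looks something like this (device name assumed):

# ask the drive's own firmware for its overall health verdict
smartctl -H /dev/sda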

Approx 450 days old.
Power on hours = 321 days
Load/unload cycle count = 862000

All attributes are either “good” or “N/A” except for the one “warning” attribute in the screenshot (reallocated sector count = 2)

Hey Swerdna,

There is cause to be a little concerned, but it is hard to tell just from that. Of course one can argue “there are millions of sectors on today’s platters - two failed means nothing.” However, in practice this is not a correct argument. The alternate argument would be: many drives operate for 40,000+ hours and have zero reallocated sectors, so the fact that you see any is a bad sign.

From a practical standpoint, I’ve worked in data centers where we had several thousand servers. Over a period of a few years, as drive capacities continued to increase (and perhaps overall quality decreased), we started seeing more drives develop bad sectors earlier, and fail sooner. It became painfully obvious when 320GB drives were the “new” ones - out of an order of, say, two dozen drives, easily 4 or 5 would fail on burn-in testing - regardless of vendor or model. They just plain sucked. (We would benchmark every new drive, run a full dd / dban on it, then bench it again. During this process alone, we would see immediate failures, sometimes with a few sectors, sometimes with hundreds.)

Then the 500GB drives came out, and this improved. Of course each new generation brings new things - vertical particle alignment then came along, again increasing density. Now, with 2TB drives we are starting to see the first 4k-sector Advanced Format drives, which yield a bit better density, and much better (and more efficient) CRC checks, etc.

The bottom line with 2 sectors is that it means just that - a write operation to two sectors failed, resulting in them being reallocated. (A read operation that fails marks the sector as pending reallocation.) So if you see either the “pending reallocation” or “reallocated” values increase any more, I would really suggest you just replace the drive now and dd it over - save yourself the trouble of a failure.

You can run smartctl -a /dev/sda and look at the “error log” section. With two reallocated sectors you will most likely have some entries in the log, which (should) also give the time (power-on hours) when they occurred. By comparing this to the power-on hours (attribute 9, usually) in the attributes table, you can see how long ago those errors occurred. That lets you know whether you have had any recent issues.
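
You can also pull the error log on its own (a sketch; device name assumed):

# show only the drive's internal error log; each entry is stamped
# with the power-on lifetime (in hours) at which it occurred
smartctl -l error /dev/sda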

Also, you can have the drive run a smart test on itself, and this is certainly worth doing:

smartctl -t long /dev/sda

After that completes, again check the error and test logs with smartctl -a and see what the results were.
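
The self-test log can also be read on its own (again assuming /dev/sda):

# show the self-test log: test type, status, lifetime hours and,
# if a test failed, the LBA of the first error
smartctl -l selftest /dev/sda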

Finally, test the performance with hdparm -tT /dev/sda, as mentioned earlier, and see what you get.

With the results of the above, you will have a very good idea of what the current health of the drive really is.

Then, of course, monitor it - you can use smartd and configure the scripts it comes with to notify you. Here are a few notes I have on using those scripts:
Smartd Drive Monitoring - Lyceum
And of course man smartd is great.
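
A minimal sketch of a smartd.conf entry (the device, schedule and mail target are just examples; man smartd.conf has the real syntax):

# /etc/smartd.conf - monitor /dev/sda, run a short self-test every
# Sunday between 01:00 and 02:00, and mail root on trouble
/dev/sda -a -o on -S on -s (S/../../7/01) -m root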

If you get errors saying that you have pending sectors, or the smart test fails, it will report the sector number. You can find out what inode is associated with that sector and what (if any) file has that inode, and then force the sector to be written to, following this procedure I wrote (mainly just as an exercise to understand this process better):
SMART Rewriting Bad Sectors - Lyceum
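
For illustration only, forcing the rewrite can look like this (the LBA is made up - the real one comes from the smartctl output - and writing the sector destroys whatever it held):

# confirm the sector really is unreadable, then overwrite it so the
# drive reallocates it; LBA 1234567 is a hypothetical example
hdparm --read-sector 1234567 /dev/sda
hdparm --write-sector 1234567 --yes-i-know-what-i-am-doing /dev/sda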

That’s sort of an overkill answer, but hard drive health and smart issues are an area of interest for me!

Cheers,
Lews Therin

I just saw your post on the power-on hours: 321 days would be about 7,704 hours (if running 24 x 7). Just as an aside, for the drive failure data we collected in the data center, we noticed a dramatic increase in failure incidence when power-on hours got to 35-40,000. We would retire drives at 45,000 hours, as it just was not worth keeping them in service; statistically they would very likely fail in the next 5,000 hours.

But they can fail with 1 hour on them too - so this is just another piece of data. It is quite common to have failures occur at < 500 hours and at > 45,000.

Isn’t it interesting that a five year warranty = 43,800 hours? Hehe . . .

Lews Therin

LewsTherinTelemon wrote:

> There is cause to be a little concerned, but it is hard to tell just
> from that. Of course one can argue “there are millions of sectors on
> today’s platters - two failed means nothing.” However, in practice this
> is not a correct argument. The alternate argument would be: many drives
> operate for 40,000+ hours and have zero reallocated sectors, so the fact
> that you see any is a bad sign.

Good thread!

After fighting disk drives since the days when IBM literally used hydraulics
on the head carriers and two essential tools were a Crescent wrench and a
wipe rag, it has been common to see some initial bad sectors due to
manufacturing defects in the plating or some other operation, with a few
more cropping up shortly after entering service as real estate that was
marginal in the first place but was missed by QC or incoming inspection.
Those types of error sectors are not a huge concern if the number stabilizes
at a small value after several days/weeks of run-in.

Ignoring the electronics issues, mechanical failures in rotating disks
(assuming you’re not dribbling them like a basketball) generally result from
wear on the moving parts, which reduces already ridiculous tolerances until
you eventually get a head crash generating debris inside the case. Failure
from that point on is pretty much a cascade, as each event increases the
likelihood of another. When a drive suddenly starts to show remapped sectors
after initial run-in, it’s time to replace it BEFORE it craps out, because
it probably will - in short order.

OTOH, sudden failures tend to be electronic in nature - I always made a
point of sticking to a single source or at least a limited number of
suppliers. You’d be surprised at how many dead drives came alive as soon as
you swapped in a good controller board, so it’s nice to have a couple of
spares of the same models as what you have in use. Much cheaper than
recovery shops! There will always be the catastrophic failure that leaves
you with a steaming pile, but Lews gives really good advice on what I would
call best practice.

Considering the mechanics involved, 45k hours is actually pretty amazing to
me. Not many rotating machines can claim that kind of life, especially at
the speeds involved in modern drives.


Will Honea

On 2010-12-09 05:06, swerdna wrote:

> Approx 450 days old.
> Power on hours = 321 days

I.e., 7704 hours. That’s not much.

> Load/unload cycle count = 862000

That’s a lot. No, wait… which attribute is that one? I thought it was
“Power_Cycle_Count”, but I have “Start_Stop_Count”.

> All attributes are either “good” or “N/A” except for the one “warning”
> attribute in the screenshot (reallocated sector count = 2)

You can get another clue from “smartctl -a /dev/whatever”. It displays all
that, plus a log of the last errors and the hour at which they happened. If
you see that the last error occurred at 300 hours, and the disk has 3000,
then it is irrelevant, it happened long ago. If the errors were at 2950,
then watch the disk closely.
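
A quick way to see both numbers together (a sketch; assumes /dev/sda):

# attribute 9 is the total power-on hours; error log entries read
# "Error N occurred at disk power-on lifetime: NNNN hours"
smartctl -a /dev/sda | grep -E 'Power_On_Hours|power-on lifetime'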


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)