I had configured smartd to run self-tests every day.
It now sends me a mail every day containing this:
This message was generated by the smartd daemon running on:
host name: xxx
DNS domain: xxx
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], Self-Test Log error count increased from 11 to 12
Device info:
M4-CT128M4SSD2, S/N:000000001238091641E5, WWN:5-00a075-1091641e5, FW:000F, 128 GB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Oct 20 09:03:55 2017 CEST
Another message will be sent in 24 hours if the problem persists.
That report doesn't look too bad… I've seen a lot worse.
Pre-fail and Old_age are categories of attributes; they're not indicating that either is imminent.
Ignore any RAW values; since they're vendor-specific there's no general way to interpret them (which is why smartmontools maintains a drive database with interpretations for those values). You can find a description of the meanings of many of the flags here: http://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes
What’s actually being recorded in the Self Test log?
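For instance, you could post the output of these (assuming /dev/sda as in the mail above, run as root):

# the self-test log that smartd is counting errors from
smartctl -l selftest /dev/sda
# the drive's error log and attribute table, for context
smartctl -l error /dev/sda
smartctl -A /dev/sda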
On Sat 04 Nov 2017 11:36:01 AM CDT, Christophe deR wrote:
I had configured smartd to run self-tests every day.
It now sends me a mail every day containing this:
<snip>
What should I do?
Isn't there a trim utility somewhere that would cure this problem?
Should I use some vendor utility to repair this?
Hi
Why a test every day, and why test at all unless there is an issue?
Balance and trim are taken care of automagically if running btrfs (and
others?).
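If you do want scheduled tests, you can tone the schedule down in /etc/smartd.conf instead of testing daily; for example (the device, times and mail target below are only illustrative):

# short test Sundays at 02:00, long test on the 1st of each month at 03:00
/dev/sda -a -o on -S on -s (S/../../7/02|L/../01/./03) -m root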
–
Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
openSUSE Leap 42.2|GNOME 3.20.2|4.4.90-18.32-default
What's the matter? Unless I'm going blind, I see that most parameters are still at 100%. The only two that aren't are “173 Wear_Leveling_Count” and “202 Perc_Rated_Life_Used”, both at 88%.
If I'm still good at math, that translates to something like more than 13 years of life remaining, assuming 24/7 operation. In other words, your system is likely to go to the junkyard before the disk is worn out at your current usage.
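If you want to keep an eye on those two counters yourself (attribute names taken from your report, device assumed to be /dev/sda):

smartctl -A /dev/sda | grep -E 'Wear_Leveling_Count|Perc_Rated_Life_Used'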
“Pre-fail” and “Old Age” are types of counters, not a status.
So you're completely mis-reading the meaning: there is no meaning there; they're categories of counters and would carry the same importance if they were called “Bob” and “Joe”.
Similarly, the numbers displayed like 100 and 88 don't really have any meaning; I don't think they necessarily represent a percentage of anything (some might, but again, usually not important).
A lot of what is displayed in a smartmontools report has little or no importance. Sure, it's interesting to know how many hours the disk has been powered on, and how many lifetime writes it has seen. Those are things I look at only when purchasing a used disk, to guesstimate the amount of wear and tear it has already undergone (you can't trust what a seller says, only what is reported by smartmontools, because that can't be altered except by the factory).
More importantly, the only real indication of imminent disk failure is bad sectors, and, if you are able to compare results over a few weeks, whether the number of bad sectors is increasing.
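A simple way to track that over time (the device name /dev/sda and the dated file names are just examples):

# save a dated snapshot of the attribute table...
smartctl -A /dev/sda > /root/smart-$(date +%F).txt
# ...and later compare two snapshots to see whether the counts moved
diff /root/smart-2017-11-05.txt /root/smart-2017-11-12.txt
# the sector-related attributes are the ones to watch
smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'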
And then, if you have a disk that's going bad:
Get your data off that disk ASAP.
Research your disk and disk manufacturer. For example, Seagate (with its less than sterling reliability ratings) has a SeaTools utility that can try to repair bad sectors by first trying to write to them, then either reporting them as healthy and usable or mapping them out permanently (all disk drives have a reserve of spare sectors).
What I've described applies to both HDDs and SSDs, the latter because of their HDD layout emulation.
As for the stuff about trim, that's something else that's only tangentially relevant in that it has something to do with SSD cells (“traps”), but it has nothing to do with what is in your smartmontools report.
I had btrfs until 2 weeks ago, when the system failed.
Btrfs with 37 GB for a system is dangerous, and I had already reported problems of low space and very frequent system freezes and failures (see my previous topics).
I had to re-install the whole system, and I lost some data, not vital but important.
I swore at many people during those bad days.
I used ext4 for my new install (btrfs = never again).
But I have these strange SMART reports that I did not have before…
Due to btrfs over-strain?
Or did disk failures cause the btrfs mess?
Don't know.
Hi
How is the SSD connected? If it's in a laptop, does it have a cable/header or just a header connection? Sometimes cable connections can cause things like freezes; check the system logs as well.
SSDs fail differently from spinning rust. With an HDD you will most likely see bad sectors before complete failure. With an SSD you will run out of memory pool and not be able to write, or it just stops working. So the failure modes are not the same as an HDD. With SMART self-tests enabled there are more writes, and it is writing that will kill an SSD.
--root@xxx 13:56:36 /home/christophe] mount
....
/dev/sda4 on / type ext4 (rw,relatime,data=ordered)
.....
--root@xxx 13:56:46 /home/christophe] systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
Loaded: loaded (/usr/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
Active: active (waiting) since dim. 2017-11-05 08:05:59 CET; 5h 51min ago
Docs: man:fstrim
nov. 05 08:05:59 diesel systemd[1]: Started Discard unused blocks once a week.
And there are numerous forum threads by others who have asked the same question.
First, it should be understood that a few bad HDD blocks are not unusual, and ordinarily the disk will map the bad blocks out to good blocks in the disk's reserve.
On an SSD, however, I'm not as certain how expected bad “blocks” are, since the concept of a block is a virtual layout on top of a different physical structure. But, assuming faulty cells (traps) can be identified, they should be mapped out automatically in the same way.
There are a few things I'd note in this thread…
The original and full self-test output has never been posted, so who knows whether the tests are even valid?
Whether the @OP intended to or not, the disk model was posted, which definitely identifies the disk as an SSD.
The Wiki reference describes how to handle suspected bad blocks… Basically, you just force a write to the sector, which makes the disk firmware do its thing (mapping out the bad block). In practice, people often use the dd command to zero out all free space, which pretty much ensures writing to every available block. Then run the self-test again to verify the block(s) aren't still reported bad. I don't know, though, that forcing all bad blocks to be mapped out immediately is necessary unless you just want peace of mind and that extra assurance that the disk is good… It's probably no better than just allowing your normal disk activity to eventually encounter the bad blocks, at which point the disk will address the matter itself. As I described in my earlier post, probably the most important thing to note is the bad block count reported.
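For what it's worth, the recipe usually looks roughly like this; a sketch only (the file name /zerofill is arbitrary, the filesystem will temporarily fill up, and dd is expected to stop with “No space left on device”):

# write zeros into all free space, then remove the fill file
dd if=/dev/zero of=/zerofill bs=1M
sync
rm /zerofill
# start a long self-test, then check the log once it has finished
smartctl -t long /dev/sda
smartctl -l selftest /dev/sda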
AFAIK balance is not necessary for ext4.
Trim is needed for an SSD even with ext4, but apparently you already have fstrim run once a week, which is more than enough unless your system is a heavy-duty database server or the like. As an alternative, for a laptop say, you could run fstrim on each boot.
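You can also trigger a pass by hand and see how much gets discarded, or look at what the weekly timer did last time (assuming ext4 on /, as in your mount output):

# manual trim pass; -v reports how many bytes were discarded
fstrim -v /
# when the timer last ran and when it fires next
systemctl list-timers fstrim.timer
# log of previous runs
journalctl -u fstrim.service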
Like many other threads on the forum suggest, I bet that your btrfs problems were just “disk full” problems and not “bad disk” or “bad sector” problems, so not really a “disk failure” but possibly a “faulty btrfs maintenance” problem.
Btrfs does strain the disk a bit more than ext4, but not to the point of being worried about with modern SSDs. Anyway, you are now on ext4, so don't worry (but Malcolm could say the same about correctly maintained btrfs…).
> Btrfs with 37 GB for a system is dangerous, and I had already reported
> problems of low space and very frequent system freezes and failures
> (see my previous topics)
Btrfs is still a work in progress; for example, on recent versions the YaST installer reduces the number of snapshots when there is little space on the disk.
The kernel (or the system, I don't know which) is able to see that the disk is an SSD and does its job accordingly. I never do anything special, sure that I'm not smart enough for this.
Thanks again to everyone.
I appreciate your concern and your input.
I think I will wait and see.
I am prepared to have a system failure at any time, and I do a daily copy/save of my work and personal data in case my system suddenly fails.
I agree that, from the problem description, the problem was most likely the disk filling up with snapshots (one of these days, someone will write a snapper configuration that adds a parameter taking available free disk space into account).
As for the @OP's questions about things like balance and trim (both of which need to be manually configured for both btrfs and ext4), I've posted in other openSUSE forums links to the two following Arch Wiki articles, which apply completely to openSUSE as well (all versions). Re-read these articles every time you install a new system; the information is always being updated to whatever is current.
The main **SSD Wiki**
On a new install, I always recommend skipping to the last section first to see if you need to install a firmware upgrade. After that, configure fstab for trim (IMO better than a cron job) and modify your disk queueing algorithm; everything else has less of an impact than these. If you do run a cron job, my recommendation is every 48 hours, or, as OrsoBruno suggests, on boot if you boot at least every other day. The fuller your disk is, the more often you need to run trim; if your disk space is hardly used you can run trim far less often. But in general it's better to run it too often than not often enough (which is why I recommend the fstab configuration).
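For reference, the two changes I'm talking about look roughly like this (the UUID is a placeholder, and the scheduler write is not persistent across reboots; a udev rule or boot script is needed for that):

# /etc/fstab: continuous TRIM via the 'discard' mount option
UUID=xxxx-xxxx  /  ext4  defaults,discard  0  1

# switch the SSD's I/O scheduler to deadline (takes effect immediately)
echo deadline > /sys/block/sda/queue/scheduler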
Hi
The current openSUSE install sets up udev rules for the I/O scheduler (cfq/deadline) depending on spinning rust or SSD. Check the mount options; many optimizations are done by default, then add more as required to fstab. Trim is taken care of by the systemd service.
If running btrfs, check that the maintenance systemd service has been run.
If using btrfs and you select a partition size <= 20 GB, then snapper is disabled.
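To check what the install actually set up on your machine (device name is just an example):

# the active scheduler is shown in brackets, e.g. "noop [deadline] cfq"
cat /sys/block/sda/queue/scheduler
# mount options currently in effect for /
findmnt -o TARGET,FSTYPE,OPTIONS /
# state of the trim timer
systemctl status fstrim.timer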