Tools and methods to check and assess SSD health?

VariableStar · August 26, 2023, 12:41pm

I wonder what tools are available and most commonly used, in Linux in general but openSUSE in particular, to check the health of a computer’s solid state drives.

I have no particular problems at all; I just want to know what means, if any, are at hand to do a health check should the need arise.

Thanks in advance

malcolmlewis · August 26, 2023, 1:14pm

@VariableStar Hi, look at using smartctl -a /dev/[sdX,nvmeN]
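
Besides its printed report, smartctl signals problems through its exit status, which is a bitmask described in the smartctl(8) man page. The following is a minimal sketch of decoding that status; the function name decode_smartctl_status and the device /dev/sda are placeholders chosen for illustration, not something taken from the post above.

```shell
#!/bin/sh
# Decode the bitmask that smartctl returns as its exit status.
# Bit meanings follow the "EXIT STATUS" section of smartctl(8).
decode_smartctl_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    i=0
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "SMART command failed or checksum error" \
        "SMART status: DISK FAILING" \
        "prefail attribute at or below threshold" \
        "attribute was at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Print the message when bit i is set in the status.
        if [ $(( (rc >> i) & 1 )) -eq 1 ]; then
            echo "$msg"
        fi
        i=$((i + 1))
    done
}

# Hypothetical usage; /dev/sda is a placeholder device name:
#   sudo smartctl -H /dev/sda; decode_smartctl_status $?
```

Several bits can be set at once, so the function prints one line per set bit.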

marel · September 2, 2023, 6:54am

I find smartctl hard to read and prefer using skdump, which gives more guidance on what is going on.

You have to run it as root:

sudo skdump /dev/[sdX,nvmeN]

karlmistelberger · September 2, 2023, 8:33am

Users may want to check their drives periodically:

erlangen:~ # systemctl status btrfs-scrub.timer
● btrfs-scrub.timer - Scrub btrfs filesystem, verify block checksums
     Loaded: loaded (/usr/lib/systemd/system/btrfs-scrub.timer; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/btrfs-scrub.timer.d
             └─schedule.conf
     Active: active (waiting) since Thu 2023-08-31 21:14:54 CEST; 1 day 13h ago
    Trigger: Sun 2023-10-01 00:00:00 CEST; 4 weeks 0 days left
   Triggers: ● btrfs-scrub.service
       Docs: man:btrfs-scrub

Aug 31 21:14:54 erlangen systemd[1]: Started Scrub btrfs filesystem, verify block checksums.
erlangen:~ #

On infamous host erlangen btrfs-scrub.service runs monthly.

erlangen:~ # journalctl -b -u btrfs-scrub.service 
Sep 01 03:57:55 erlangen systemd[1]: Started Scrub btrfs filesystem, verify block checksums.
Sep 01 03:57:55 erlangen btrfs-scrub.sh[15554]: Running scrub on /
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: Scrub device /dev/nvme1n1p2 (id 1) done
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: Scrub started:    Fri Sep  1 03:57:55 2023
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: Status:           finished
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: Duration:         0:04:09
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: Total to scrub:   594.07GiB
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: Rate:             2.31GiB/s
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: Error summary:    no errors found
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: flock: getting lock took 0.000003 seconds
Sep 01 04:02:04 erlangen btrfs-scrub.sh[15554]: flock: executing btrfs
Sep 01 04:02:04 erlangen systemd[1]: btrfs-scrub.service: Deactivated successfully.
Sep 01 04:02:04 erlangen systemd[1]: btrfs-scrub.service: Consumed 32.886s CPU time.
erlangen:~ #

See also: Scrub — BTRFS documentation

hui · September 2, 2023, 10:06am

btrfs scrub is useless when you want to check your SSD health (hardware). btrfs scrub is only a filesystem level tool…

karlmistelberger · September 2, 2023, 11:36am

Nope. I don’t agree.

hui · September 2, 2023, 12:29pm

You don’t agree but can’t state any facts. If you had read your own article, you would agree…

It really only checks checksums of data and tree blocks, it doesn’t ensure the content of tree blocks is valid and consistent.

btrfs scrub is not able to monitor hardware health.

If you disagree, then explain how you use btrfs scrub to monitor SSD/HDD health indicators such as:

temperatures (min/max/act)
lifetime power-on resets
power on time
power cycle count
number of write commands
number of read commands
total bad blocks
erase count
online/offline short/extended self-tests
error logs
life curve status
media wear out indicator
reserved block count
and many more…

All of the above is information provided by real hardware monitoring (health) tools like smartctl or skdump (and the various GUI apps for them, like GSmartControl).
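
As a rough illustration of how such indicators can be read from a script, the sketch below pulls a single raw value out of the attribute table that `smartctl -A` prints for ATA/SATA drives. It assumes the usual ten-column table layout (ID#, ATTRIBUTE_NAME, ..., RAW_VALUE); NVMe drives print a differently shaped report, and attribute names vary between vendors. The function name smart_attr and the device /dev/sda are placeholders.

```shell
#!/bin/sh
# Print the first token of the RAW_VALUE column for one named attribute.
# Expects `smartctl -A` output for an ATA/SATA drive on stdin.
smart_attr() {
    # Column 2 is ATTRIBUTE_NAME, column 10 is the start of RAW_VALUE.
    awk -v name="$1" '$2 == name { print $10 }'
}

# Hypothetical usage; the device and attribute names are placeholders
# and differ between drives and vendors:
#   sudo smartctl -A /dev/sda | smart_attr Power_On_Hours
#   sudo smartctl -A /dev/sda | smart_attr Power_Cycle_Count
```

Raw values that contain spaces (some temperature attributes append min/max figures) are cut at the first space by this sketch.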

karlmistelberger · September 3, 2023, 7:52am

btrfs scrub checksums all blocks. Issues of the hardware below the file system will result in inconsistent checksums.

Several hundred thousand power-on hours of HDDs and SSDs in the infamous host erlangen and its siblings have shown that there are hardware problems which smartctl and similar tools do not detect. The Swiss cheese model applies: Swiss cheese model - Wikipedia

btrfs users are advised to run btrfs scrub regularly. They may want to identify and remove the root cause of issues encountered by btrfs-scrub.service.
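
Checksum and I/O errors that btrfs encounters are also kept as per-device counters, which `btrfs device stats` prints one per line. A minimal sketch of turning those counters into a single number is shown here; it assumes each output line ends in the counter value, and the function name btrfs_error_total is made up for this example.

```shell
#!/bin/sh
# Sum all counters from `btrfs device stats` output read on stdin.
# Each line is expected to look like "[/dev/...].<counter_name>   <value>".
btrfs_error_total() {
    # $NF is the last field on the line, i.e. the counter value.
    awk '{ total += $NF } END { print total + 0 }'
}

# Hypothetical usage (as root); "/" is the mount point scrubbed in the log above:
#   btrfs device stats / | btrfs_error_total
```

A result other than 0 means at least one read, write, flush, corruption or generation error has been recorded and is worth investigating.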

Switched scrub from monthly to weekly.

erlangen:~ # systemctl cat btrfs-scrub.timer 
# /usr/lib/systemd/system/btrfs-scrub.timer
[Unit]
Description=Scrub btrfs filesystem, verify block checksums
Documentation=man:btrfs-scrub

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/btrfs-scrub.timer.d/schedule.conf
[Timer]
OnCalendar=weekly
erlangen:~ #

erlangen:~ # systemctl status btrfs-scrub.timer 
● btrfs-scrub.timer - Scrub btrfs filesystem, verify block checksums
     Loaded: loaded (/usr/lib/systemd/system/btrfs-scrub.timer; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/btrfs-scrub.timer.d
             └─schedule.conf
     Active: active (waiting) since Sun 2023-09-03 06:19:32 CEST; 3h 25min ago
    Trigger: Mon 2023-09-04 00:00:00 CEST; 14h left
   Triggers: ● btrfs-scrub.service
       Docs: man:btrfs-scrub

Sep 03 06:19:32 erlangen systemd[1]: Started Scrub btrfs filesystem, verify block checksums.
erlangen:~ #
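
A drop-in like the schedule.conf shown above could be written with a few lines of shell. This is only a sketch: the function name write_scrub_dropin is made up, and the empty `OnCalendar=` line is an addition that the drop-in in the output above does not contain. According to systemd.timer(5), OnCalendar= entries accumulate, and assigning the empty string resets the list, so the extra line makes the weekly schedule replace the packaged monthly one instead of being added to it.

```shell
#!/bin/sh
# Write a drop-in that replaces the packaged monthly schedule with a weekly one.
# The target directory defaults to the drop-in path shown in the systemctl output.
write_scrub_dropin() {
    dropin_dir=${1:-/etc/systemd/system/btrfs-scrub.timer.d}
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/schedule.conf" <<'EOF'
[Timer]
# The empty assignment clears the OnCalendar= list inherited from the unit file.
OnCalendar=
OnCalendar=weekly
EOF
}

# Hypothetical usage (as root):
#   write_scrub_dropin && systemctl daemon-reload && systemctl restart btrfs-scrub.timer
```

`systemctl edit btrfs-scrub.timer` opens an editor on a drop-in of its own and reloads systemd afterwards, which avoids writing the file by hand.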

hui · September 3, 2023, 8:10am

Again no facts, ignoring questions and showing only some random output from ridiculous hosts. So again:

How do you use btrfs scrub to monitor SSD/HDD health indicators such as:

temperatures (min/max/act)
lifetime power-on resets
power on time
power cycle count
number of write commands
number of read commands
total bad blocks
erase count
online/offline short/extended self-tests
error logs
life curve status
media wear out indicator
reserved block count
and many more…

Additionally, you ignore the fact that experienced users may not use btrfs and may set up their machines in a more professional way that suits their needs by using another filesystem. So how does your btrfs scrub work on xfs, ext4 or any other filesystem in current use?

The OP didn’t say which filesystem he is using but asked for ways to check SSD (HDD) health in general. That means recommending a filesystem-level tool that only works with btrfs and doesn’t even check basic hardware indicators (S.M.A.R.T.) is useless and off topic.