Leap & TW: Enormously high write amplification with BTRFS

Based on this work (in Russian, commands in English): https://habr.com/ru/post/476414/
I waited for a translation, but no luck.

Main result: when writing just a few kilobytes of data to disk, BTRFS actually writes several megabytes (about 3 MiB).

With such activity you may exceed the drive's TBW value and lose the warranty for an SSD.
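The article's measurement can be reproduced roughly by comparing the sectors-written counter in /proc/diskstats before and after a small synced write; a sketch, assuming a BTRFS mount point and an NVMe device (both names are examples):

    # sketch: observe how many bytes the device actually receives for a tiny
    # synced application-level write (device name and mount point are examples)
    DEV=nvme0n1
    before=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)   # sectors written so far
    dd if=/dev/zero of=/mnt/btrfs/small-file bs=4K count=1 conv=fsync 2>/dev/null
    sync
    after=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)
    echo "$(( (after - before) * 512 )) bytes written to $DEV for a 4 KiB file"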

Conclusions:

  1. Do not use BTRFS on weak flash: SD cards, QLC SSDs, etc.
  2. Be careful with processes that constantly write small amounts of data to an SSD with BTRFS.

I agree and add another point:
3. Make sure all SSD-backed partitions are properly aligned. YaST's partitioner does a good job of aligning partitions while it creates them, but you may already be using SSDs with non-aligned partitions somewhere else. As the creator and maintainer of the ext4 filesystem, Theodore Ts'o, wrote about the problem of write amplification with respect to SSD block boundaries:

The initial round of 4k sector disks will emulate 512 byte disks, but if the partitions are not 4k aligned, then the disk will end up doing a read/modify/write on two internal 4k sectors for each singleton 4k file system write, and that would be unfortunate.
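A quick way to check whether an existing partition is aligned is parted's align-check, or simply verifying that the start sector is a multiple of 2048 (1 MiB); a sketch with example device names:

    # report whether partition 1 starts on an optimal boundary (example device)
    sudo parted /dev/sda align-check optimal 1

    # or check manually: a start sector that is a multiple of 2048 (1 MiB) is safe
    cat /sys/block/sda/sda1/start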

As to minimizing writes to the SSD in general, I might have gone overboard a bit:

  • If any program uses cache directories (browsers, Google Maps, Steam, thumbnail tools, etc.), I've searched for (and usually found) ways to disable that on-disk caching.
  • I never use swap partitions on my SSDs. Use enough physical RAM (8 GB is plenty for me), and let wasteful/buggy programs fail early. Currently, I haven't even compiled any swap functionality into my 5.9.9 custom kernel; it also doesn't know how to unload modules and supports only the ext4 filesystem, but that's another story of [over-?] optimization; it's a hobby!
  • I also don't use (U)EFI boot partitions on my home gear; it's an MBR layout on SSDs only for me: 1 (one) aligned ext4 partition per SSD. (Only exception: 2 MacBooks with the usual OpenFirmware-based boot manager stuff.)
  • For boot optimization purposes, frequent runs of dracut have been necessary during months and years of testing. That's nice to do on an SSD because of the quick turn-arounds, but it also writes around 50 MB of initrd to the SSD every time. My custom kernel knows how to patch CPU microcode and mount its ext4 partition without an initrd; this resulted in far fewer initrd-/boot-related writes.
  • I don't let systemd (via its fstrim.timer) or ext4 (via the discard mount option) run fstrim on my SSDs; fstrim would happen too often if I let them, and too-frequent fstrim runs are reportedly suboptimal for SSD longevity. I prefer to launch fstrim manually as root once a month: each time before my monthly backup runs, I empty /tmp, ~/.cache/ and so on; then I back up my data; then I do fstrim. Once per month, no exception. (This routine is sketched below the list.)
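For reference, that monthly routine boils down to something like this (a sketch; the timer unit only exists if your distribution ships it, and the mount points are examples):

    # stop periodic/automatic trims, if the fstrim.timer unit is present
    sudo systemctl disable --now fstrim.timer

    # once a month, after cleaning caches and running the backup:
    sudo fstrim -v /        # trim free space on the root filesystem
    sudo fstrim -v /home    # repeat for every mounted SSD filesystem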

Forgot to add another rule:

  4. Leave enough free space on the SSD for its firmware to do its discard/wear-levelling runs effectively.

Now I've read that one possibility would be a large and rarely used swap partition, but I doubt that. I know of no mechanism by which the in-kernel swapping code or the swapon program could tell the SSD controller that the swap space is empty and ready for wear-levelling. In contrast, fstrim does exactly that: it collects all free space (especially space recently freed by deleted files) on the filesystem. It can do that because it can request that information from the filesystem driver (ext4 in my case). I don't know of any comparable mechanism for swap.

So I doubt that it's OK to fill the SSD to 100% with data while leaving a swap partition for wear-levelling. I just try to stay under 50% occupied storage space for small SSDs and under 80% for larger ones.

Another interesting read by Ted Ts’o I’ve just found: Should Filesystems Be Optimized for SSD’s?

Cheers!

man 8 swapon
/discard
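That is, swapon(8) does document a discard facility for swap space; a minimal sketch of enabling it (the device path is an example):

    # one-off: activate swap with discard enabled (example device)
    sudo swapon --discard /dev/sdb2

    # or persistently via /etc/fstab:
    /dev/sdb2   none   swap   defaults,discard   0 0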

Hi
If you are all that worried, then it is likely better to use spinning rust… it's a bit like "hey, my system is using all the installed RAM".

My main desktop:
nvme: PoH 11,278, written 0.604 GB/h
ssd: PoH 9,966, written 0.201 GB/h
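Figures like these can be derived from SMART data; a sketch for an NVMe drive (device name is an example), where "Data Units Written" are reported in units of 512,000 bytes:

    # power-on hours and total data written, as reported by the drive (example device)
    sudo smartctl -a /dev/nvme0 | grep -E 'Power On Hours|Data Units Written'

    # GB written per power-on hour = data_units * 512,000 bytes / 1e9 / hours
    awk 'BEGIN { printf "%.3f GB/h\n", 13300000 * 0.000512 / 11278 }'   # example numbers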

RAM is not breaking from too much usage.

Hi
Neither are btrfs and the SSDs that I have seen. I've been running SLES, SLED and openSUSE on SSDs since btrfs first came out; there were big issues with snapper in the beginning, but I have not lost one drive to writing too much…

Now, having said that, I would still steer clear of certain brands that have had issues and are still on the blacklist for queued trim etc…

At line 3744 onwards for the blacklist: https://github.com/torvalds/linux/blob/master/drivers/ata/libata-core.c

Always worth a check of the above before proceeding to purchase…
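One way to run that check (a sketch; the device is an example) is to read the drive's model string and grep the kernel's quirk table, where queued-TRIM offenders are flagged with ATA_HORKAGE_NO_NCQ_TRIM:

    # model string of the drive in question (example device)
    sudo smartctl -i /dev/sda | grep 'Device Model'

    # inside a checkout of the kernel sources: list the queued-TRIM blacklist entries
    grep -n 'NO_NCQ_TRIM' drivers/ata/libata-core.c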

Generally I agree with both conclusions.

Quick links:

IMO the paper does a better job of bringing up the concern of write/read/space amplification, since it discusses different filesystems and also does macro-benchmarks. The article presents some useful tools to diagnose the system, though.

My main desktop shows ~6 TBW after a year of usage for my drive, which is warrantied for 300 TBW. I do have swap on it for hibernation purposes; hibernation is dead slow on an HDD otherwise. The swap sits unused most of the time. The Firefox cache is currently stored in a regular CoW directory (which is where it performs the worst). I could disable caching to disk in about:config as I did in the very beginning; I don't recall why I restored it. Lately it has increased the TBW count, which I suppose is from heavier container usage. Still, I don't foresee trouble with this setup, apart from a lack of space :frowning:
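If the cache stays on disk, a common mitigation on btrfs is to mark the cache directory No_COW so small rewrites happen in place; a sketch with an example path (the attribute only affects files created after it is set):

    # recreate the cache directory with copy-on-write disabled (example path)
    mkdir -p ~/.cache/mozilla
    chattr +C ~/.cache/mozilla    # No_COW; applies only to files created afterwards
    lsattr -d ~/.cache/mozilla    # the 'C' flag should now be listed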

No, not really or necessarily.
5 years ago there were at least 3 popular ways OEMs reserved and used flash for various housekeeping, like wear levelling and keeping enough reserve space for garbage collection.
Only if you were one of the unlucky ones (or maybe didn't do your pre-purchase homework?) would you need to actively manage your lower-level SSD functionality.
Most users won't need to think about these things.

TSU

The Samsung model 950 PRO 512 GB from August 2016 has a slightly higher value: 16.6 TB written in 16,924 hours = 0.981 GB/h.

Warranty is 5 years / 400 TBW. Assuming a write amplification of 32×, the drive would already have reached end of life, but:

Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%

I am confident the drive will last for some more years.
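For the arithmetic behind that: 16.6 TB of host writes at an assumed write amplification factor of 32 would mean over 500 TB of NAND writes, well past the 400 TBW warranty figure, which the health data above clearly rules out:

    # host writes * assumed write amplification factor vs. the warranty limit
    awk 'BEGIN { printf "%.0f TB of NAND writes (warranty: 400 TBW)\n", 16.6 * 32 }'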

Hi
Likewise, I prefer to use my computer(s) as a tool, not something to constantly worry about (apart from backups); like I said, real-life experience, no issues…

Mine is the same…

Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%

https://lore.kernel.org/linux-btrfs/20210121222051.GB4626@dread.disaster.area/T/#u

Of course, this is a rather extreme workload.

I ran the referenced article through a translator (not any of the comments running down the page) and don't see any write amplification mentioned, so is write amplification only an issue you ran into, and not in the article?

Based on the various benchmarks I've read, no one has mentioned any write amplification related to BTRFS that I can remember.
The main possible cause is what was mentioned earlier in this post… mis-alignment. And, of course, the more fragmented your files are, the more write amplification you're going to get.
The only extra latency/workload you might see related to BTRFS should be related to its journaling, and generally people would notice it only if they're running a high-performance workload like first-person-shooter online gaming. I wouldn't expect the extra latency when making backups to be too bothersome.
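If fragmentation is the suspected culprit, btrfs can report and reduce it per file or per tree; a minimal sketch with example paths (note that defragmenting can unshare reflinked or snapshotted extents, so use it selectively):

    # count the extents of a heavily rewritten file (example path)
    filefrag -v /var/lib/machines/example.raw

    # defragment a directory tree, recompressing with zstd along the way
    sudo btrfs filesystem defragment -r -v -czstd /home/user/.cache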

Some personal opinions about what is described in the article…
By bit-copying with dd, an exact copy is being made as a backup… complete with deleted data, temporary files, etc., because his script doesn't do any housekeeping and prep before copying.
IIRC deduplication is already done to some degree in BTRFS; I suppose the details matter for whatever the Russian author was looking for.
For creating backups, I highly recommend using tar instead of dd; you'll benefit greatly from numerous features like compression.
The Russian author chose to set up crontabs. You might consider systemd timer tasks instead (a sketch follows).
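A minimal sketch of such a timer (unit names, schedule, and paths are made up for illustration):

    # /etc/systemd/system/home-backup.service  (hypothetical unit)
    [Unit]
    Description=Weekly tar backup of /home

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/tar --zstd -cf /backup/home.tar.zst /home

    # /etc/systemd/system/home-backup.timer  (hypothetical unit)
    [Unit]
    Description=Schedule the weekly tar backup

    [Timer]
    OnCalendar=weekly
    Persistent=true

    [Install]
    WantedBy=timers.target

Enable it with systemctl enable --now home-backup.timer; systemctl list-timers then shows when the next run is due.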

TSU