Samsung 860 EVO SSD with AMD controller - have to disable NCQ?

Is anyone running Tumbleweed with an AMD controller and the Samsung 860 EVO SSD? If yes, is it working OK with NCQ enabled?

Having just migrated my /home to a Samsung 860 EVO SSD, I started seeing errors such as:

Jan 30 13:18:48 kosmos1 kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Jan 30 13:18:48 kosmos1 kernel: ata4.00: cmd 61/38:10:18:7a:d6/00:00:03:00:00/40 tag 2 ncq dma 28672 out
                                        res 40/00:58:78:67:d6/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Jan 30 13:18:48 kosmos1 kernel: ata4.00: status: { DRDY }
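
For anyone wanting to check how often these show up on their own machine, here is a rough sketch that counts failed NCQ commands per ATA device. The function name `count_ncq_failures` is made up for illustration, and it assumes the log lines look like the excerpt above (the format `journalctl -k` prints on my system).

```shell
# Count failed NCQ (FPDMA QUEUED) commands per ATA device.
# Reads kernel log text on stdin; assumes lines shaped like
#   "... kernel: ata4.00: failed command: WRITE FPDMA QUEUED"
count_ncq_failures() {
    grep -E 'failed command: (READ|WRITE) FPDMA QUEUED' |
        sed -E 's/.*kernel: (ata[0-9]+\.[0-9]+): failed command.*/\1/' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage:
#   journalctl -k | count_ncq_failures
```

Each output line is a device name followed by its error count, so a drive that only errors once a week is easy to tell apart from one that errors every few minutes.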

The errors may be related to AMD controllers and the 860 EVO. In my case that would be the AMD 970/SB950.

https://bugzilla.kernel.org/show_bug.cgi?id=201693
https://wiki.archlinux.org/index.php/Solid_state_drive#Resolving_NCQ_errors
https://eu.community.samsung.com/t5/Cameras-IT-Everything-Else/860-EVO-250GB-causing-freezes-on-AMD-system/td-p/575813/page/4

I’ve read that these kinds of errors can also be due to bad cables and bad power supplies. My system was working fine with all the same components and cables and an 850 EVO as its root drive. I’ve re-seated all cables and swapped them around, with no change.

In my case the kernel seems to have managed to recover from the errors and when compared with backups I’ve yet to discover any corruption. I’ve disabled NCQ and the errors have not recurred.
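
One common way to disable NCQ for a single drive is the `libata.force` kernel boot parameter with the `noncq` value. That is an assumption about method: the post above does not say how NCQ was disabled. The sketch below just builds the parameter string from the device name seen in the log (`ata4.00` here); `noncq_param` is a hypothetical helper name.

```shell
# Turn a libata device name from the kernel log (e.g. "ata4.00") into a
# libata.force boot parameter that disables NCQ for that device only.
noncq_param() {
    id=${1#ata}                          # strip the "ata" prefix: "4.00"
    printf 'libata.force=%s:noncq\n' "$id"
}

noncq_param ata4.00   # prints libata.force=4.00:noncq
```

The resulting string would go on the kernel command line (for example via the bootloader configuration) and takes effect on the next boot.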

The problem occurs regardless of OS: it also exists with Leap and Windows.

Disabling NCQ cripples SSD performance.
Leaving NCQ enabled may lead to unrecoverable errors, and it also cripples performance, since error recovery lowers the SATA link speed and applies other fallbacks.

The best way: use another SSD.

Other ways:

1. Use another controller. An ASMedia ASM1061 with updated AHCI firmware is good enough.

Or use an NVMe drive instead:

2. For UEFI boot: add an NVMe driver to the BIOS, or load it from some other media (a FAT16/FAT32 partition) before using the drive.
3. For legacy boot: keep a separate “/boot” on SATA/IDE/USB media and put the rest on the NVMe drive.

AFAIK this drive has problems with Linux’s queued TRIM.

Thanks for the warning. From what I’ve googled it seems this problem was resolved ages ago, probably way before I bought mine. Anyway the fstrim timer has been running for months (years) with no issues on two desktops here.

I ran some read jobs on my desktop that reach about 525 MB/s at queue depth 31, then repeated them at queue depths 1 and 8, setting the depth like this (8 shown):

echo 8 > /sys/block/sdb/device/queue_depth
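
A slightly more defensive version of that one-liner, as a sketch: it checks the value and reads it back afterwards. The function name and the `SYSFS_ROOT` override are assumptions added here so it can be tried against a scratch directory first. A value written this way does not survive a reboot.

```shell
# Hypothetical helper: set the queue depth of one block device and print
# the value read back. SYSFS_ROOT is only for dry runs; it defaults to /sys.
set_queue_depth() {
    dev=$1
    depth=$2
    f="${SYSFS_ROOT:-/sys}/block/$dev/device/queue_depth"
    # Accept only plain positive integers.
    case $depth in
        ''|*[!0-9]*) echo "depth must be a positive integer" >&2; return 1 ;;
    esac
    [ "$depth" -ge 1 ] || { echo "depth must be at least 1" >&2; return 1; }
    [ -w "$f" ] || { echo "cannot write $f (run as root?)" >&2; return 1; }
    echo "$depth" > "$f" && cat "$f"
}

# Usage (as root):
#   set_queue_depth sdb 8
```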

A queue depth of 1 dropped the read rate to 455 MB/s. It’s a 13.3% loss, but not a crippling loss (the loss might be bigger for other kinds of workloads). In the name of stability I can live with that (for a desktop PC).
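
The 13.3% figure is just the relative drop from 525 MB/s to 455 MB/s. A small sketch of the arithmetic, with `pct_loss` as a made-up helper name:

```shell
# Relative throughput loss in percent, given the baseline and the reduced
# rate (both in the same unit, MB/s here).
pct_loss() {
    awk -v base="$1" -v now="$2" \
        'BEGIN { printf "%.1f\n", (base - now) / base * 100 }'
}

pct_loss 525 455   # prints 13.3
```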

The reason I chose to test 8 was this blog post:

https://strugglers.net/~andy/blog/2015/08/09/ssds-and-linux-native-command-queuing/

It seems to show that for some usage scenarios, particularly with server-class products, disabling NCQ has a bigger impact.

I wonder if 8 might be more stable than 31 for the 860 EVO and AMD controllers, because it seems to perform as well as 31?

Switching to another controller is a good suggestion. But Google reveals that different controllers and firmware variants may have different issues, so it will have to wait until I’m hungry for another 13% of throughput. I think for my desktop use, this might be good enough.