Hey there,
This is my first post, so please correct me if I did something wrong.
I will first describe my problem, then ask for ideas about what the cause could be.
Problem Description: I am experiencing intermittent but severe system freezes (approx. 9-10 seconds), specifically while gaming (when the game saves) or when loading “complex” web pages. The system does not fully crash, but I/O hangs completely for several seconds before recovering.
This behavior started appearing recently, likely correlating with the update to Kernel 6.18.9-1-default or to the Mesa 26 stack. Unfortunately, I am not sure exactly when it started. Sometimes I boot the PC and it just works, sometimes these freezes happen every 5 minutes. The system was very stable for months prior to these updates.
Investigation & Logs: dmesg -w reveals that the NVMe drive is timing out and aborting write requests. It seems the drive enters a power-saving state or fails to send an interrupt during heavy GPU bus load, leading to a “lost interrupt” scenario.
Here are the relevant logs capturing the freeze:
[ 4400.554247] [ T290] nvme nvme0: I/O tag 264 (2108) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:262144
[ 4400.554257] [ T290] nvme nvme0: I/O tag 265 (2109) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:262144
[ 4400.554260] [ T290] nvme nvme0: I/O tag 266 (e10a) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:262144
[ 4400.554261] [ T290] nvme nvme0: I/O tag 267 (d10b) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:262144
...
[ 4408.142488] [ C10] nvme nvme0: Abort status: 0x0
[ 4408.193624] [ C10] nvme nvme0: Abort status: 0x0
[ 4408.204505] [ C10] nvme nvme0: Abort status: 0x0
[ 4408.220136] [ C10] nvme nvme0: Abort status: 0x0
Also, sporadic timeouts occurred:
[ 276.938426] [ T488] nvme nvme0: I/O tag 150 (8096) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:262144
...
[ 290.834301] [ C10] nvme nvme0: Abort status: 0x0
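To put a number on each stall, the log lines can be paired up by timestamp. The following is a minimal Python sketch, not a finished tool: the function name nvme_episodes and the 30-second grouping window are my own assumptions, and the sample only contains a few of the lines quoted above (the elided "..." lines are not filled in).

```python
import re

# Kernel log lines of interest, for example:
#   [ 4400.554247] [ T290] nvme nvme0: I/O tag 264 (2108) ... timeout, aborting ...
#   [ 4408.142488] [ C10] nvme nvme0: Abort status: 0x0
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s*(?:\[[^\]]*\]\s*)?nvme (?P<dev>nvme\d+): (?P<msg>.*)$"
)


def nvme_episodes(lines, gap=30.0):
    """Group NVMe timeout/abort messages lying within `gap` seconds of each other.

    `gap` is an arbitrary grouping window chosen for this sketch.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        msg = m["msg"]
        is_timeout = "timeout, aborting" in msg
        if is_timeout or msg.startswith("Abort status"):
            events.append((float(m["ts"]), is_timeout))
    events.sort()  # tolerate excerpts pasted out of chronological order

    episodes = []
    for ts, is_timeout in events:
        if not episodes or ts - episodes[-1]["end"] > gap:
            episodes.append({"start": ts, "end": ts, "timeouts": 0})
        episodes[-1]["end"] = ts
        episodes[-1]["timeouts"] += int(is_timeout)
    return episodes


# A few lines copied from the logs above (excerpt only).
SAMPLE = """\
[ 4400.554247] [ T290] nvme nvme0: I/O tag 264 (2108) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:262144
[ 4408.142488] [ C10] nvme nvme0: Abort status: 0x0
[ 276.938426] [ T488] nvme nvme0: I/O tag 150 (8096) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:262144
[ 290.834301] [ C10] nvme nvme0: Abort status: 0x0
"""

for ep in nvme_episodes(SAMPLE.splitlines()):
    print(f"t={ep['start']:.1f}s: {ep['timeouts']} timeout(s), "
          f"{ep['end'] - ep['start']:.1f}s until the last abort status")
```

For real use, SAMPLE.splitlines() could be replaced with sys.stdin and the output of dmesg piped in. The time between the first timeout message and the last abort status is only a rough proxy for how long the freeze felt.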
Troubleshooting Steps Taken:
GPU and CPU are in performance mode/governor.
I have verified that the SSD is physically installed in the top M.2 slot (CPU lanes, Bus 04:00.0) via lspci -tv, so it is not routed through the Promontory chipset.
I have applied the following fixes, which seem to mitigate the issue but require aggressive settings:
- Filesystem:
  - Disabled BTRFS quotas (btrfs quota disable /) and removed the discard mount option from /etc/fstab (replaced with the periodic fstrim timer) to reduce controller load.
- Kernel Parameters:
  - nvme_core.default_ps_max_latency_us=0 (to disable APST)
  - pcie_aspm.policy=performance (to disable ASPM/L1 states)
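Since these parameters only help if they actually took effect after the reboot, it is worth checking them at runtime. Below is a small Python sketch under the assumption that the usual procfs/sysfs locations are readable on this system; the helper names (read_text, parse_cmdline, active_policy) are made up for illustration.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in text.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def active_policy(text):
    """Return the bracketed entry, e.g. 'performance' in 'default [performance] powersave'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


# What was passed on the kernel command line.
cmdline = parse_cmdline(read_text("/proc/cmdline") or "")
for name in ("nvme_core.default_ps_max_latency_us", "pcie_aspm.policy"):
    print(f"cmdline {name} = {cmdline.get(name, '<not set>')}")

# What the modules report at runtime (None if the file is missing or unreadable).
apst = read_text("/sys/module/nvme_core/parameters/default_ps_max_latency_us")
print("nvme_core.default_ps_max_latency_us (runtime):", apst)

policy = read_text("/sys/module/pcie_aspm/parameters/policy")
print("active ASPM policy:", active_policy(policy) if policy else None)
```

If the runtime values differ from the command line, the parameter was probably not picked up by the bootloader configuration.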
Idea: My current theory is a conflict involving the Kioxia/Phison controller, the more aggressive power management in recent kernels (6.18 in my case), and potentially the IOMMU dropping interrupts during high bandwidth usage by the RDNA 4 GPU (Mesa 26). I have not been able to confirm this yet.
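One way to probe the lost-interrupt theory is to watch the per-queue NVMe interrupt counters in /proc/interrupts while a freeze is happening. The sketch below is only a rough diagnostic, assuming the usual layout where the queue name (e.g. nvme0q1) is the last field of the row; the function names and the one-second sampling interval are my own choices.

```python
import time


def nvme_irq_counts(text):
    """Sum the per-CPU counters of every /proc/interrupts row named nvme*."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[-1].startswith("nvme"):
            continue
        total = 0
        for token in fields[1:]:  # fields[0] is the IRQ number, e.g. "45:"
            if not token.isdigit():
                break  # reached the interrupt controller / trigger columns
            total += int(token)
        counts[fields[-1]] = counts.get(fields[-1], 0) + total
    return counts


def irq_delta(before, after):
    """Interrupts received per queue between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def snapshot(path="/proc/interrupts"):
    """Read and parse the current counters; empty dict if the file is unavailable."""
    try:
        with open(path) as fh:
            return nvme_irq_counts(fh.read())
    except OSError:
        return {}


before = snapshot()
time.sleep(1.0)
after = snapshot()
for name, delta in sorted(irq_delta(before, after).items()):
    print(f"{name}: +{delta} interrupts in 1 s")
```

If the counter of the affected queue stays flat during a stall while writes are outstanding, that would be consistent with interrupts not arriving, but it is not proof on its own.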
Next steps:
- Adding iommu=soft (or disabling IOMMU in the BIOS under AMD CBS → NBIO) seems to be the only way to reliably stop the QID timeouts. However, to be honest, I do not want to deactivate the IOMMU.
My question: has anybody else seen these issues, or does anyone have another idea what the problem could be?
Specs:
- OS: openSUSE Tumbleweed
- Kernel: Linux 6.18.9-1-default
- CPU: AMD Ryzen 7 9700X (Granite Ridge)
- Motherboard: ASRock B850 Pro-A WiFi
- GPU: AMD Radeon RX 9070 XT (Navi 48 / RDNA 4) using Mesa 26.0 (git/rc)
- Storage: Kioxia Exceria Plus G3 NVMe (DRAM-less), installed in M2_1 (CPU-attached slot).
- Filesystem: BTRFS
Thanks for your help!