Received "ata3: SError: { CommWake }" error messages for several weeks

Hi All,

A few weeks ago (on the 6th of April), after a Tumbleweed upgrade, I started getting strange ATA error messages complaining about CommWake errors and timeouts.

My setup:

  • Acer laptop from 2016-2017
  • 2 LITEON SSDs (2 x 256 GB), combined through an LVM layer

Details:

  • The issue started happening after I upgraded to the Tumbleweed snapshot 20200402 with kernel-default 5.6.0
  • After this upgrade, sometimes when I booted up the computer and opened Mozilla Firefox as the first thing, my computer gradually froze for ~30 seconds and dmesg had the following error messages:

<a lot of other "testing the buffer" messages here>
   27.302899] testing the buffer
<etc>
   68.668156] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
  284.406129] **fuse: init (API version 7.31)**
  316.404915] ata3.00: exception Emask 0x0 SAct 0x2003ff SErr 0x40000 action 0x6 frozen
  316.404924] **ata3: SError: { CommWake }**
  316.404930] ata3.00: failed command: READ FPDMA QUEUED
  316.404942] ata3.00: cmd 60/80:00:a0:54:1a/00:00:17:00:00/40 tag 0 ncq dma 65536 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 **Emask 0x4 (timeout)**
  316.404946] ata3.00: status: { DRDY }
  316.404950] ata3.00: failed command: READ FPDMA QUEUED
  316.404959] ata3.00: cmd 60/80:08:a0:55:1a/00:00:17:00:00/40 tag 1 ncq dma 65536 in
                        res 40/00:00:01:4f:c2/00:00:00:00:00/00 **Emask 0x4 (timeout)**
  316.404963] ata3.00: status: { DRDY }
<similar errors repeated multiple times>
  316.405106] ata3: **hard resetting link**
  316.716924] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
  316.720743] ata3.00: configured for UDMA/133
  316.720763] ahci 0000:00:17.0: port does not support device sleep
  316.720864] ata3.00: **device reported invalid CHS sector 0**
  316.720870] ata3.00: device reported invalid CHS sector 0
  316.720874] ata3.00: device reported invalid CHS sector 0
  316.720883] ata3.00: device reported invalid CHS sector 0
  316.720887] ata3.00: device reported invalid CHS sector 0
  316.720891] ata3.00: device reported invalid CHS sector 0
  316.720920] sd 2:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
  316.720929] sd 2:0:0:0: [sdb] tag#0 Sense Key : Illegal Request [current] 
  316.720936] sd 2:0:0:0: **[sdb] tag#0 Add. Sense: Unaligned write command**
  316.720944] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 17 1a 54 a0 00 00 80 00
  316.720953] blk_update_request: I/O error, dev sdb, sector 387601568 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
  316.721015] sd 2:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
  316.721021] sd 2:0:0:0: [sdb] tag#1 Sense Key : Illegal Request [current] 
  316.721027] sd 2:0:0:0: [sdb] tag#1 Add. Sense: Unaligned write command
  316.721034] sd 2:0:0:0: [sdb] tag#1 CDB: Read(10) 28 00 17 1a 55 a0 00 00 80 00
  316.721040] blk_update_request: I/O error, dev sdb, sector 387601824 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
<similar errors repeated multiple times>
  316.721539] ata3: **EH complete**
  644.678357] BTRFS info (device dm-0): qgroup scan completed (inconsistency flag cleared)

  • side note: I don’t remember seeing these “testing the buffer” messages and the drm warning before, so they are probably new with the 5.6.0 kernel. I didn’t capture that part of the log, but they started after something like “starting bpFilter”. I don’t think they are relevant to the ata issue.
  • note: Opening Firefox for the first time after boot triggers the initialization of fuse, and I almost always received the above errors right after "fuse: init". There was only one exception to this pattern, and I couldn’t find a reason for it: there was no “fuse: init” at that moment, and I was just doing some work in a VirtualBox environment ~300 seconds after boot.
  • I checked the SMART status every time after this error, and there is nothing alarming in there except “Command_Timeout” > 0. I kept an eye on that value and it didn’t increase at all during these weeks. The only values increasing are Power_Cycle_Count, Wear_Leveling_Count, Total_LBAs_Written and Total_LBAs_Read:

> sudo smartctl --all /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.0-1-default] (SUSE RPM)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     LITEON <...>
Serial Number:    <...>
LU WWN Device Id: <...>
Firmware Version: <...>
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr  6 20:56:37 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: **PASSED**

General SMART Values:
Offline data collection status:  (0x03) Offline data collection activity
                                        is in progress.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 243) Self-test routine in progress...
                                        30% of test remaining.
Total time to complete Offline 
data collection:                (  600) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       144
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       1792
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       41543
178 Used_Rsvd_Blk_Cnt_Chip  0x0003   100   100   000    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
182 Erase_Fail_Count_Total  0x0003   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       0
**188 Command_Timeout         0x0003   100   100   000    Pre-fail  Always       -       20**
189 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       44
191 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       0
192 Power-Off_Retract_Count 0x0003   100   100   000    Pre-fail  Always       -       66
196 Reallocated_Event_Count 0x0003   100   100   000    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0003   100   100   000    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       0
232 Available_Reservd_Space 0x0003   100   100   010    Pre-fail  Always       -       100
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       94988
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       104448

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       142         -
# 2  Extended offline    Completed without error       00%       142         -
# 3  Extended offline    Completed without error       00%       142         -
# 4  Extended offline    Completed without error       00%       142         -
# 5  Extended offline    Completed without error       00%       142         -
# 6  Extended offline    Completed without error       00%       142         -
# 7  Extended offline    Completed without error       00%       142         -
# 8  Extended offline    Interrupted (host reset)      90%         5         -

Selective Self-tests/Logging not supported

  • The issue continued through the next several snapshots, up until kernel 5.6.4 with the **20200421** snapshot, and I haven’t had any of the errors since then.
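
In case it helps anyone tracking the same thing: comparing the SMART raw values by eye gets tedious, so here is a rough Python sketch that pulls the RAW_VALUE column out of saved `smartctl --all` text and reports which attributes changed between two snapshots. The function names `parse_raw_values` and `changed` are made up for this example, and the regex assumes the attribute table layout shown above (including the bold markers I added around the Command_Timeout row); other drives may print the table differently.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
# Leading non-word characters (spaces, "**" emphasis) are skipped.
ROW = re.compile(
    r"^\W*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def parse_raw_values(smartctl_text):
    """Return {attribute_id: (name, raw_value)} from saved smartctl output."""
    values = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if m:
            values[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return values

def changed(before, after):
    """Return {id: (name, old_raw, new_raw)} for attributes that differ."""
    return {
        attr_id: (name, before[attr_id][1], new_raw)
        for attr_id, (name, new_raw) in after.items()
        if attr_id in before and before[attr_id][1] != new_raw
    }

# Two tiny hand-made snapshots, only for demonstration.
old = "188 Command_Timeout  0x0003  100  100  000  Pre-fail  Always  -  20"
new = "188 Command_Timeout  0x0003  100  100  000  Pre-fail  Always  -  22"
print(changed(parse_raw_values(old), parse_raw_values(new)))
# {188: ('Command_Timeout', 20, 22)}
```

Attributes are keyed by ID rather than by name because this drive reports two different attributes as "Unknown_SSD_Attribute".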

Do you think it could be a temporary kernel issue, or should I worry about the health of my SSD?

Thank you very much,
Regards,
vetko

Unfortunately, I started to get similar errors again, but this time they are not related to FUSE init, and the computer no longer freezes for 30s; in fact, I didn’t notice any freeze at all.
The Command_Timeout has increased from 20 to 22 in smartctl.
Kernel is: 5.7+


 6508.657545] ata2.00: exception Emask 0x0 SAct 0xf8 SErr 0x40000 action 0x6 frozen
 6508.657548] ata2: SError: { CommWake }
 6508.657549] ata2.00: failed command: READ FPDMA QUEUED
 6508.657552] ata2.00: cmd 60/00:18:70:cd:d8/01:00:06:00:00/40 tag 3 ncq dma 131072 in
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)                                                                                                                                                                          
 6508.657554] ata2.00: status: { DRDY }
 6508.657555] ata2.00: failed command: READ FPDMA QUEUED
 6508.657557] ata2.00: cmd 60/00:20:70:d2:d8/01:00:06:00:00/40 tag 4 ncq dma 131072 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)                                                                                                                                                                          
 6508.657558] ata2.00: status: { DRDY }
<similar failed command sections repeating multiple times>
 6508.657573] ata2: hard resetting link
 6508.972468] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 6508.975908] ata2.00: configured for UDMA/133
 6508.975915] ahci 0000:00:17.0: port does not support device sleep
 6508.975948] ata2: EH complete
 6552.953372] ata2.00: exception Emask 0x0 SAct 0x80fc0101 SErr 0x40000 action 0x6 frozen
 6552.953386] ata2: SError: { CommWake }
 6552.953393] ata2.00: failed command: READ FPDMA QUEUED
 6552.953409] ata2.00: cmd 60/a8:00:80:49:49/05:00:07:00:00/40 tag 0 ncq dma 741376 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)                                                                                                                                                                          
 6552.953414] ata2.00: status: { DRDY }
 6552.953420] ata2.00: failed command: WRITE FPDMA QUEUED
 6552.953432] ata2.00: cmd 61/10:40:08:5f:e2/00:00:19:00:00/40 tag 8 ncq dma 8192 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)                                                                                                                                                                          
 6552.953437] ata2.00: status: { DRDY }
<similar failed command sections repeating multiple times>
 6552.953606] ata2: hard resetting link
 6553.267998] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 6553.271386] ata2.00: configured for UDMA/133
 6553.271396] ahci 0000:00:17.0: port does not support device sleep
 6553.271429] ata2.00: device reported invalid CHS sector 0
 6553.271444] ata2: EH complete
 6814.839572] ata2.00: exception Emask 0x0 SAct 0x408000ff SErr 0x40000 action 0x6 frozen
 6814.839575] ata2: SError: { CommWake }
 6814.839577] ata2.00: failed command: READ FPDMA QUEUED
 6814.839580] ata2.00: cmd 60/00:00:70:cc:d8/01:00:06:00:00/40 tag 0 ncq dma 131072 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
 6814.839581] ata2.00: status: { DRDY }
 6814.839582] ata2.00: failed command: READ FPDMA QUEUED
 6814.839586] ata2.00: cmd 60/00:08:70:d1:d8/01:00:06:00:00/40 tag 1 ncq dma 131072 in
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 6814.839587] ata2.00: status: { DRDY }
<similar failed command sections repeating multiple times>
 6814.839636] ata2: hard resetting link
 6815.154034] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 6815.157523] ata2.00: configured for UDMA/133
 6815.157532] ahci 0000:00:17.0: port does not support device sleep
 6815.157591] ata2: EH complete
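
To make the dumps above a bit easier to read, here is a small Python sketch that decodes the "exception" and "cmd" lines. It is only my reading of the format, not an official tool. The SErr bit labels are the ones the kernel prints inside the `SError: { ... }` braces, SAct is treated as a bitmask of the NCQ tags that were in flight, and the "cmd" bytes are assumed to be command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, with the sector count in the feature bytes and the tag in the upper bits of nsect for NCQ commands. The helper names (`set_bits`, `decode_exception`, `decode_ncq_cmd`) are invented for this example.

```python
import re

# SError bit position -> label as printed by the kernel's ATA error handler.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}

def set_bits(mask):
    """Return the positions of the bits set in a 32-bit mask, lowest first."""
    return [bit for bit in range(32) if (mask >> bit) & 1]

def decode_exception(line):
    """Decode SAct (in-flight NCQ tags) and SErr from an 'exception' line."""
    m = re.search(r"SAct (0x[0-9a-f]+) SErr (0x[0-9a-f]+)", line)
    sact, serr = int(m.group(1), 16), int(m.group(2), 16)
    return {
        "active_tags": set_bits(sact),
        "serror": [SERR_BITS.get(b, f"bit{b}") for b in set_bits(serr)],
    }

def decode_ncq_cmd(line):
    """Decode an NCQ 'cmd' line into opcode, tag, sector count and start LBA."""
    m = re.search(r"cmd ([0-9a-f/:]+)", line)
    groups = [[int(x, 16) for x in g.split(":")] for g in m.group(1).split("/")]
    cmd, low, high, _device = groups
    feature, nsect, lbal, lbam, lbah = low
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = high
    lba = 0
    for byte in (hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal):
        lba = (lba << 8) | byte  # assemble the 48-bit LBA, high byte first
    return {
        "opcode": cmd[0],
        "tag": nsect >> 3,
        "sectors": (hob_feature << 8) | feature,
        "lba": lba,
    }

print(decode_exception(
    "ata3.00: exception Emask 0x0 SAct 0x2003ff SErr 0x40000 action 0x6 frozen"))
print(decode_ncq_cmd(
    "ata3.00: cmd 60/80:00:a0:54:1a/00:00:17:00:00/40 tag 0 ncq dma 65536 in"))
```

For the first failed command in my original post this gives 128 sectors (the 65536 bytes on the same line) starting at LBA 387601568, which is the same sector that blk_update_request reports, so the byte order seems right at least for these logs. SErr 0x40000 is just bit 18, which is CommWake.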

Does anybody have experience with this kind of issue?

Thank you,
vetko

I’ve seen this a couple of times, once with a failing drive and once with failing SATA ports on the motherboard.
So I’d guess it’s a hardware problem, but YMMV. If this were a desktop, you could try changing the SATA cable.

Further to the above, I have no idea how, if at all, LVM would affect this.
Also, the CHS0 errors are interesting. IINM, on an MBR disk that’s the sector where the MBR code is written. AFAIR, when that sector goes, the drive is gone.

Again, just guessing.

Maybe yes, maybe no. Maybe the drive will work with GPT after reformatting.

To OP: check the cables and power supply. Test the drive in other hardware. A BIOS update may also help.