Received "ata3: SError: { CommWake }" error messages for several weeks

vetko · April 29, 2020, 1:02am

Hi All,

A few weeks ago (on the 6th of April), after a Tumbleweed upgrade, I started getting strange ATA error messages complaining about CommWake errors and timeouts.

My setup:

Acer laptop from 2016-2017
2 LITEON SSDs (2 x 256 GB), combined through an LVM layer

Details:

The issue started happening after I upgraded to the Tumbleweed snapshot 20200402 with kernel-default 5.6.0.
After this upgrade, when I booted the computer and opened Mozilla Firefox as the first thing, the computer would sometimes gradually freeze for ~30 seconds, and dmesg showed the following error messages:


<a lot of other "testing the buffer" messages here>
[   27.302899] testing the buffer
<etc>
[   68.668156] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[  284.406129] **fuse: init (API version 7.31)**
[  316.404915] ata3.00: exception Emask 0x0 SAct 0x2003ff SErr 0x40000 action 0x6 frozen
[  316.404924] **ata3: SError: { CommWake }**
[  316.404930] ata3.00: failed command: READ FPDMA QUEUED
[  316.404942] ata3.00: cmd 60/80:00:a0:54:1a/00:00:17:00:00/40 tag 0 ncq dma 65536 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 **Emask 0x4 (timeout)**
[  316.404946] ata3.00: status: { DRDY }
[  316.404950] ata3.00: failed command: READ FPDMA QUEUED
[  316.404959] ata3.00: cmd 60/80:08:a0:55:1a/00:00:17:00:00/40 tag 1 ncq dma 65536 in
                        res 40/00:00:01:4f:c2/00:00:00:00:00/00 **Emask 0x4 (timeout)**
[  316.404963] ata3.00: status: { DRDY }
<similar errors repeated multiple times>
[  316.405106] ata3: **hard resetting link**
[  316.716924] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  316.720743] ata3.00: configured for UDMA/133
[  316.720763] ahci 0000:00:17.0: port does not support device sleep
[  316.720864] ata3.00: **device reported invalid CHS sector 0**
[  316.720870] ata3.00: device reported invalid CHS sector 0
[  316.720874] ata3.00: device reported invalid CHS sector 0
[  316.720883] ata3.00: device reported invalid CHS sector 0
[  316.720887] ata3.00: device reported invalid CHS sector 0
[  316.720891] ata3.00: device reported invalid CHS sector 0
[  316.720920] sd 2:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
[  316.720929] sd 2:0:0:0: [sdb] tag#0 Sense Key : Illegal Request [current]
[  316.720936] sd 2:0:0:0: **[sdb] tag#0 Add. Sense: Unaligned write command**
[  316.720944] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 17 1a 54 a0 00 00 80 00
[  316.720953] blk_update_request: I/O error, dev sdb, sector 387601568 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  316.721015] sd 2:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
[  316.721021] sd 2:0:0:0: [sdb] tag#1 Sense Key : Illegal Request [current]
[  316.721027] sd 2:0:0:0: [sdb] tag#1 Add. Sense: Unaligned write command
[  316.721034] sd 2:0:0:0: [sdb] tag#1 CDB: Read(10) 28 00 17 1a 55 a0 00 00 80 00
[  316.721040] blk_update_request: I/O error, dev sdb, sector 387601824 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
<similar errors repeated multiple times>
[  316.721539] ata3: **EH complete**
[  644.678357] BTRFS info (device dm-0): qgroup scan completed (inconsistency flag cleared)
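
For anyone reading the log above, the `cmd` field of each failed command can be unpacked to see which blocks were being read. This is only a minimal Python sketch. It assumes the field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device (my reading of how libata prints the taskfile, not something stated in the post) and the 512-byte logical sectors that smartctl reports for this drive. The name `decode_ncq_cmd` is made up for illustration.

```python
# Sketch: unpack the "cmd" field that libata prints for a failed NCQ command.
# ASSUMPTION: field order is
#   command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# ASSUMPTION: for NCQ commands the sector count is carried in the feature
# registers and the tag in the upper bits of nsect.

NCQ_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_ncq_cmd(cmd, sector_size=512):
    """Return opcode name, tag, 48-bit LBA and transfer size for one cmd string."""
    head, low, high, _device = cmd.split("/")
    command = int(head, 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )

    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # A count of 0 means 65536 sectors for these commands.
    sectors = ((hob_feature << 8) | feature) or 65536

    return {
        "op": NCQ_OPCODES.get(command, hex(command)),
        "tag": nsect >> 3,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


if __name__ == "__main__":
    # The two failed commands quoted in the log above.
    for cmd in ("60/80:00:a0:54:1a/00:00:17:00:00/40",
                "60/80:08:a0:55:1a/00:00:17:00:00/40"):
        print(decode_ncq_cmd(cmd))
```

For the first failed command this yields tag 0, LBA 387601568 and 65536 bytes, which agrees with the "tag 0 ncq dma 65536" and "sector 387601568" values in the same log. That agreement is the only evidence I have for the assumed field order.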

Side note: I don’t remember seeing these “testing the buffer” messages or the drm warning before, so they are probably new with the 5.6.0 kernel. I didn’t capture that part of the log, but they were logged after something like “starting bpFilter”. I don’t think they are relevant to the ata issue.
Note: Opening Firefox for the first time after boot triggers the initialization of fuse, and I received the above errors almost always right after "fuse: init". There was only one exception to this pattern, which I couldn’t explain: there was no “fuse: init” at that moment, and I was just doing some work in a VirtualBox environment ~300 seconds after boot.
I checked the SMART status every time after this error, and there is nothing alarming in there except “Command_Timeout” > 0. I kept an eye on that value and it didn’t increase at all during these weeks. The only values increasing are Power_Cycle_Count, Wear_Leveling_Count, Total_LBAs_Written and Total_LBAs_Read:


> sudo smartctl --all /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.0-1-default] (SUSE RPM)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     LITEON <...>
Serial Number:    <...>
LU WWN Device Id: <...>
Firmware Version: <...>
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr  6 20:56:37 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: **PASSED**

General SMART Values:
Offline data collection status:  (0x03) Offline data collection activity
                                        is in progress.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 243) Self-test routine in progress...
                                        30% of test remaining.
Total time to complete Offline 
data collection:                (  600) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       144
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       1792
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       41543
178 Used_Rsvd_Blk_Cnt_Chip  0x0003   100   100   000    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
182 Erase_Fail_Count_Total  0x0003   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       0
**188 Command_Timeout         0x0003   100   100   000    Pre-fail  Always       -       20**
189 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       44
191 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       0
192 Power-Off_Retract_Count 0x0003   100   100   000    Pre-fail  Always       -       66
196 Reallocated_Event_Count 0x0003   100   100   000    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0003   100   100   000    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       0
232 Available_Reservd_Space 0x0003   100   100   010    Pre-fail  Always       -       100
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       94988
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       104448

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       142         -
# 2  Extended offline    Completed without error       00%       142         -
# 3  Extended offline    Completed without error       00%       142         -
# 4  Extended offline    Completed without error       00%       142         -
# 5  Extended offline    Completed without error       00%       142         -
# 6  Extended offline    Completed without error       00%       142         -
# 7  Extended offline    Completed without error       00%       142         -
# 8  Extended offline    Interrupted (host reset)      90%         5         -

Selective Self-tests/Logging not supported
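
Checking by hand whether Command_Timeout has moved, as described above, can be scripted. This is a minimal Python sketch that assumes the attribute table layout shown in the smartctl output above (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value). The names `parse_attributes` and `increased` are illustrative. The "later" sample is a hypothetical reading constructed for the example, not captured output.

```python
import re

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED,
# WHEN_FAILED, then the raw value. Leading "*" or spaces are tolerated.
ROW = re.compile(
    r"^\W*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) from smartctl attribute text."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def increased(before, after):
    """Return attributes whose raw value grew: ID -> (name, old, new)."""
    return {
        attr_id: (name, before[attr_id][1], raw)
        for attr_id, (name, raw) in after.items()
        if attr_id in before and raw > before[attr_id][1]
    }


if __name__ == "__main__":
    earlier = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0003   100   100   000    Pre-fail  Always       -       20
"""
    # Hypothetical later reading, made up for this example.
    later = earlier.replace("-       20", "-       22")
    print(increased(parse_attributes(earlier), parse_attributes(later)))
```

The text to feed in would come from something like `sudo smartctl -A /dev/sdb`, saved once after each incident. Keying by attribute ID rather than name is deliberate, since two rows in the output above share the name Unknown_SSD_Attribute.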

The issue continued through the next several snapshots, up until kernel 5.6.4 with the **20200421** snapshot, and I haven’t had any of the errors since then.

Do you think it could be a temporary kernel issue, or should I worry about the health of my SSD?

Thank you very much,
Regards,
vetko

vetko · July 3, 2020, 1:05am

Unfortunately, I started to get similar errors again, but this time they are not related to FUSE init, and the computer no longer freezes for 30s. In fact, I didn’t notice any freeze.
The Command_Timeout has increased from 20 to 22 in smartctl.
The kernel is 5.7+.


[ 6508.657545] ata2.00: exception Emask 0x0 SAct 0xf8 SErr 0x40000 action 0x6 frozen
[ 6508.657548] ata2: SError: { CommWake }
[ 6508.657549] ata2.00: failed command: READ FPDMA QUEUED
[ 6508.657552] ata2.00: cmd 60/00:18:70:cd:d8/01:00:06:00:00/40 tag 3 ncq dma 131072 in
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 6508.657554] ata2.00: status: { DRDY }
[ 6508.657555] ata2.00: failed command: READ FPDMA QUEUED
[ 6508.657557] ata2.00: cmd 60/00:20:70:d2:d8/01:00:06:00:00/40 tag 4 ncq dma 131072 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 6508.657558] ata2.00: status: { DRDY }
<similar failed command sections repeated multiple times>
[ 6508.657573] ata2: hard resetting link
[ 6508.972468] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 6508.975908] ata2.00: configured for UDMA/133
[ 6508.975915] ahci 0000:00:17.0: port does not support device sleep
[ 6508.975948] ata2: EH complete
[ 6552.953372] ata2.00: exception Emask 0x0 SAct 0x80fc0101 SErr 0x40000 action 0x6 frozen
[ 6552.953386] ata2: SError: { CommWake }
[ 6552.953393] ata2.00: failed command: READ FPDMA QUEUED
[ 6552.953409] ata2.00: cmd 60/a8:00:80:49:49/05:00:07:00:00/40 tag 0 ncq dma 741376 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 6552.953414] ata2.00: status: { DRDY }
[ 6552.953420] ata2.00: failed command: WRITE FPDMA QUEUED
[ 6552.953432] ata2.00: cmd 61/10:40:08:5f:e2/00:00:19:00:00/40 tag 8 ncq dma 8192 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 6552.953437] ata2.00: status: { DRDY }
<similar failed command sections repeated multiple times>
[ 6552.953606] ata2: hard resetting link
[ 6553.267998] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 6553.271386] ata2.00: configured for UDMA/133
[ 6553.271396] ahci 0000:00:17.0: port does not support device sleep
[ 6553.271429] ata2.00: device reported invalid CHS sector 0
[ 6553.271444] ata2: EH complete
[ 6814.839572] ata2.00: exception Emask 0x0 SAct 0x408000ff SErr 0x40000 action 0x6 frozen
[ 6814.839575] ata2: SError: { CommWake }
[ 6814.839577] ata2.00: failed command: READ FPDMA QUEUED
[ 6814.839580] ata2.00: cmd 60/00:00:70:cc:d8/01:00:06:00:00/40 tag 0 ncq dma 131072 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 6814.839581] ata2.00: status: { DRDY }
[ 6814.839582] ata2.00: failed command: READ FPDMA QUEUED
[ 6814.839586] ata2.00: cmd 60/00:08:70:d1:d8/01:00:06:00:00/40 tag 1 ncq dma 131072 in
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 6814.839587] ata2.00: status: { DRDY }
<similar failed command sections repeated multiple times>
[ 6814.839636] ata2: hard resetting link
[ 6815.154034] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 6815.157523] ata2.00: configured for UDMA/133
[ 6815.157532] ahci 0000:00:17.0: port does not support device sleep
[ 6815.157591] ata2: EH complete
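
The `SErr` and `SAct` values on each "exception" line above are bit masks, and decoding them shows what the controller flagged and how many queued commands were in flight when the port froze. This is a minimal Python sketch, not an authoritative decoder. The bit-to-name table reflects the SError names as I recall them from libata's error handler and should be treated as an assumption to check against the kernel source. The function names are made up for illustration.

```python
# ASSUMPTION: SError bit positions and short names as printed by libata
# (e.g. "SError: { CommWake }"). Verify against the kernel source before
# relying on any entry other than the one seen in the logs above.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serr(value):
    """Names of all bits set in an SErr value; unknown bits show as 'bitN'."""
    return [SERR_BITS.get(bit, f"bit{bit}") for bit in range(32) if value >> bit & 1]


def active_tags(sact):
    """NCQ tags (0..31) marked outstanding in an SAct value."""
    return [tag for tag in range(32) if sact >> tag & 1]


if __name__ == "__main__":
    print(decode_serr(0x40000))
    # SAct values from the three exceptions in the log above.
    for sact in (0xF8, 0x80FC0101, 0x408000FF):
        print(hex(sact), active_tags(sact))
```

With these inputs, `0x40000` decodes to the single CommWake bit in every incident, and `0xf8` decodes to tags 3 through 7, which is consistent with the "tag 3" and "tag 4" failed commands listed right after that exception line.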

Does anybody have experience with this kind of issue?

Thank you,
vetko

brunomcl · July 3, 2020, 10:00pm

I’ve seen this a couple of times, once with a failing drive and once with failing SATA ports on the motherboard.
So I’d guess it’s a hardware problem, but YMMV. If it were a desktop, you could try changing the SATA cable.

brunomcl · July 3, 2020, 10:04pm

Further to the above, I have no idea how, if at all, LVM would affect this.
Also, the "CHS sector 0" errors are interesting. IINM, on an MBR disk that’s the sector where the MBR code is written. AFAIR, when that sector goes, the drive is gone.

Again, just guessing.

Svyatko · July 6, 2020, 2:56pm

Maybe yes, maybe no. Maybe the drive will work with GPT after reformatting.

To OP: check the cables and the power supply. Test the drive in other hardware. A BIOS update may also help.