I have two Compulab Fitlet2 units. They are small, semi-rugged PCs with an Atom CPU.
Both develop disk problems after running for a while (as little as 40 minutes, as long as 15 hours), ending up with this in the kernel log:
[ 1868.283972] ata1.00: exception Emask 0x0 SAct 0x300 SErr 0x0 action 0x6 frozen
[ 1868.283977] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1868.283986] ata1.00: cmd 61/58:40:78:e9:53/00:00:01:00:00/40 tag 8 ncq dma 45056 out
res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1868.283988] ata1.00: status: { DRDY }
[ 1868.283991] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1868.283998] ata1.00: cmd 61/08:48:40:d1:23/00:00:00:00:00/40 tag 9 ncq dma 4096 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1868.284000] ata1.00: status: { DRDY }
[ 1868.284006] ata1: hard resetting link
[ 1878.299458] ata1: softreset failed (device not ready)
[ 1878.299465] ata1: hard resetting link
From there it rapidly goes wrong:
[ 1928.562698] EXT4-fs (sda2): Remounting filesystem read-only
This happens on both boxes.
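As a reading aid for the excerpt above: as far as I understand libata's output, the SAct value on the exception line is a bitmask of the NCQ tags that were still outstanding when the port froze. A minimal Python sketch that expands it (the function name `active_tags` is hypothetical, made up for illustration):

```python
def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct bitmask.

    Assumption: bit N set means the command with NCQ tag N was outstanding.
    """
    return [tag for tag in range(32) if sact & (1 << tag)]


# SAct values copied from the two kernel log excerpts in this post.
print(active_tags(0x300))
print(active_tags(0x1fe00000))
```

For 0x300 this gives tags 8 and 9, which are exactly the two failed commands listed, so it looks like every outstanding write timed out rather than a single one.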
I have searched the internet for possible causes:
- Bad power supply: unlikely, since it happens on both boxes.
- Bad disk: unlikely, for the same reason.
- Bad Marvell firmware: this can cause similar errors, but these boxes don’t use a Marvell controller (see lspci below).
- Heavy load combined with an NCQ bug: these machines were idle (left overnight).
Nothing interesting shows up in “smartctl -a”.
The Linux Mint that Compulab usually installs allegedly doesn’t have these problems.
kernel:
Linux linux-gj0i 4.12.14-lp150.12.16-default #1 SMP Tue Aug 14 17:51:27 UTC 2018 (28574e6) x86_64 x86_64 x86_64 GNU/Linux
lspci:
linux-gj0i:~ # lspci
00:00.0 Host bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Host Bridge (rev 0b)
00:02.0 VGA compatible controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Integrated Graphics Controller (rev 0b)
00:0e.0 Audio device: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Audio Cluster (rev 0b)
00:0f.0 Communication controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Trusted Execution Engine (rev 0b)
00:12.0 SATA controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SATA AHCI Controller (rev 0b)
00:13.0 PCI bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #1 (rev fb)
00:13.1 PCI bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #2 (rev fb)
00:13.2 PCI bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #3 (rev fb)
00:15.0 USB controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series USB xHCI (rev 0b)
00:18.0 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #1 (rev 0b)
00:18.1 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #2 (rev 0b)
00:18.2 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #3 (rev 0b)
00:18.3 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #4 (rev 0b)
00:19.0 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SPI Controller #1 (rev 0b)
00:19.1 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SPI Controller #2 (rev 0b)
00:19.2 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SPI Controller #3 (rev 0b)
00:1b.0 SD Host controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SDXC/MMC Host Controller (rev 0b)
00:1c.0 SD Host controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series eMMC Controller (rev 0b)
00:1e.0 SD Host controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SDIO Controller (rev 0b)
00:1f.0 ISA bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Low Pin Count Interface (rev 0b)
00:1f.1 SMBus: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SMBus Controller (rev 0b)
01:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
smartctl:
linux-gj0i:~ # smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp150.11-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: NT-128
Serial Number: 087110270002
LU WWN Device Id: 0 000000 000000000
Firmware Version: Q0407A
User Capacity: 126,701,535,232 bytes [126 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Aug 17 12:27:52 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 1) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   050    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0002   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0000   100   100   050    Old_age   Offline      -       0
 12 Power_Cycle_Count       0x0000   100   100   050    Old_age   Offline      -       4
160 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       0
161 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       61
162 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       1
163 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       7
164 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       31
165 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       1
166 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       0
167 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       0
168 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       3000
169 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       100
192 Power-Off_Retract_Count 0x0000   100   100   050    Old_age   Offline      -       4
194 Temperature_Celsius     0x0000   100   100   050    Old_age   Offline      -       39
195 Hardware_ECC_Recovered  0x0000   100   100   050    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   050    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   050    Old_age   Offline      -       0
241 Total_LBAs_Written      0x0000   100   100   050    Old_age   Offline      -       148
242 Total_LBAs_Read         0x0000   100   100   050    Old_age   Offline      -       19
245 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       155
Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
Invalid Error Log index = 0x20 (T13/1321D rev 1c Section 8.41.6.8.2.2 gives valid range from 1 to 5)
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
Kernel log / dmesg: (truncated due to post size limit)
linux-gj0i:~ # dmesg
...
[   12.692225] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   12.909416] NET: Registered protocol family 17
[ 1543.174153] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[52270.160098] ata1.00: exception Emask 0x0 SAct 0x1fe00000 SErr 0x0 action 0x6 frozen
[52270.160104] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160112] ata1.00: cmd 61/08:a8:00:a8:0f/00:00:00:00:00/40 tag 21 ncq dma 4096 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160114] ata1.00: status: { DRDY }
[52270.160117] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160140] ata1.00: cmd 61/08:b0:80:a8:8f/00:00:00:00:00/40 tag 22 ncq dma 4096 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160141] ata1.00: status: { DRDY }
[52270.160143] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160149] ata1.00: cmd 61/08:b8:f8:aa:8f/00:00:00:00:00/40 tag 23 ncq dma 4096 out
res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160150] ata1.00: status: { DRDY }
[52270.160152] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160158] ata1.00: cmd 61/08:c0:18:ac:8f/00:00:00:00:00/40 tag 24 ncq dma 4096 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160159] ata1.00: status: { DRDY }
[52270.160161] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160167] ata1.00: cmd 61/08:c8:48:ac:8f/00:00:00:00:00/40 tag 25 ncq dma 4096 out
res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160169] ata1.00: status: { DRDY }
[52270.160170] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160176] ata1.00: cmd 61/08:d0:70:ac:8f/00:00:00:00:00/40 tag 26 ncq dma 4096 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160178] ata1.00: status: { DRDY }
[52270.160179] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160185] ata1.00: cmd 61/08:d8:38:ac:90/00:00:00:00:00/40 tag 27 ncq dma 4096 out
res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160187] ata1.00: status: { DRDY }
[52270.160188] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160194] ata1.00: cmd 61/28:e0:60:7a:54/00:00:01:00:00/40 tag 28 ncq dma 20480 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160196] ata1.00: status: { DRDY }
[52270.160201] ata1: hard resetting link
[52280.175583] ata1: softreset failed (device not ready)
[52280.175590] ata1: hard resetting link
[52290.191209] ata1: softreset failed (device not ready)
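To sanity-check what the drive was being asked to do, the "cmd" lines above can be decoded by hand. The following is a minimal Python sketch, not anything taken from the kernel sources. The field order is my reading of how libata prints the taskfile, and `decode_ncq_cmd` is a hypothetical helper name. It assumes the 512-byte sector size reported by smartctl and does not handle the special case of a zero sector count.

```python
import re

SECTOR_SIZE = 512  # logical sector size reported by smartctl in this post


def decode_ncq_cmd(line):
    """Decode the taskfile bytes of a libata 'cmd' log line for an NCQ command.

    Assumed field order (my reading of the libata log format):
    command / feature:nsect:lbal:lbam:lbah /
    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device
    """
    match = re.search(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})", line)
    if match is None:
        raise ValueError("no cmd field found in line")
    fields = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = fields

    # For NCQ commands the sector count travels in the feature registers
    # and the tag sits in the upper five bits of the sector-count register.
    sectors = (hob_feature << 8) | feature
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": command,  # 0x61 is WRITE FPDMA QUEUED
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
        "lba": lba,
    }


line = ("[52270.160194] ata1.00: cmd 61/28:e0:60:7a:54/00:00:01:00:00/40 "
        "tag 28 ncq dma 20480 out")
print(decode_ncq_cmd(line))
```

The decoded tag and byte count agree with the "tag 28 ncq dma 20480" that the kernel prints on the same line, so these are ordinary small writes (mostly 4 KiB) and nothing exotic.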
Does anyone have an idea what the problem is?
I’m trying out a Tumbleweed installation on one of them, but it will take a day or two to tell whether that improves the situation.