Compulab fitlet2: disk problem

I have two Compulab Fitlet2. They are small, semi-rugged PCs with an Atom CPU.

They both develop disk problems after some time (as little as 40 minutes, as long as 15 hours), at which point they end up with this in the kernel log:

[ 1868.283972] ata1.00: exception Emask 0x0 SAct 0x300 SErr 0x0 action 0x6 frozen
[ 1868.283977] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1868.283986] ata1.00: cmd 61/58:40:78:e9:53/00:00:01:00:00/40 tag 8 ncq dma 45056 out
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1868.283988] ata1.00: status: { DRDY }
[ 1868.283991] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1868.283998] ata1.00: cmd 61/08:48:40:d1:23/00:00:00:00:00/40 tag 9 ncq dma 4096 out
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1868.284000] ata1.00: status: { DRDY }
[ 1868.284006] ata1: hard resetting link
[ 1878.299458] ata1: softreset failed (device not ready)
[ 1878.299465] ata1: hard resetting link

and from there it rapidly goes wrong:

[ 1928.562698] EXT4-fs (sda2): Remounting filesystem read-only

This happens on both boxes.
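
For anyone trying to read the ata1.00 lines above, here is a rough Python sketch that decodes the SAct mask and the "cmd" bytes. The field layout is my reading of how libata prints its error report (an assumption, it is not stated anywhere in this thread), and the function names are made up for illustration.

```python
import re

# Assumed layout of libata's "cmd" field (not confirmed in this thread):
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def sact_tags(sact):
    """Return the NCQ tags whose bits are set in the SAct mask."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


def decode_ncq_cmd(line, sector_size=512):
    """Decode command, tag, length and LBA from a libata 'cmd' log line."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no 'cmd' field found")
    command = int(m.group(1), 16)
    feat, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hfeat, _hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in m.group(3).split(":"))
    # For FPDMA (NCQ) commands the sector count travels in the feature
    # registers, and the tag sits in bits 7:3 of the sector-count register.
    sectors = (hfeat << 8) | feat
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "command": command,
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }
```

With the first log above, SAct 0x300 decodes to tags 8 and 9, which are exactly the two commands reported as timed out, and the first "cmd" line decodes to an 88-sector (45056-byte) write, matching the "dma 45056 out" that the kernel prints. So the log is internally consistent: every outstanding queued write timed out, rather than one specific sector failing.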
I have searched the internet for possible causes:

  • Bad power supply: since it happens on both boxes that seems unlikely

  • Bad disk: since it happens on both boxes that seems unlikely

  • Similar errors can be due to bad Marvell firmware, but these boxes don’t use that.

  • Heavy load and NCQ bug: these machines were idle (left overnight)

    “smartctl -a” shows nothing interesting either (output below).
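
On the NCQ point: one way to see whether NCQ is actually in play is to read the queue depth that the kernel exposes in sysfs. A minimal Python sketch, assuming the disk is sda and the usual /sys/block/<disk>/device/queue_depth file is present (the function names and the sysfs_root parameter are my own, the latter only there to make it testable):

```python
from pathlib import Path


def ncq_queue_depth(disk="sda", sysfs_root="/sys"):
    """Read the queue depth the kernel currently uses for the disk."""
    path = Path(sysfs_root) / "block" / disk / "device" / "queue_depth"
    return int(path.read_text().strip())


def ncq_in_use(disk="sda", sysfs_root="/sys"):
    """True if more than one command may be queued at a time."""
    # A depth of 1 means commands are issued one at a time (no NCQ queuing).
    return ncq_queue_depth(disk, sysfs_root) > 1


if __name__ == "__main__":
    print("queue depth:", ncq_queue_depth())
    print("NCQ in use:", ncq_in_use())
```

If this reports a depth above 1, writing 1 to that same file as root, or booting with libata.force=noncq on the kernel command line, turns NCQ off for a test run. Whether that makes the timeouts go away on these boxes is untested here.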

The Linux Mint that Compulab usually installs allegedly doesn’t have these problems.

kernel:

Linux linux-gj0i 4.12.14-lp150.12.16-default #1 SMP Tue Aug 14 17:51:27 UTC 2018 (28574e6) x86_64 x86_64 x86_64 GNU/Linux

lspci:

linux-gj0i:~ # lspci
00:00.0 Host bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Host Bridge (rev 0b)
00:02.0 VGA compatible controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Integrated Graphics Controller (rev 0b)
00:0e.0 Audio device: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Audio Cluster (rev 0b)
00:0f.0 Communication controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Trusted Execution Engine (rev 0b)
00:12.0 SATA controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SATA AHCI Controller (rev 0b)
00:13.0 PCI bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #1 (rev fb)
00:13.1 PCI bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #2 (rev fb)
00:13.2 PCI bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #3 (rev fb)
00:15.0 USB controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series USB xHCI (rev 0b)
00:18.0 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #1 (rev 0b)
00:18.1 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #2 (rev 0b)
00:18.2 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #3 (rev 0b)
00:18.3 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series HSUART Controller #4 (rev 0b)
00:19.0 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SPI Controller #1 (rev 0b)
00:19.1 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SPI Controller #2 (rev 0b)
00:19.2 Signal processing controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SPI Controller #3 (rev 0b)
00:1b.0 SD Host controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SDXC/MMC Host Controller (rev 0b)
00:1c.0 SD Host controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series eMMC Controller (rev 0b)
00:1e.0 SD Host controller: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SDIO Controller (rev 0b)
00:1f.0 ISA bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series Low Pin Count Interface (rev 0b)
00:1f.1 SMBus: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series SMBus Controller (rev 0b)
01:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

smartctl:


linux-gj0i:~ # smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp150.11-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     NT-128
Serial Number:    087110270002
LU WWN Device Id: 0 000000 000000000
Firmware Version: Q0407A
User Capacity:    126,701,535,232 bytes [126 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Aug 17 12:27:52 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   050    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0002   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0000   100   100   050    Old_age   Offline      -       0
 12 Power_Cycle_Count       0x0000   100   100   050    Old_age   Offline      -       4
160 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       0
161 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       61
162 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       1
163 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       7
164 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       31
165 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       1
166 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       0
167 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       0
168 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       3000
169 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       100
192 Power-Off_Retract_Count 0x0000   100   100   050    Old_age   Offline      -       4
194 Temperature_Celsius     0x0000   100   100   050    Old_age   Offline      -       39
195 Hardware_ECC_Recovered  0x0000   100   100   050    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   050    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   050    Old_age   Offline      -       0
241 Total_LBAs_Written      0x0000   100   100   050    Old_age   Offline      -       148
242 Total_LBAs_Read         0x0000   100   100   050    Old_age   Offline      -       19
245 Unknown_Attribute       0x0000   100   100   050    Old_age   Offline      -       155

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
Invalid Error Log index = 0x20 (T13/1321D rev 1c Section 8.41.6.8.2.2 gives valid range from 1 to 5)

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

Kernel log / dmesg: (truncated due to post size limit)


linux-gj0i:~ # dmesg
...
[   12.692225] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   12.909416] NET: Registered protocol family 17
[ 1543.174153] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[52270.160098] ata1.00: exception Emask 0x0 SAct 0x1fe00000 SErr 0x0 action 0x6 frozen
[52270.160104] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160112] ata1.00: cmd 61/08:a8:00:a8:0f/00:00:00:00:00/40 tag 21 ncq dma 4096 out
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160114] ata1.00: status: { DRDY }
[52270.160117] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160140] ata1.00: cmd 61/08:b0:80:a8:8f/00:00:00:00:00/40 tag 22 ncq dma 4096 out
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160141] ata1.00: status: { DRDY }
[52270.160143] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160149] ata1.00: cmd 61/08:b8:f8:aa:8f/00:00:00:00:00/40 tag 23 ncq dma 4096 out
                        res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160150] ata1.00: status: { DRDY }
[52270.160152] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160158] ata1.00: cmd 61/08:c0:18:ac:8f/00:00:00:00:00/40 tag 24 ncq dma 4096 out
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160159] ata1.00: status: { DRDY }
[52270.160161] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160167] ata1.00: cmd 61/08:c8:48:ac:8f/00:00:00:00:00/40 tag 25 ncq dma 4096 out
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160169] ata1.00: status: { DRDY }
[52270.160170] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160176] ata1.00: cmd 61/08:d0:70:ac:8f/00:00:00:00:00/40 tag 26 ncq dma 4096 out
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160178] ata1.00: status: { DRDY }
[52270.160179] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160185] ata1.00: cmd 61/08:d8:38:ac:90/00:00:00:00:00/40 tag 27 ncq dma 4096 out
                        res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160187] ata1.00: status: { DRDY }
[52270.160188] ata1.00: failed command: WRITE FPDMA QUEUED
[52270.160194] ata1.00: cmd 61/28:e0:60:7a:54/00:00:01:00:00/40 tag 28 ncq dma 20480 out
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[52270.160196] ata1.00: status: { DRDY }
[52270.160201] ata1: hard resetting link
[52280.175583] ata1: softreset failed (device not ready)
[52280.175590] ata1: hard resetting link
[52290.191209] ata1: softreset failed (device not ready)

Does anyone have an idea of what the problem is?

I’m trying out a Tumbleweed installation on one of them, but it takes a day or two to tell whether that improves the situation.

Has there been any resolution to this? I’m having exactly this problem on Tumbleweed. I’m currently running Linux Mint without problems, but I would much rather use Tumbleweed or Leap.

Our solution was to take the pre-installed SSD, smash it with a hammer, and install a name-brand SSD (Transcend/Intel/…) instead. It works fine with that.

I did track down who made the original SSD, and I seem to recall it was a company not known for making SSDs. So I guess Compulab got a good deal there, and the drive worked with the Mint/Ubuntu kernel of that time.

Transcend is a rather small company for SSDs.
SSDs from unknown brands that use Phison controllers are often made by Phison itself.
OTOH, Samsung SATA SSDs have a lot of problems with different hardware.