system unresponsive, flood of error messages

stromwirbel · January 24, 2017, 1:57pm

hi,
I’m currently trying to move from Ubuntu/Unity to Leap/KDE. However, I’m facing some serious stability problems with Leap.

The setup:

Leap 42.2, fully updated
/ on an SSD (sda2, btrfs) partition, ~27GB, /home on a rotating disk (sdb4, xfs), ~380GB
(dual boot with ubuntu if that’s of concern)

Problem:
Several times now I’ve had the system become (almost) completely unresponsive after some action from my side - last time this happened when I was performing a search from within kmail.
The desktop then freezes: no mouse cursor and no response to the keyboard, except that Ctrl-Alt-F1 opens a console, which is flooded with error messages like
blk_update_request: I/O error, dev sda, sector 63633368
BTRFS error (device sda2): bdev /dev/sda2 errs: wr 115, rd 46xxxxxx, flush 0; corrupt 0, gen 0
systemd-journald[475]: Failed to write entry (x items, y bytes), ignoring: Read-only file system
There is no chance to enter any command. The only way I’ve found to return to normal operation is a power cycle, after which everything seems fine until the issue comes up again.
Any ideas what I could try before a completely new install (maybe with ext4 this time)? Is there a way to check the file system to some degree while mounted?

Thanks for your help!
Karsten

stromwirbel · January 24, 2017, 2:50pm

One more piece of information: when checked from the Ubuntu installation, the btrfs tools do not find any issues on the Leap root partition.

karsten@xyz:~$ sudo btrfs check /dev/sda2
Checking filesystem on /dev/sda2
UUID: <....uuid of the partition....>
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 1090965199 bytes used err is 0
total csum bytes: 5553692
total tree bytes: 221282304
total fs tree bytes: 202833920
total extent tree bytes: 11075584
btree space waste bytes: 35715411
file data blocks allocated: 19602108416
 referenced 7185379328
Btrfs v3.12
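
The check above ran offline, from Ubuntu. For a look at the filesystem while it is mounted, the usual btrfs tools are a scrub and the per-device error counters, which as far as I can tell are the same counters as in the `errs: wr ..., rd ..., flush ...` kernel line. A minimal sketch, assuming / is the btrfs mount point; `sum_btrfs_errors` is a made-up helper name, not part of btrfs-progs:

```shell
# Sum all counters from `btrfs device stats` output read on stdin.
# Each line is expected to look like: [/dev/sda2].write_io_errs   115
# (helper name is hypothetical, for illustration only)
sum_btrfs_errors() {
    awk '{ total += $NF } END { print total + 0 }'
}

# Usage (needs root, filesystem mounted at /):
#   btrfs device stats / | sum_btrfs_errors   # non-zero means errors were counted
#   btrfs scrub start -B /                    # -B: stay in the foreground until done
#   btrfs scrub status /                      # summary of the last scrub
```

A scrub reads the data and metadata and verifies checksums, so it may surface read errors that an offline structure check does not touch.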

gogalthorp · January 24, 2017, 3:59pm

Have you run smartctl on the drive?

stromwirbel · January 24, 2017, 6:01pm

hi, thanks for helping!

Here’s the output of smartctl, not sure if it contains anything fishy:

smartctl 6.2 2013-11-07 r3856 [x86_64-linux-4.4.36-8-default] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     Corsair Force GT
Serial Number:    11378203000007000200
LU WWN Device Id: 0 000000 000000000
Firmware Version: 1.3
User Capacity:    60.022.480.896 bytes [60,0 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan 24 17:49:39 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         ( 2097) seconds.
Offline data collection
capabilities:              (0x7f) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Abort Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (  48) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x0021)    SCT Status supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   089   089   050    Pre-fail  Always       -       0/9659354
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   095   095   000    Old_age   Always       -       4450h+56m+55.480s
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1171
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       58
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       3
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   024   063   000    Old_age   Always       -       24 (Min/Max 13/63)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/9659354
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/9659354
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/9659354
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       195
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       225
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       225
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       885

SMART Error Log not supported

SMART Self-test Log not supported

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Also I’ve run one pass of memtest86+, but no issues there.

regards -
Karsten

gogalthorp · January 24, 2017, 7:38pm

Oops, forgot you were running off an SSD. Typically SMART is off and should be off to minimize Flash wear. So I don’t know :shame:

stromwirbel · January 24, 2017, 9:20pm

Well, I forced an extended SMART self-test:

sudo smartctl -t long /dev/sda

This returned no errors as far as I can see:

Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.

As I have no clue how to debug this (log files / journal not showing any traces of the strange behavior), I’m downloading a fresh image now, and will try my luck with a fresh install then…

tsu2 · January 25, 2017, 5:59am

Are you invoking trim/discard so that you are able to write to your SSD?

You should read the Arch Linux Wiki, note the sections that talk about trim/discard, and also note whether your particular SSD requires a firmware update
https://wiki.archlinux.org/index.php/Solid_State_Drives

From the BTRFS FAQ
https://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F

If you haven’t done this, then it’s likely your SSD is full of unerased traps…
You should not only insert the discard option in your fstab, you should also manually invoke fstrim because your SSD is likely in such bad shape.
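
To see whether the discard option is actually in effect, the live mount options can be checked. A minimal sketch, assuming `findmnt` from util-linux is installed; `has_discard` is a made-up helper name:

```shell
# Succeed if a comma-separated mount option list contains "discard",
# either bare or with a value (e.g. discard=async). "nodiscard" does not match.
# (helper name is hypothetical, for illustration only)
has_discard() {
    case ",$1," in
        *,discard,*|*,discard=*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   has_discard "$(findmnt -no OPTIONS /)" && echo "discard is set on /"
```

If this prints nothing, trimming depends entirely on manual or scheduled fstrim runs.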

TSU

stromwirbel · January 25, 2017, 5:44pm

hi TSU,

fstrim is actually configured as a weekly cron job by Leap’s installation routine, however that cron job may not have had too many chances to run yet…

When I start fstrim manually, the results are somewhat puzzling, given my limited understanding:

karsten@xyz:~> sudo hdparm -I /dev/sda | grep TRIM
           *    Data Set Management TRIM supported (limit 1 block)
           *    Deterministic read data after TRIM

karsten@xyz:~> sudo fstrim -v /
/: 21 GiB (22490615808 bytes) trimmed
karsten@xyz:~> sudo fstrim -v /
/: 20,9 GiB (22461239296 bytes) trimmed
karsten@xyz:~> time sudo fstrim -v /
/: 20,9 GiB (22462488576 bytes) trimmed

real    0m52.948s
..

I would expect that the second or third run of fstrim would find much less data to be trimmed (if not zero). Does this indicate that TRIM does not work?
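
To compare the runs, the byte counts can be pulled out of the `fstrim -v` lines. A minimal sketch; `trimmed_bytes` is a made-up helper name. As far as I understand, btrfs does not remember which free extents were already discarded and hands all free space to the drive again on every run, so near-identical totals would not by themselves mean TRIM is failing (that is an assumption on my part, not something the output proves).

```shell
# Extract the byte count from one line of `fstrim -v` output, e.g.
#   "/: 21 GiB (22490615808 bytes) trimmed"  ->  22490615808
# (helper name is hypothetical, for illustration only)
trimmed_bytes() {
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\) bytes) trimmed.*/\1/p'
}

# Difference between the first two runs shown above, in bytes:
first=$(trimmed_bytes '/: 21 GiB (22490615808 bytes) trimmed')
second=$(trimmed_bytes '/: 20,9 GiB (22461239296 bytes) trimmed')
echo "$(( first - second ))"
```

The two runs differ by about 28 MiB out of roughly 21 GiB, which looks like ordinary changes in free space between the runs.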

(Regarding discard as a mount option, I have not added this yet, since I had read that that’s discouraged in case of BTRFS volumes. Should I still set it up?)
(Also, I have not been successful upgrading my SSD’s firmware, as Corsair only provides Windows versions of their tools >:(, and another vendor’s Linux tool does not recognize the drive in contrast to some reports on Corsair’s forums)

regards -
Karsten

tsu2 · January 25, 2017, 7:26pm

Try running fstrim without specifying the partition

fstrim -a

You’d then be looking for a result
0 - All succeeded
1 - Failure
32 - Complete Failure
64 - Partial success and some failed
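
The codes above can be turned into a readable message right after the run. A minimal sketch; `describe_fstrim_status` is a made-up helper name and the wording simply follows the list:

```shell
# Map an fstrim exit status to a short description.
# (helper name is hypothetical, for illustration only)
describe_fstrim_status() {
    case "$1" in
        0)  echo "all succeeded" ;;
        1)  echo "failure" ;;
        32) echo "complete failure" ;;
        64) echo "partial success, some failed" ;;
        *)  echo "unexpected status $1" ;;
    esac
}

# Usage (needs root):
#   fstrim -a; describe_fstrim_status "$?"
```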

I’d also consider rebooting and then running fstrim again; if any files on your SSD change during the reboot, you should be able to erase those traps as well.

I’d also question setting trim/discard to run weekly, but it’s a YMMV thing that depends on how your system is used.
Doing something major like replacing one distro with another, in particular, is a major anomalous disk write event.
And of course your overall SSD capacity is a factor. I think you’ve suggested in your posts it’s only 32GB, so major disk write events like a new install would be enormous.

As for whether discard/trim is implemented already…
Maybe, but it’s worth verifying. When I looked at this more deeply many years ago, I remember there being something like 3 highly recommended methods to implement it, so your install may be doing something I’m not personally aware of yet.

TSU

stromwirbel · January 25, 2017, 9:35pm

Ok, did that, including reboot & re-fstrim. Return code was 0 in both runs. The output of fstrim -v still puzzles me though…

sudo fstrim -av
root's password:
/var/opt: 21 GiB (22499500032 bytes) trimmed
/var/log: 20,9 GiB (22470361088 bytes) trimmed
/var/lib/mysql: 20,9 GiB (22470770688 bytes) trimmed
/tmp: 21 GiB (22471081984 bytes) trimmed
/var/lib/machines: 21 GiB (22471376896 bytes) trimmed
/usr/local: 21 GiB (22471901184 bytes) trimmed
/boot/grub2/i386-pc: 21 GiB (22472163328 bytes) trimmed
/var/lib/mailman: 21 GiB (22472572928 bytes) trimmed
/srv: 21 GiB (22482087936 bytes) trimmed
/var/spool: 21 GiB (22482366464 bytes) trimmed
/var/tmp: 21 GiB (22482759680 bytes) trimmed
/var/crash: 21 GiB (22483054592 bytes) trimmed
/var/lib/pgsql: 21 GiB (22483316736 bytes) trimmed
/opt: 21 GiB (22483578880 bytes) trimmed
/var/cache: 21 GiB (22483906560 bytes) trimmed
/.snapshots: 21 GiB (22484598784 bytes) trimmed
/var/lib/libvirt/images: 21 GiB (22484598784 bytes) trimmed
/var/lib/named: 21 GiB (22484910080 bytes) trimmed
/boot/grub2/x86_64-efi: 21 GiB (22485209088 bytes) trimmed
/var/lib/mariadb: 20,9 GiB (22389501952 bytes) trimmed
/: 20,9 GiB (22390910976 bytes) trimmed

Now, with the drive hopefully well trimmed, would you expect that the un-trimmed state of the drive could have been the reason for the original problem? Since I reported it, the original issue has not reappeared, but I still hesitate to set up my whole environment as long as the issue may come back at any moment…

thanks-
Karsten