I’m trying to move from Ubuntu/Unity to Leap/KDE. However, I’m currently facing some serious stability problems with Leap.
- Leap 42.2, fully updated
- / on an SSD partition (sda2, btrfs), ~27GB, /home on a rotating disk (sdb4, xfs), ~380GB
- (dual boot with Ubuntu, if that’s of concern)
Several times now the system has become (almost) completely unresponsive after some action on my side; the last time, this happened while I was performing a search from within KMail.
Then the desktop is frozen: no mouse cursor and no response to the keyboard, except that Ctrl-Alt-F1 opens a console, which however is flooded with error messages like
blk_update_request: I/O error, dev sda, sector 63633368
BTRFS error (device sda2): bdev /dev/sda2 errs: wr 115, rd 46xxxxxx, flush 0; corrupt 0, gen 0
systemd-journald: Failed to write entry (x items, y bytes), ignoring: Read-only file system
No chance to enter any command. The only way I’ve found to return to normal operation is a power cycle, after which everything seems fine until the issue comes up again.
Any ideas what I could try before a complete new install (maybe with ext4 this time)? Is there a way to check the file system to some degree while mounted?
Thanks for your help!
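The BTRFS line quoted above carries per-device error counters (write, read, flush, corruption, generation). As far as I know, the same counters can be read on a mounted filesystem with `btrfs device stats`, and `btrfs scrub` also runs on a mounted filesystem. Below is a minimal Python sketch for pulling the counters out of a saved console or kernel log line, so they can be compared between freezes. The function name is made up for illustration, and the read counter in the sample is a placeholder because the real value was redacted in the post.

```python
import re

# Pattern modelled on the message quoted in the post; other kernel versions
# may word the line differently, so treat this as an assumption.
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<fs>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+); "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return the error counters from one log line, or None if it does not match."""
    m = BTRFS_ERRS.search(line)
    if m is None:
        return None
    counters = {k: int(m.group(k)) for k in ("wr", "rd", "flush", "corrupt", "gen")}
    counters["bdev"] = m.group("bdev")
    return counters


# Sample line; the read counter 46000000 is a placeholder, not a real value.
sample = ("BTRFS error (device sda2): bdev /dev/sda2 errs: "
          "wr 115, rd 46000000, flush 0; corrupt 0, gen 0")
print(parse_btrfs_errs(sample))
```

Non-zero write and read counters together with zero corruption counters would point more towards the device or its connection failing to complete I/O than towards damaged filesystem structures, but that is only a rough reading.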
One more piece of information: when checked from the Ubuntu installation, the btrfs tools do not find any issues on the Leap root partition:
karsten@xyz:~$ sudo btrfs check /dev/sda2
Checking filesystem on /dev/sda2
UUID: <....uuid of the partition....>
checking free space cache
checking fs roots
checking root refs
found 1090965199 bytes used err is 0
total csum bytes: 5553692
total tree bytes: 221282304
total fs tree bytes: 202833920
total extent tree bytes: 11075584
btree space waste bytes: 35715411
file data blocks allocated: 19602108416
Have you run smartctl on the drive?
Hi, thanks for helping!
Here’s the output of smartctl, not sure if it contains anything fishy:
smartctl 6.2 2013-11-07 r3856 [x86_64-linux-4.4.36-8-default] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: Corsair Force GT
Serial Number: 11378203000007000200
LU WWN Device Id: 0 000000 000000000
Firmware Version: 1.3
User Capacity: 60.022.480.896 bytes [60,0 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 24 17:49:39 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
Total time to complete Offline
data collection: ( 2097) seconds.
Offline data collection
capabilities: (0x7f) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Abort Offline collection upon new
Offline surface scan supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0021) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 089 089 050 Pre-fail Always - 0/9659354
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
9 Power_On_Hours_and_Msec 0x0032 095 095 000 Old_age Always - 4450h+56m+55.480s
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1171
171 Program_Fail_Count 0x0032 000 000 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 000 000 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline - 58
177 Wear_Range_Delta 0x0000 000 000 000 Old_age Offline - 3
181 Program_Fail_Count 0x0032 000 000 000 Old_age Always - 0
182 Erase_Fail_Count 0x0032 000 000 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 024 063 000 Old_age Always - 24 (Min/Max 13/63)
195 ECC_Uncorr_Error_Count 0x001c 120 120 000 Old_age Offline - 0/9659354
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 0
201 Unc_Soft_Read_Err_Rate 0x001c 120 120 000 Old_age Offline - 0/9659354
204 Soft_ECC_Correct_Rate 0x001c 120 120 000 Old_age Offline - 0/9659354
230 Life_Curve_Status 0x0013 100 100 000 Pre-fail Always - 100
231 SSD_Life_Left 0x0013 100 100 010 Pre-fail Always - 0
233 SandForce_Internal 0x0000 000 000 000 Old_age Offline - 195
234 SandForce_Internal 0x0032 000 000 000 Old_age Always - 225
241 Lifetime_Writes_GiB 0x0032 000 000 000 Old_age Always - 225
242 Lifetime_Reads_GiB 0x0032 000 000 000 Old_age Always - 885
SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I’ve also run one pass of memtest86+, which found no issues.
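To check whether anything in the attribute table above is "fishy", the rows can be screened mechanically. smartctl treats an attribute as failed when its normalized VALUE is less than or equal to its THRESH. The sketch below parses rows in the format pasted above and lists any attribute at or below a non-zero threshold. The function names are made up for illustration, and only three rows from the post are used as sample input.

```python
def parse_smart_attributes(text):
    """Parse 'smartctl -A' style rows into dictionaries, skipping other lines."""
    rows = []
    for line in text.splitlines():
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        parts = line.split(None, 9)
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "type": parts[6],
            "when_failed": parts[8],
            "raw": parts[9].strip(),
        })
    return rows


def failing(rows):
    """Attributes whose normalized value is at or below a non-zero threshold."""
    return [r for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


# Three rows copied from the output above.
sample_table = """\
1 Raw_Read_Error_Rate 0x000f 089 089 050 Pre-fail Always - 0/9659354
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
194 Temperature_Celsius 0x0022 024 063 000 Old_age Always - 24 (Min/Max 13/63)
"""
sample_rows = parse_smart_attributes(sample_table)
print([r["name"] for r in failing(sample_rows)])
```

An empty result only means that no attribute has crossed its vendor threshold; it does not rule out intermittent I/O errors such as the ones seen on the console.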
Oops, I forgot you were running off an SSD. Typically SMART is off and should be off to minimize flash wear, so I don’t know.
Well, I forced an extended SMART self-test:
sudo smartctl -t long /dev/sda
This returned no errors as far as I can see:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
As I have no clue how to debug this (log files / journal not showing any traces of the strange behavior), I’m downloading a fresh image now, and will try my luck with a fresh install then…
Are you invoking trim/discard so that you are able to write to your SSD?
You should read the Arch Linux Wiki, note the sections that talk about trim/discard, and also note whether your particular SSD requires a firmware update.
From the BTRFS FAQ
If you haven’t done this, then it’s likely your SSD is full of unerased traps…
You should not only insert the discard option in your fstab, you should also manually invoke fstrim because your SSD is likely in such bad shape.
fstrim is actually configured as a weekly cron job by Leap’s installation routine; however, that cron job may not have had many chances to run yet…
When I start fstrim manually, the results are somewhat puzzling given my limited understanding:
karsten@xyz:~> sudo hdparm -I /dev/sda | grep TRIM
* Data Set Management TRIM supported (limit 1 block)
* Deterministic read data after TRIM
karsten@xyz:~> sudo fstrim -v /
/: 21 GiB (22490615808 bytes) trimmed
karsten@xyz:~> sudo fstrim -v /
/: 20,9 GiB (22461239296 bytes) trimmed
karsten@xyz:~> time sudo fstrim -v /
/: 20,9 GiB (22462488576 bytes) trimmed
I would expect that the second or third run of fstrim would find much less data to be trimmed (if not zero). Does this indicate that TRIM does not work?
(Regarding discard as a mount option, I have not added this yet, since I had read that it is discouraged for BTRFS volumes. Should I still set it up?)
(Also, I have not been successful in upgrading my SSD’s firmware, as Corsair only provides Windows versions of their tools >:(, and another vendor’s Linux tool does not recognize the drive, in contrast to some reports on Corsair’s forums.)
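On the question of the repeated figures: as far as I understand the fstrim man page, the number printed with -v is the amount of free space passed down to the device for potential discard, and repeated calls can keep submitting the same ranges, so similar totals on back-to-back runs do not by themselves show that TRIM is broken. To compare runs precisely, the byte counts can be extracted from the output. This is a minimal sketch; the function name is made up for illustration, and the pattern assumes mount points without spaces.

```python
import re

# Matches lines such as "/: 20,9 GiB (22461239296 bytes) trimmed".
# The exact byte count in parentheses is used, so the locale-dependent
# "20,9 GiB" part does not need to be parsed.
FSTRIM_LINE = re.compile(r"^(?P<mount>\S+): .*\((?P<bytes>\d+) bytes\) trimmed$")


def parse_fstrim(output):
    """Return a list of (mount point, bytes reported) from 'fstrim -v' output."""
    results = []
    for line in output.splitlines():
        m = FSTRIM_LINE.match(line.strip())
        if m:
            results.append((m.group("mount"), int(m.group("bytes"))))
    return results


# The three runs from the post above.
runs = """\
/: 21 GiB (22490615808 bytes) trimmed
/: 20,9 GiB (22461239296 bytes) trimmed
/: 20,9 GiB (22462488576 bytes) trimmed
"""
reported = [n for _, n in parse_fstrim(runs)]
# Change in reported bytes from one run to the next.
deltas = [b - a for a, b in zip(reported, reported[1:])]
print(deltas)
```

The run-to-run differences are a few tens of megabytes at most, which would be consistent with the free space on / changing slightly between runs.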
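Try running fstrim without specifying the partition (i.e. with the -a option, so that it trims all mounted filesystems that support it).
You’d then be looking for one of these return codes: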
0 - All succeeded
1 - Failure
32 - Complete Failure
64 - Partial success and some failed
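The return codes listed above can be checked from a small script instead of by eye. This is only a sketch: the function names are made up for illustration, the descriptions are paraphrased from the list above, and `trim_all()` needs root privileges and a system with fstrim installed, so it is defined here but not called.

```python
import subprocess

# Meanings paraphrased from the list above.
FSTRIM_EXIT_CODES = {
    0: "all succeeded",
    1: "failure",
    32: "complete failure (all filesystems failed)",
    64: "partial success (some succeeded, some failed)",
}


def describe_fstrim_exit(code):
    """Translate an fstrim exit status into a short description."""
    return FSTRIM_EXIT_CODES.get(code, "unexpected exit code %d" % code)


def trim_all():
    """Run 'fstrim --all --verbose' and return (exit code, description)."""
    proc = subprocess.run(["fstrim", "--all", "--verbose"])
    return proc.returncode, describe_fstrim_exit(proc.returncode)


print(describe_fstrim_exit(0))
```

From an interactive shell, `echo $?` right after the fstrim command shows the same exit status.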
I’d also consider rebooting afterwards and running fstrim again; if any files on your SSD change during the reboot, you should be able to erase those traps as well.
I’d also question setting trim/discard to run weekly, but it’s a YMMV thing dependent on how your system is used.
In particular, doing something major like replacing one distro with another would be a major anomalous disk write event.
And, of course, your overall SSD capacity is a factor; I think you’ve suggested in your posts it’s only 32GB, so major disk write events like a new install would be enormous.
As for whether discard/trim is implemented already…
Maybe, but it’s worth verifying. I only remember that many years ago, when I looked at this more deeply, there were something like 3 highly recommended methods to implement it, so your install may be doing something I’m not personally aware of yet.
OK, did that, including reboot & re-fstrim. Return code was 0 in both runs. The output of fstrim -v still puzzles me though…
sudo fstrim -av
/var/opt: 21 GiB (22499500032 bytes) trimmed
/var/log: 20,9 GiB (22470361088 bytes) trimmed
/var/lib/mysql: 20,9 GiB (22470770688 bytes) trimmed
/tmp: 21 GiB (22471081984 bytes) trimmed
/var/lib/machines: 21 GiB (22471376896 bytes) trimmed
/usr/local: 21 GiB (22471901184 bytes) trimmed
/boot/grub2/i386-pc: 21 GiB (22472163328 bytes) trimmed
/var/lib/mailman: 21 GiB (22472572928 bytes) trimmed
/srv: 21 GiB (22482087936 bytes) trimmed
/var/spool: 21 GiB (22482366464 bytes) trimmed
/var/tmp: 21 GiB (22482759680 bytes) trimmed
/var/crash: 21 GiB (22483054592 bytes) trimmed
/var/lib/pgsql: 21 GiB (22483316736 bytes) trimmed
/opt: 21 GiB (22483578880 bytes) trimmed
/var/cache: 21 GiB (22483906560 bytes) trimmed
/.snapshots: 21 GiB (22484598784 bytes) trimmed
/var/lib/libvirt/images: 21 GiB (22484598784 bytes) trimmed
/var/lib/named: 21 GiB (22484910080 bytes) trimmed
/boot/grub2/x86_64-efi: 21 GiB (22485209088 bytes) trimmed
/var/lib/mariadb: 20,9 GiB (22389501952 bytes) trimmed
/: 20,9 GiB (22390910976 bytes) trimmed
Now, with the drive hopefully well trimmed, would you expect that the un-trimmed state of the drive could have been the reason for the original problem? Since I reported it, the original issue has not reappeared, but I still hesitate to set up my whole environment as long as the issue may come back at any moment…