Hi,
I’m trying to move from Ubuntu/Unity to Leap/KDE. However, I’m facing some serious stability problems with Leap.
The setup:
Leap 42.2, fully updated
/ on an SSD partition (sda2, btrfs), ~27GB; /home on a rotating disk (sdb4, xfs), ~380GB
(dual boot with Ubuntu, if that’s of concern)
Problem:
Several times now I’ve had the system become (almost) completely unresponsive after some action from my side - last time this happened when I was performing a search from within kmail.
Then the desktop is frozen: no mouse cursor and no response to the keyboard, except that Ctrl-Alt-F1 opens a console, which however is flooded with error messages like
blk_update_request: I/O error, dev sda, sector 63633368
BTRFS error (device sda2): bdev /dev/sda2 errs: wr 115, rd 46xxxxxx, flush 0; corrupt 0, gen 0
systemd-journald[475]: Failed to write entry (x items, y bytes), ignoring: Read-only file system
No chance to enter any command. The only way I’ve found to return to normal operation is a power cycle, after which everything seems fine until the issue comes up again.
Any ideas what I could try before a complete new install (maybe with ext4 this time)? Is there a way to check the file system to some degree while mounted?
One more piece of information: when checked from the Ubuntu installation, the btrfs tools do not find any issues on the Leap root partition
karsten@xyz:~$ sudo btrfs check /dev/sda2
Checking filesystem on /dev/sda2
UUID: <....uuid of the partition....>
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 1090965199 bytes used err is 0
total csum bytes: 5553692
total tree bytes: 221282304
total fs tree bytes: 202833920
total extent tree bytes: 11075584
btree space waste bytes: 35715411
file data blocks allocated: 19602108416
referenced 7185379328
Btrfs v3.12
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
As I have no clue how to debug this (log files / journal not showing any traces of the strange behavior), I’m downloading a fresh image now, and will try my luck with a fresh install then…
If you haven’t done this, then it’s likely your SSD is full of unerased traps…
You should not only insert the discard option in your fstab, you should also manually invoke fstrim because your SSD is likely in such bad shape.
I would expect that the second or third run of fstrim would find much less data to be trimmed (if not zero). Does this indicate that TRIM does not work?
(Regarding discard as a mount option, I have not added this yet, since I had read that that’s discouraged in case of BTRFS volumes. Should I still set it up?)
(Also, I have not been successful upgrading my SSD’s firmware, as Corsair only provides Windows versions of their tools >:(, and another vendor’s Linux tool does not recognize the drive in contrast to some reports on Corsair’s forums)
Try running fstrim without specifying the partition
fstrim -a
You’d then be looking for one of these exit codes:
0 - All succeeded
1 - Failure
32 - Complete Failure
64 - Partial success and some failed
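If it helps, here is a rough sketch of how you could read that status in a script. The function name describe_fstrim_status is just something I made up for illustration; the codes are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: translate an fstrim exit status into
# the meanings listed above.
describe_fstrim_status() {
    case "$1" in
        0)  echo "all succeeded" ;;
        1)  echo "failure" ;;
        32) echo "complete failure" ;;
        64) echo "partial success, some failed" ;;
        *)  echo "unexpected status $1" ;;
    esac
}

# Usage (needs root; -v also prints how much was trimmed per mount point):
#   fstrim -a -v; describe_fstrim_status "$?"
```

Run it as root right after fstrim, so that "$?" still holds fstrim's own status.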
I’d also consider rebooting and running fstrim again; if any files on your SSD change during the reboot, you should be able to erase those traps as well.
I’d also question setting trim/discard to run weekly, but it’s a YMMV thing dependent on how your system is used.
In particular, replacing one distro with another is a major anomalous disk write event.
And of course your overall SSD capacity is a factor. I think you’ve suggested in your posts that it’s only 32GB, so a major disk write event like a new install would be enormous relative to the drive.
As for whether discard/trim is implemented already…
Maybe, but it’s worth verifying. I only remember that many years ago, when I looked at this more deeply, there were something like 3 highly recommended methods to implement it, so your install may be doing something I’m not personally aware of yet.
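A few commands that might help with the verifying, as a sketch only. I'm assuming lsblk, findmnt and systemd are available on your install, and whether a unit named fstrim.timer exists there is also an assumption on my part. The helper discard_supported is a made-up name; it just interprets the DISC-GRAN and DISC-MAX columns that lsblk prints.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the DISC-GRAN and DISC-MAX values
# reported by `lsblk --discard` are both non-zero, i.e. the device
# advertises discard (TRIM) support.
discard_supported() {
    case "$1" in 0|0B) return 1 ;; esac
    case "$2" in 0|0B) return 1 ;; esac
    return 0
}

# Usage:
#   lsblk --discard /dev/sda        # read DISC-GRAN and DISC-MAX for the SSD
#   discard_supported 512B 2G && echo "device advertises TRIM"
#   findmnt -O discard              # filesystems mounted with the discard option
#   systemctl status fstrim.timer   # periodic trim, if that unit exists
```

The 512B and 2G values in the usage lines are placeholders; put in whatever lsblk reports for your drive.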
Now, with the drive hopefully well trimmed, would you expect that the un-trimmed state of the drive could have been the reason for the original problem? Since I reported it, the original issue has not reappeared, but I still hesitate to set up my whole environment as long as the issue may come back at any moment…