Thread: system unresponsive, flood of error messages

  1. #1
    Join Date
    Jan 2017
    Location
    Nuremberg, Germany
    Posts
    12

    hi,
    I'm trying to move from Ubuntu/Unity to Leap/KDE, but I'm running into serious stability problems with Leap.

    The setup:
    - Leap 42.2, fully updated
    - / on an SSD (sda2, btrfs) partition, ~27GB; /home on a rotating disk (sdb4, xfs), ~380GB
    - (dual boot with Ubuntu, in case that matters)

    Problem:
    Several times now the system has become (almost) completely unresponsive after some action on my part; the last time, it happened while I was running a search from within KMail.
    The desktop then freezes: no mouse cursor and no response to the keyboard, except that Ctrl-Alt-F1 opens a console. That console, however, is flooded with error messages like:
    blk_update_request: I/O error, dev sda, sector 63633368
    BTRFS error (device sda2): bdev /dev/sda2 errs: wr 115, rd 46xxxxxx, flush 0; corrupt 0, gen 0
    systemd-journald[475]: Failed to write entry (x items, y bytes), ignoring: Read-only file system
    There is no chance to enter any command. The only way I've found to return to normal operation is a power cycle, after which everything seems fine until the issue comes up again.
    Any ideas what I could try before a complete new install (maybe with ext4 this time)? Is there a way to check the file system to some degree while it is mounted?

    Thanks for your help!
    Karsten

  2. #2
    Join Date
    Jan 2017
    Location
    Nuremberg, Germany
    Posts
    12

    One more piece of information: when checked from the Ubuntu installation, the btrfs tools find no issues on the Leap root partition.

    Code:
    karsten@xyz:~$ sudo btrfs check /dev/sda2
    Checking filesystem on /dev/sda2
    UUID: <....uuid of the partition....>
    checking extents
    checking free space cache
    checking fs roots
    checking csums
    checking root refs
    found 1090965199 bytes used err is 0
    total csum bytes: 5553692
    total tree bytes: 221282304
    total fs tree bytes: 202833920
    total extent tree bytes: 11075584
    btree space waste bytes: 35715411
    file data blocks allocated: 19602108416
     referenced 7185379328
    Btrfs v3.12
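
    Note that btrfs check, as run above, inspects the unmounted device. For the earlier question about checking while mounted, btrfs also offers "btrfs scrub start -B <mountpoint>", which verifies checksums online, and "btrfs device stats <mountpoint>", which prints the per-device error counters that also appear in the "errs: wr ..., rd ..." kernel message. A minimal sketch for filtering the latter, assuming the usual two-column stats output; the helper name nonzero_btrfs_stats is made up for illustration:

    ```shell
    # Print only the btrfs error counters that are non-zero.
    # Reads `btrfs device stats` output on stdin, i.e. lines such as
    #   [/dev/sda2].write_io_errs   115
    nonzero_btrfs_stats() {
        # NF == 2 skips blank or unexpected lines; "+ 0" forces a numeric compare.
        awk 'NF == 2 && $2 + 0 != 0 { print $1 "=" $2 }'
    }
    ```

    Possible usage on the running system: sudo btrfs device stats / | nonzero_btrfs_stats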

  3. #3
    Join Date
    Nov 2009
    Location
    West Virginia Sector 13
    Posts
    15,703

    Have you run smartctl on the drive?

  4. #4
    Join Date
    Jan 2017
    Location
    Nuremberg, Germany
    Posts
    12

    hi, thanks for helping!

    Here's the output of smartctl, not sure if it contains anything fishy:

    Code:
    smartctl 6.2 2013-11-07 r3856 [x86_64-linux-4.4.36-8-default] (SUSE RPM)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Family:     SandForce Driven SSDs
    Device Model:     Corsair Force GT
    Serial Number:    11378203000007000200
    LU WWN Device Id: 0 000000 000000000
    Firmware Version: 1.3
    User Capacity:    60.022.480.896 bytes [60,0 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Tue Jan 24 17:49:39 2017 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x00)    Offline data collection activity
                        was never started.
                        Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever 
                        been run.
    Total time to complete Offline 
    data collection:         ( 2097) seconds.
    Offline data collection
    capabilities:              (0x7f) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Abort Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      (  48) minutes.
    Conveyance self-test routine
    recommended polling time:      (   2) minutes.
    SCT capabilities:            (0x0021)    SCT Status supported.
                        SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   089   089   050    Pre-fail  Always       -       0/9659354
      5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
      9 Power_On_Hours_and_Msec 0x0032   095   095   000    Old_age   Always       -       4450h+56m+55.480s
     12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1171
    171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
    174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       58
    177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       3
    181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
    182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0022   024   063   000    Old_age   Always       -       24 (Min/Max 13/63)
    195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/9659354
    196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
    201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/9659354
    204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/9659354
    230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
    231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
    233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       195
    234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       225
    241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       225
    242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       885
    
    SMART Error Log not supported
    
    SMART Self-test Log not supported
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    I've also run one pass of memtest86+, which found no issues either.
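
    For reading a long attribute table like the one above, it can help to print only the failure-related counters whose raw value is non-zero. A rough sketch, assuming the ten-column layout of smartctl -A shown above; the list of watched IDs (5, 171, 172, 181, 182, 187, 196) is only an assumption about which counters matter, not an official list, and smart_nonzero_raw is a made-up name:

    ```shell
    # Print ID, name and raw value of selected SMART attributes when the
    # raw value is non-zero. Reads `smartctl -A` output on stdin.
    # Watched IDs (an assumption): retired blocks, program/erase failures,
    # reported uncorrectable errors, reallocation events.
    smart_nonzero_raw() {
        awk '
            BEGIN {
                n = split("5 171 172 181 182 187 196", ids, " ")
                for (i = 1; i <= n; i++) watch[ids[i]] = 1
            }
            # Column 1 is the attribute ID, column 10 the raw value.
            ($1 in watch) && NF >= 10 && $10 + 0 != 0 { print $1, $2, $10 }
        '
    }
    ```

    Possible usage: sudo smartctl -A /dev/sda | smart_nonzero_raw. Fed the table above, this prints nothing, since all of those raw values are 0.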

    regards -
    Karsten

  5. #5
    Join Date
    Nov 2009
    Location
    West Virginia Sector 13
    Posts
    15,703

    Oops, I forgot you were running off an SSD. Typically SMART is off, and should be off, to minimize flash wear. So I don't know.

  6. #6
    Join Date
    Jan 2017
    Location
    Nuremberg, Germany
    Posts
    12

    Well, I forced an extended SMART self-test:
    Code:
    sudo smartctl -t long /dev/sda
    This returned no errors as far as I can see:
    Code:
    Offline data collection status:  (0x02) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Disabled.
    As I have no clue how to debug this (the log files and journal show no trace of the strange behavior), I'm downloading a fresh image now and will try my luck with a fresh install.

  7. #7
    Join Date
    Jun 2008
    Location
    San Diego, Ca, USA
    Posts
    11,133
    Blog Entries
    2

    Are you invoking trim/discard so that you are able to write to your SSD?

    You should read the Arch Linux Wiki, note the sections that talk about trim/discard, and also check whether your particular SSD requires a firmware update:
    https://wiki.archlinux.org/index.php/Solid_State_Drives

    From the BTRFS FAQ:
    https://btrfs.wiki.kernel.org/index....M.2Fdiscard.3F

    If you haven't done this, then it's likely your SSD is full of unerased blocks ("traps")...
    You should not only insert the discard option in your fstab, you should also invoke fstrim manually, because your SSD is likely in bad shape.

    TSU
    Beginner Wiki Quickstart - https://en.opensuse.org/User:Tsu2/Quickstart_Wiki
    Solved a problem recently? Create a wiki page for future personal reference!
    Learn something new?
    Attended a computing event?
    Post and Share!

  8. #8
    Join Date
    Jan 2017
    Location
    Nuremberg, Germany
    Posts
    12

    hi TSU,

    fstrim is actually configured as a weekly cron job by Leap's installation routine; however, that cron job may not have had many chances to run yet.

    When I start fstrim manually, the results are somewhat puzzling given my limited understanding:

    Code:
    karsten@xyz:~> sudo hdparm -I /dev/sda | grep TRIM
               *    Data Set Management TRIM supported (limit 1 block)
               *    Deterministic read data after TRIM

    Code:
    karsten@xyz:~> sudo fstrim -v /
    /: 21 GiB (22490615808 bytes) trimmed
    karsten@xyz:~> sudo fstrim -v /
    /: 20,9 GiB (22461239296 bytes) trimmed
    karsten@xyz:~> time sudo fstrim -v /
    /: 20,9 GiB (22462488576 bytes) trimmed
    
    real    0m52.948s
    ..
    I would expect that the second or third run of fstrim would find much less data to be trimmed (if not zero). Does this indicate that TRIM does not work?

    (Regarding discard as a mount option: I have not added it yet, since I had read that it is discouraged for BTRFS volumes. Should I still set it up?)
    (Also, I have not been successful in upgrading my SSD's firmware: Corsair only provides Windows versions of their tools, and another vendor's Linux tool does not recognize the drive, contrary to some reports on Corsair's forums.)
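
    A note on the repeated fstrim numbers above: identical counts do not by themselves show that TRIM is failing. The fstrim(8) man page in current util-linux describes the -v figure as the number of bytes passed down to the device for potential discard, and says repeated runs will report the same bytes again. To compare runs exactly rather than by the rounded GiB value, the byte counts can be pulled out of the output. A small sketch; fstrim_bytes is a made-up helper name:

    ```shell
    # Turn `fstrim -v` lines such as
    #   /: 20,9 GiB (22461239296 bytes) trimmed
    # into "<mountpoint> <bytes>" pairs. Reads fstrim output on stdin.
    fstrim_bytes() {
        # Group 1 is the mount point, group 2 the exact byte count;
        # the locale-dependent "20,9 GiB" part is ignored.
        sed -n 's/^\(.*\): .*(\([0-9][0-9]*\) bytes) trimmed.*$/\1 \2/p'
    }
    ```

    Possible usage: sudo fstrim -v / | fstrim_bytes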

    regards -
    Karsten

  9. #9
    Join Date
    Jun 2008
    Location
    San Diego, Ca, USA
    Posts
    11,133
    Blog Entries
    2

    Try running fstrim without specifying the partition
    Code:
    fstrim -a
    You'd then be looking at the exit status:
    0 - All succeeded
    1 - Failure
    32 - Complete Failure
    64 - Partial success and some failed
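
    In a script, those codes could be checked roughly like this. A small sketch; describe_fstrim_status is a hypothetical helper name, and the wording for each code simply follows the list above:

    ```shell
    # Translate an fstrim exit status into a short message.
    # $1 is the numeric status, e.g. the value of $? right after fstrim.
    describe_fstrim_status() {
        case "$1" in
            0)  echo "all succeeded" ;;
            1)  echo "failure" ;;
            32) echo "all failed" ;;
            64) echo "some succeeded, some failed" ;;
            *)  echo "unexpected status $1" ;;
        esac
    }
    ```

    Possible usage: sudo fstrim -a; describe_fstrim_status $?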

    I'd also consider rebooting and then running fstrim again; if any files on your SSD change during the reboot, you should be able to erase those traps as well.

    I'd also question setting trim/discard to run weekly, but it's a YMMV thing that depends on how your system is used.
    Doing something major like replacing one distro with another, in particular, is a major anomalous disk write event.
    And, of course, your overall SSD capacity is a factor. Your posts suggest the root partition is only about 27GB (on a 60GB drive), so a major disk write event like a new install would be enormous by comparison.

    As for whether discard/trim is implemented already...
    Maybe, but it's worth verifying. I only remember that many years ago, when I looked at this more deeply, there were something like 3 highly recommended methods to implement it, so your install may be doing something I'm not personally aware of yet.

    TSU

  10. #10
    Join Date
    Jan 2017
    Location
    Nuremberg, Germany
    Posts
    12

    OK, did that, including the reboot and a second fstrim. The return code was 0 in both runs. The output of fstrim -v still puzzles me, though:

    Code:
    sudo fstrim -av
    root's password:
    /var/opt: 21 GiB (22499500032 bytes) trimmed
    /var/log: 20,9 GiB (22470361088 bytes) trimmed
    /var/lib/mysql: 20,9 GiB (22470770688 bytes) trimmed
    /tmp: 21 GiB (22471081984 bytes) trimmed
    /var/lib/machines: 21 GiB (22471376896 bytes) trimmed
    /usr/local: 21 GiB (22471901184 bytes) trimmed
    /boot/grub2/i386-pc: 21 GiB (22472163328 bytes) trimmed
    /var/lib/mailman: 21 GiB (22472572928 bytes) trimmed
    /srv: 21 GiB (22482087936 bytes) trimmed
    /var/spool: 21 GiB (22482366464 bytes) trimmed
    /var/tmp: 21 GiB (22482759680 bytes) trimmed
    /var/crash: 21 GiB (22483054592 bytes) trimmed
    /var/lib/pgsql: 21 GiB (22483316736 bytes) trimmed
    /opt: 21 GiB (22483578880 bytes) trimmed
    /var/cache: 21 GiB (22483906560 bytes) trimmed
    /.snapshots: 21 GiB (22484598784 bytes) trimmed
    /var/lib/libvirt/images: 21 GiB (22484598784 bytes) trimmed
    /var/lib/named: 21 GiB (22484910080 bytes) trimmed
    /boot/grub2/x86_64-efi: 21 GiB (22485209088 bytes) trimmed
    /var/lib/mariadb: 20,9 GiB (22389501952 bytes) trimmed
    /: 20,9 GiB (22390910976 bytes) trimmed
    Now, with the drive hopefully well trimmed: would you expect that the untrimmed state of the drive could have been the reason for the original problem? Since I reported it, the original issue has not reappeared, but I still hesitate to set up my whole environment as long as the issue may come back at any moment.
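
    About the puzzling output: the mount points listed are most likely btrfs subvolumes of the same root filesystem (the default Leap layout creates a number of them), so each line would be reporting the same free space again rather than a separate 21 GiB. One way to check is to count btrfs mount points per underlying device. A sketch, assuming findmnt prints subvolume sources in the form device[/subvolume]; mounts_per_device is a made-up name:

    ```shell
    # Count mount points per underlying block device.
    # Reads "SOURCE TARGET" lines on stdin, e.g. from
    #   findmnt -rn -t btrfs -o SOURCE,TARGET
    mounts_per_device() {
        awk '
            # Strip a trailing "[/subvolume]" so all subvolumes of one
            # filesystem are counted under the same device name.
            { sub(/\[.*\]$/, "", $1); n[$1]++ }
            END { for (d in n) print d, n[d] }
        '
    }
    ```

    Possible usage: findmnt -rn -t btrfs -o SOURCE,TARGET | mounts_per_device. If every mount point from the fstrim -av list shows up under /dev/sda2, the repeated ~21 GiB figures all refer to the same filesystem.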

    thanks-
    Karsten
