Install on Dell E5420 (ssd): Display manager won't start. In dmesg: "btrfs: checksum failed"

I did a clean install of openSUSE-Tumbleweed-NET-x86_64 (Minimal X installation) on a Dell Latitude E5420 laptop with a 110 GB SSD. After the install I could log in to IceWM and install a few packages, such as a window manager (Enlightenment) and a display manager (lightdm), which I set as the default display manager with the YaST /etc/sysconfig editor. After a reboot lightdm won’t start. The same happens if I try kdm (logs below). dmesg contains many lines like: [  141.628412] BTRFS: sda1 checksum verify failed on 157761536 wanted F5A9B189 found A8E340F4. Does that mean hardware failure? What should I do? Whole dmesg: http://paste.opensuse.org/43831531 I don’t see errors in boot.log: http://paste.opensuse.org/57519021

>lsblk

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 111.8G  0 disk 
├─sda1   8:1    0    45G  0 part /
├─sda2   8:2    0     6G  0 part [SWAP]
└─sda3   8:3    0    10G  0 part /shared

sda1 is btrfs, sda3 is xfs. GRUB is installed on both sda1 (which has the boot flag) and the MBR. I left the other settings at their defaults.

In Lightdm.log there’s 1 warning message:

[+0.01s] WARNING: Error getting user list from org.freedesktop.Accounts: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.freedesktop.Accounts was not provided by any .service files

fstab and whole lightdm.log: http://paste.opensuse.org/9841259

Kdm.log:


Fatal server error:
Cannot open log file "/var/log/Xorg.0.log"

This is my second attempt at the installation. The first attempt had the same kind of problems: I managed to start lightdm and the window manager a couple of times, but then it booted only to the command line with root mounted read-only.

Output of smartctl -a /dev/sda: http://paste.opensuse.org/66560378

Not necessarily (your smartctl output looks good AFAICT), but at least the filesystem is corrupted. This can also happen if you don’t shut down the system cleanly. There was also a bug in 13.2 that caused the filesystem (not only btrfs, I experienced that with reiserfs myself) to get corrupted when resuming from hibernate (this has been fixed with an update).

I would suggest booting from a LiveCD and running “btrfsck --repair /dev/sda1” to repair it.

If you don’t have a LiveCD handy, you can also abort the boot just before / is mounted by appending “rd.break” to the boot options (press ‘e’ at the grub menu and append that to the line starting with “linux” or “linuxefi”, then press F10 to boot).
You’d get into text mode then, unmount the root partition with “umount /sysroot”, and then run btrfsck.
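Before running the repair, a read-only pass shows how much damage there is without writing anything to the disk. A minimal sketch, assuming a current btrfs-progs where btrfsck is the same tool as “btrfs check”; the function names and the optional mounts-table argument are made up for illustration:

```shell
# Sketch: refuse to check a device that is still mounted, then run a
# read-only btrfs check. Function names are illustrative, not standard tools.

# Return 0 if the device appears as a mount source in the given mounts table.
is_mounted() {
    dev=$1
    table=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

check_btrfs_readonly() {
    dev=$1
    if is_mounted "$dev"; then
        echo "$dev is mounted; unmount it first (e.g. umount /sysroot)" >&2
        return 1
    fi
    # 'btrfs check' does not write by default; --readonly makes that explicit.
    btrfs check --readonly "$dev"
}

# Usage (as root, from a live system or the rd.break shell):
#   check_btrfs_readonly /dev/sda1
```

If the read-only pass already ends in a crash, the repair run is unlikely to do better with the same version of the tools.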

Kdm.log:


Fatal server error:
Cannot open log file "/var/log/Xorg.0.log"

Well, Xorg cannot even open its log file. Probably caused by the filesystem corruptions.

Here’s the output of that command: http://paste.opensuse.org/79283372
Lots of corruptions found, but it ends with a wrong csum and a segfault. Is there anything else that could be done?

Is there some critical SSD-related setting that I missed that could have caused this?

At first, after the install, there was the message “Failed to start Create Volatile Files and Directories.” during boot. I got that working by following the instructions here: https://en.opensuse.org/SDB:SSD_performance

That’s bad.
Maybe trying with a newer version might help, e.g. booting a current Tumbleweed LiveCD?

Is there some critical SSD-related setting that I missed that could have caused this?

No.

At first, after the install, there was the message “Failed to start Create Volatile Files and Directories.” during boot. I got that working by following the instructions here: https://en.opensuse.org/SDB:SSD_performance

btrfs detects itself if you are using a SSD, and should adapt automatically without you having to configure anything.
A side note: relatime should not be necessary, as it has been the default for some time.
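For reference, the SSD detection is based on the kernel’s rotational flag for the block device. A minimal sketch for checking that flag by hand; the function name is made up for illustration, and the optional second argument only exists so the function can be tried against a fake directory tree:

```shell
# Sketch: report whether the kernel considers a disk non-rotational (SSD).
# Reads <sysfs-root>/<disk>/queue/rotational: 0 = non-rotational, 1 = rotational.
# 'is_nonrotational' and its second argument are illustrative assumptions.
is_nonrotational() {
    disk=$1
    sysroot=${2:-/sys/block}
    flag_file="$sysroot/$disk/queue/rotational"
    if [ ! -r "$flag_file" ]; then
        echo "cannot read $flag_file" >&2
        return 2
    fi
    [ "$(cat "$flag_file")" = "0" ]
}

# Usage:
#   is_nonrotational sda && echo "sda is treated as an SSD"
```

The mount options actually in effect for / (including ssd and relatime, if set) can be listed with “findmnt -no OPTIONS /”.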

That error message is about the creation of tmp files and directories, i.e. the stuff in /usr/lib/tmpfiles.d/ and /etc/tmpfiles.d/. So apparently the filesystem was corrupted already on first boot? Did the installer crash or what?
Or there’s something severely wrong either with your SSD or btrfs (or its settings).

Anyway, you might try to reinstall using ext4 instead. Maybe this will avoid the problems (although btrfs should work too).

I didn’t monitor the install process to the end, and when I returned it was already booting. So there could have been a crash during installation. There was a Windows installation on this computer before, which seemed to work normally based on some testing, and that makes me suspect the SSD isn’t in that bad a condition. But I’ll try ext4 next.

I made a clean install with ext4 and everything seems to be working. Lightdm and Enlightenment run without problems and no errors found in dmesg or xsession-errors. Time will tell if this is going to last but looks good so far.

I used badblocks to wipe the USB stick that I used for the installer before writing the installer image again. I have used that stick for different installations and often formatted its partitions using just GParted. I suspect GParted / Imagewriter did not completely wipe it but left something behind that may have caused problems. Anyway, thanks for helping out :slight_smile:

Ok. So maybe you stumbled over a bug in btrfs. The Kernel shipped with 13.2 (3.16.6) had some that got fixed afterwards.

I suspect GParted / Imagewriter did not completely wipe it but left something behind that may have caused problems.

I doubt that.
Imagewriter overwrites everything (well, at least according to the size of the image) on the stick, including the partition table.

It was Tumbleweed with kernel 4.1.6. I tried to do an installation with only ext4 partitions on LVM. I followed the instructions on http://https://www.suse.com/documentation/sles11/stor_admin/data/bi706ct.html . I made an LVM partition, added a volume group to it, and added logical volumes to that, including /. After I chose Accept, the summary didn’t show any of the LVs I had made and warned that there’s no root. I tried several times with the same result. A helpful guy showed me in VirtualBox how to do an installation with LVM, and it was exactly how I did it. I ended up installing without LVM, which went without problems.

After running the system for about a day without hiccups, at one boot GRUB warned me that it could not detect an operating system. I didn’t write down the exact error I saw in dmesg from the live CD, but it was very similar to this:

[  376.360004] EXT4-fs (sda1): ext4_check_descriptors: Block bitmap for group 1 not in group (block 1026)!
[  376.360480] EXT4-fs (sda1): group descriptors corrupted!
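To avoid having to write such messages down by hand, the kernel log can be filtered for filesystem corruption reports and saved to a file. A rough sketch; the function name is made up, and the patterns are an assumption based on the btrfs and ext4 messages quoted in this thread (plus “csum failed” and “EXT4-fs error”), so they will not catch everything:

```shell
# Sketch: keep only kernel log lines that look like filesystem corruption
# reports. Reads the log on stdin. 'fs_errors' is an illustrative name and
# the pattern list is an assumption, not an exhaustive set of messages.
fs_errors() {
    grep -E 'BTRFS.*(checksum verify failed|csum failed)|EXT4-fs.*(corrupted|not in group|error)'
}

# Usage (from the live CD, saving the result somewhere that survives a reboot):
#   dmesg | fs_errors > /path/to/usb-stick/fs-errors.txt
```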

I ran fsck -y on the / partition. It found and fixed lots of errors, but after that everything in / is now in lost+found. I didn’t lose anything valuable, so retrieving files isn’t important. I ran smartctl to check the health of my SSD, but I don’t see anything alarming:

linux:~ # smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.11.6-4-default] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     KINGSTON SV300S37A120G
Serial Number:    50026B7751107ACF
LU WWN Device Id: 5 0026b7 751107acf
Firmware Version: 600ABBF0
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug 30 11:24:51 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x7d) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Abort Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (  48) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x0025)    SCT Status supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   120   120   050    Old_age   Always       -       0/0
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       106h+36m+03.660s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       146
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       92
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 Airflow_Temperature_Cel 0x0000   039   047   000    Old_age   Offline      -       39 (Min/Max 21/47)
194 Temperature_Celsius     0x0022   039   047   000    Old_age   Always       -       39 (Min/Max 21/47)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/0
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/0
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/0
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0000   100   100   011    Old_age   Offline      -       0
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       126
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       124
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       124
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       130
244 Unknown_Attribute       0x0000   100   100   010    Old_age   Offline      -       720897

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       106         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       107         -
# 2  Short offline       Completed without error       00%       106         -
# 3  Conveyance offline  Completed without error       00%       106         -
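When comparing several smartctl runs over time, it can help to pull out just the raw values of a few attributes instead of reading the whole table. A small sketch that works on the attribute table format shown above (as printed by “smartctl -A”); the function name is made up, and which attributes are worth watching is a judgment call:

```shell
# Sketch: print the RAW_VALUE column for one named SMART attribute.
# Reads the attribute table on stdin; prints the first match only
# (some names, e.g. Program_Fail_Count, appear under two IDs).
# 'smart_raw' is an illustrative helper, not part of smartmontools.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Usage (as root):
#   smartctl -A /dev/sda | smart_raw Unexpect_Power_Loss_Ct
#   smartctl -A /dev/sda | smart_raw Power_Cycle_Count
```

In the output above these two come out as 92 and 146 respectively.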

I’m running out of ideas for what to try next. Any suggestions?

Here are my BIOS settings, in case something there is relevant:

Boot list option > Legacy (not UEFI), System Management: Disabled (other options: alert only, asf 2.0, dash/asf 2.0), SATA Operation: AHCI (other: disabled, ATA), [x] enable eSATA Ports, [x] Enable Hard Drive Free Fall Protection, Drives: enabled SATA-0,1,4,5. Integrated NIC: Enabled w/PXE (other disabled, enabled). Performance: [x] Enable Intel SpeedStep, [x] C states control, [x] Enable intel TurboBoost, [x] Enable HyperThread control, Fastboot: thorough (other: minimal, auto).

At installation, I did not change any partition options other than mount point, size and file system. Other installation settings were left at their defaults, except that I installed GRUB on both the MBR and /. DE option: “minimal X”. I did not use encryption. The installation didn’t crash and ran to the end without error messages, except when I tried to use LVM (described before).

With LVM you need a separate /boot partition (ext2 or ext4) of about 500 MB.

The link to the instruction page you used is corrupted, so I can’t say if the instructions are right.

I found that out later and tried creating a separate primary ext4 /boot partition. The installer included it in the summary, but the summary was otherwise identical: it still didn’t list any logical partitions and complained about the lack of /. Original link to instructions: http://https://www.suse.com/documentation/sles11/stor_admin/data/lvm.html

Btw, I ran badblocks -v on the SSD (/dev/sda) and it reported 0 errors.

That is a bad link; it does not work.

Is the drive partitioned as DOS (legacy) or GPT? With GPT there is no such thing as extended or logical partitions.

This should work: https://www.suse.com/documentation/sles11/stor_admin/?page=/documentation/sles11/stor_admin/data/lvm.html
It uses a DOS partition table.
If it has any relevance, the computer previously had Win10 with 3 NTFS partitions, and I just removed them with the YaST installer.

Note that though the instructions are in general fine, they are for SUSE, not openSUSE, and seem to be meant for use after an install. I assumed you were trying to use the installer?

Yes, that’s true. A manageable workaround for the LVM problem is simply not to use it. The file system corruptions are the show stoppers.

It seems that something is corrupting your SSD. If it’s not a hardware problem, it may be worth trying to disable fastboot in the BIOS, and perhaps the eSATA ports (but do keep AHCI SATA mode).

Do you think the corruption may be related to suspension or hibernation?

I have fastboot set to “thorough” (others: minimal, auto), which according to its description won’t skip any parts of the boot process. So I guess that counts as disabled. I didn’t use hibernation much; I just tested whether it worked, and it did. I used suspend quite often too. I didn’t notice anything wrong with either.

I made a new install with no LVM and only ext4, to see if I could get more info about the corruptions if they happened again. After over a day it’s still working. I did a couple of things differently just to try: I manually created a new msdos partition table with the installer (hard disks > sda > Expert… > create new partition table). Last time I left part of the SSD unallocated; now I formatted the whole drive. Before, I had used NFS and SSHFS to copy settings files from my desktop. I don’t know if those could be related to my problems; they worked and didn’t give errors. I might have forgotten to unmount the SSHFS mounts before suspend/reboot, if that matters. This time I only used an external HDD.

Warnings/fails I see now in dmesg, probably nothing relevant:

[    2.362193] i8042: Warning: Keylock active
[    7.837818] systemd-journald[155]: Failed to set file attributes: Inappropriate ioctl for device
[   11.192269] firewire_ohci 0000:09:00.0: register access failure
[   17.265238] ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042F conflicts with OpRegion 0x0000000000000400-0x000000000000047F (\PMIO) (20150410/utaddress-254)
[   17.272678] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[   17.274695] ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20150410/utaddress-254)
[   17.278236] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[   17.279917] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20150410/utaddress-254)
[   17.282531] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[   17.284655] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20150410/utaddress-254)
[   17.339205] lpc_ich: Resource conflict(s) found affecting gpio_ich
[   17.352713] ACPI Warning: SystemIO range 0x0000000000007040-0x000000000000705F conflicts with OpRegion 0x0000000000007040-0x000000000000704F (\_SB_.PCI0.SBUS.SMBI) (20150410/utaddress-254)
[   17.366724] iwlwifi 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
[ 1133.169001] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[ 1133.169037] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment

# And in boot.log:
[ INFO ] PNFS blkmaping enablement. is not active.
[DEPEND] Dependency failed for pNFS block layout mapping daemon.

I got “unknown file system” again on boot. From the live CD:

> fsck /dev/sda1
fsck from utils-linux 2.23.2
e2fsck 1.42.8 (20-Jun-2013)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad.. trying backup blocks..
Block bitmap for group 0 is not in group (block 65569152)
Relocate?<y>?

…and the same for many, many blocks. badblocks didn’t find any errors.
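To see the extent of the damage before letting fsck relocate anything, e2fsck can be run with -n (open read-only, answer “no” to every question) and its exit status decoded afterwards. A small sketch; the function name is made up, and the meanings of the status bits are taken from the e2fsck man page:

```shell
# Sketch: translate an e2fsck exit status (a bit mask) into words.
# 'explain_fsck_status' is an illustrative helper, not part of e2fsprogs.
explain_fsck_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    for pair in "1:errors corrected" "2:reboot recommended" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage error" "32:cancelled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $((code & bit)) -ne 0 ]; then
            out="${out:+$out, }$text"
        fi
    done
    echo "$out"
}

# Usage (as root, with /dev/sda1 unmounted):
#   e2fsck -fn /dev/sda1; explain_fsck_status $?
```

With -n nothing is written, so a damaged filesystem is expected to report “errors left uncorrected”.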

I guess I could install Windows and run the manufacturer’s own diagnostic tool, which only works on Windows, to find out whether the SSD is really broken or not.

There is definitely something funky with that drive. Did you turn on SMART? My SSD came with SMART turned off. I left it off to reduce extra writes to the drive.