Boot stuck after Tumbleweed upgrade: start task is running for disk

tranquilreed · September 8, 2024, 2:30am

I am currently still running the Tumbleweed release 20240624 because I have tried and failed to upgrade the system. After every upgrade, the boot becomes stuck at the systemd message “a start job is running for /dev/disk/by-uuid/fa5024…”. I am unable to debug what went wrong and thus my only choice is to boot from an older btrfs snapshot and then do sudo transactional-update rollback (thankfully I read this blog post about transactional-update before I first installed Tumbleweed).

My computer has two disks: a SATA SSD for my home directory and an NVMe SSD for everything else. Both disks use full disk encryption. My setup looks like this:

$ lsblk -f
NAME                       FSTYPE      FSVER    LABEL UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
sda                        LVM2_member LVM2 001       veF68U-DZ7m-kyJQ-6sZc-YLA1-C1oR-YYsku2
└─main--data-data--lv      crypto_LUKS 1              fa5024db-d236-46e0-b40e-af70897e1728
  └─cr_main--data-data--lv btrfs                      1ec87e06-f4d3-4416-9bc5-de0776f9e467        3T    16% /home
nvme0n1
├─nvme0n1p1                vfat        FAT32          66AC-EA1A                                             /boot/efi
├─nvme0n1p2                vfat        FAT32          66AD-2D8F
├─nvme0n1p3                crypto_LUKS 1              78f303b1-7c27-4f30-8de7-294d988f1b7b
│ └─cr_root                btrfs                      5632b905-1c13-4fe6-ad6f-2829cea25893    359.4G    19% /usr/local
│                                                                                                           /srv
│                                                                                                           /opt
│                                                                                                           /boot/writable
│                                                                                                           /boot/grub2/x86_64-efi
│                                                                                                           /boot/grub2/i386-pc
│                                                                                                           /.snapshots
│                                                                                                           /var
│                                                                                                           /root
│                                                                                                           /
└─nvme0n1p4                crypto_LUKS 1              54f7c85a-c2e4-4320-807d-e5b3868b9445
  └─cr_swap                swap        1              f6eeeddf-9c4d-488f-a94f-0577bc1f9d76                  [SWAP]
$ ls -l /dev/disk/by-uuid/fa5024db-d236-46e0-b40e-af70897e1728
lrwxrwxrwx 1 root root 10 Sep  7 20:14 /dev/disk/by-uuid/fa5024db-d236-46e0-b40e-af70897e1728 -> ../../dm-1

It appears to me that the start job is waiting for the disk with UUID fa5024... which holds LUKS encrypted data. It appears to use device mapper, although I installed Tumbleweed on this system quite a few years ago and I cannot recall why it used device mapper.

$ sudo dmsetup info
[sudo] password for root:
Name:              cr_main--data-data--lv
State:             ACTIVE
Read Ahead:        1024
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      254, 2
Number of targets: 1
UUID: CRYPT-LUKS1-fa5024dbd23646e0b40eaf70897e1728-cr_main--data-data--lv

Name:              cr_root
State:             ACTIVE
Read Ahead:        1024
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      254, 0
Number of targets: 1
UUID: CRYPT-LUKS1-78f303b17c274f308de7294d988f1b7b-cr_root

Name:              cr_swap
State:             ACTIVE
Read Ahead:        1024
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      254, 3
Number of targets: 1
UUID: CRYPT-LUKS1-54f7c85ac2e44320807de5b3868b9445-cr_swap

Name:              main--data-data--lv
State:             ACTIVE
Read Ahead:        1024
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      254, 1
Number of targets: 1
UUID: LVM-VmW33sTeAZFxqxGZ1GmApEMPQth7j0LmLDJJBwlfPDWS7C921JnpGaXnHeux4RVg
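
The link between the UUID in the boot message and the dmsetup entries can be checked mechanically. The following is a minimal sketch, assuming the CRYPT-LUKS1-<UUID without dashes>-<mapping name> pattern visible in the output above holds for LUKS1 mappings in general; luks1_dm_uuid is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Build the device-mapper UUID string that dmsetup shows for a LUKS1 mapping,
# following the pattern seen in the output above:
#   CRYPT-LUKS1-<LUKS UUID with dashes removed>-<mapping name>
# luks1_dm_uuid is a hypothetical helper name used only for this sketch.
luks1_dm_uuid() {
    luks_uuid=$1
    mapping_name=$2
    printf 'CRYPT-LUKS1-%s-%s\n' \
        "$(printf '%s' "$luks_uuid" | tr -d '-')" "$mapping_name"
}

# The UUID from the stuck start job, paired with its mapping name.
luks1_dm_uuid fa5024db-d236-46e0-b40e-af70897e1728 cr_main--data-data--lv
```

The printed string can be compared against the UUID lines of dmsetup info to confirm which mapping sits on top of the device the start job is waiting for.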

I believe the start job comes from the fact that the volume is mentioned in /etc/crypttab:

$ sudo cat /etc/crypttab
cr_root  UUID=78f303b1-7c27-4f30-8de7-294d988f1b7b  /.root.key  x-initrd.attach
cr_swap  UUID=54f7c85a-c2e4-4320-807d-e5b3868b9445  /.root.key  x-initrd.attach
cr_main--data-data--lv  UUID=fa5024db-d236-46e0-b40e-af70897e1728 /.root.key  x-initrd.attach,keyfile-timeout=10s

I have attempted the following strategies to debug the system:

Since I’m running Linux 6.9.5 and the new Tumbleweed releases that fail to boot use Linux 6.10.x, I attempted to rule out a kernel problem by enabling multiversion for the kernel in /etc/zypp/zypp.conf. I managed to keep the 6.9.5 kernel while using the updated packages from the rest of the system. It still didn’t boot.
I tried to use the new system but edit the GRUB boot entry with various things, including appending to the kernel command line systemd.unit=emergency.target. That did not boot either.
I also tried to get a shell by specifying init=/usr/bin/bash in the kernel command line but that appeared to have no effect.
I thought perhaps the decryption using the keyfile had an issue, so I changed the crypttab to always prompt me for a password. That worked with the old Tumbleweed release but not any new release.
I looked into whether there’s a way to do zypper ref but refresh into an older version so that I can attempt to upgrade to some newer version and do some bisecting. I didn’t find any way to update to a Tumbleweed snapshot that’s not the latest.
Since I suspect initrd to be the problem, I did sudo lsinitrd /.snapshots/137/snapshot/boot/initrd (where 137 is a snapshot that didn’t boot) and looked at the output; I compared it with the old initrd for the 6.9.5 kernel and didn’t find anything suspicious.

I’m out of ideas on how to troubleshoot this further. Please help.

heitormoreira · September 8, 2024, 4:49am

I’m not following Tumbleweed, but you can check the /etc/fstab and the UUID of the device above before and after the upgrade.

arvidjaar · September 8, 2024, 5:54am

This implies that it is stuck in initrd. Does adding rdinit=/usr/bin/bash change anything?

So, upload both outputs to https://paste.opensuse.org/ for others to peruse.

marel · September 8, 2024, 6:25am

If that is right, it is mounting the LUKS partition of the /dev/sda disk.

Can you create a USB boot disk with a recent Tumbleweed build, boot with that and see if you can mount that LUKS partition?

tranquilreed · September 8, 2024, 3:25pm

Sure, here are the two pastes: openSUSE Paste and openSUSE Paste

tranquilreed · September 8, 2024, 3:56pm

I tried your first suggestion of rdinit=/usr/bin/bash just now. It gave me a bash shell almost immediately after GRUB. So that’s the good thing. The bad thing is that my keyboard does not work at all. My keyboard is typically plugged into my Thunderbolt display, but even if I directly plug in to the computer it still doesn’t work.

tranquilreed · September 8, 2024, 7:09pm

I found something extremely interesting and relevant when I created a USB boot disk from the most recent Tumbleweed release.

In the blue linuxrc screen, if I choose “Expert” and then “Start shell” I get a shell. When I run lsblk -f the fa5024... disk does not appear at all. Here I have /dev/sda which is shown as an LVM2_member, but there is no /dev/dm-0 or any other /dev/dm-* device. The commands lvdisplay and vgscan don’t exist either.

However if I choose “Start Installation” then “Rescue System” I get some systemd output followed by “rescue login” prompt. I enter root and I get a shell. In this shell running lsblk -f shows the fa5024 disk appearing. When I run ls -l /dev/dm-0 I find that it exists and is a block device. When I run vgdisplay it shows the disk. When I use cryptsetup open /dev/dm-0 it prompts me for my password and then afterwards I can see the btrfs file system containing my home directory appearing. The UUID matches exactly what I see in my working system.

Does this mean when I use the rescue system feature of the USB disk, it gets the device mapper disks but not before? Maybe there ought to be a way to have device mapper disks be supported earlier in the boot? I also don’t know what the rescue system feature did to get to this state.
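
The rescue-shell steps described above can be written down as a dry-run sketch. The volume group name main-data and logical volume name data-lv are inferred from the device-mapper name main--data-data--lv; the mapping name cr_home_test and the /mnt mount point are hypothetical. In the experiment above the logical volume was already active, so the vgchange step may be unnecessary there.

```shell
#!/bin/sh
# Dry-run sketch of unlocking the /home volume from a rescue shell.
# By default the commands are only printed; set RUN=1 (as root, inside the
# rescue system) to actually execute them.
# Assumptions: VG "main-data" and LV "data-lv" are inferred from the dm name
# "main--data-data--lv"; "cr_home_test" and /mnt are hypothetical.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

unlock_home() {
    run vgscan                                               # look for LVM volume groups
    run vgchange -ay main-data                               # activate the VG on /dev/sda
    run cryptsetup open /dev/main-data/data-lv cr_home_test  # asks for the passphrase
    run mount -o ro /dev/mapper/cr_home_test /mnt            # read-only, for inspection
}

unlock_home
```

If the LVM tools are missing, as in the linuxrc expert shell, the second step cannot run and the LUKS container inside the logical volume never becomes visible.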

marel · September 8, 2024, 8:18pm

Good that you tried a USB boot with a recent Tumbleweed build. Your second experiment shows there is no problem with recent Tumbleweed itself; it looks like a configuration problem.

What you write about LVM2_member and vgdisplay means that there is more at play than your initial “lsblk -f” shows and I do not have experience with that.

arvidjaar · September 10, 2024, 1:06pm

Actually, none of them has LVM stuff and so cannot configure LV in initrd. Can you boot using “good” kernel/initrd and with printk.devkmsg=on log_buf_len=16M on the kernel command line and upload full output as root of

journalctl -b --full --no-pager

ektus · September 10, 2024, 4:09pm

It seems like I’ve got a very similar problem. I’m not using encryption, but I do have my home directory linked from a different drive. The normal boot process runs into failures; the rescue system encounters the same errors but starts, although not all services are running (e.g. polkit) and group names cannot be resolved.


ektusFW16: # systemctl status systemd-tmpfiles-setup-dev-early.service
× systemd-tmpfiles-setup-dev-early.service - Create Static Device Nodes in /dev gracefully
     Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-setup-dev-early.service; static)
     Active: failed (Result: core-dump) since Thu 2024-09-05 21:14:59 CEST; 34s ago
   Duration: 17.628s
 Invocation: 5a1a14d19a354229b07842ba6beccfd2
       Docs: man:tmpfiles.d(5)
             man:systemd-tmpfiles(8)
    Process: 1271 ExecStart=systemd-tmpfiles --prefix=/dev --create --boot --graceful (code=dumped, signal=SEGV)
   Main PID: 1271 (code=dumped, signal=SEGV)
        CPU: 8ms

Sep 05 21:14:59 ektusFW16 systemd[1]: Starting Create Static Device Nodes in /dev gracefully...
Sep 05 21:14:59 ektusFW16 systemd[1]: systemd-tmpfiles-setup-dev-early.service: Main process exited, code=dumped, status=11/SEGV
Sep 05 21:14:59 ektusFW16 systemd[1]: systemd-tmpfiles-setup-dev-early.service: Failed with result 'core-dump'.
Sep 05 21:14:59 ektusFW16 systemd[1]: Failed to start Create Static Device Nodes in /dev gracefully.

ekkehard@ektusFW16:/$ sudo stat -c "%U %G" .
root UNKNOWN

I’m on a Framework 16, and Tumbleweed has been running flawlessly for a couple of months now. I can still get into a somewhat working system by starting the rescue system, logging in as root and immediately exiting the shell again.
I’m currently on 6.10.8-1-default but had the same error with 6.10.5-1, even though that kernel had been running without problems before.

Some more info from my thread on Open suse tumbleweed won't boot properly - Linux - Framework Community

Which Linux distro are you using?
OpenSuse tumbleweed
Which release version?
(If rolling release, last date updated?)
2024-09-07 (also a couple days earlier)
Which kernel are you using?
6.10.5-1-default
[edit 2024-09-08]
6.10.8-1 is the same, as is 6.10.7-1
[/edit]
Which BIOS version are you using?
latest beta (3.03?)

Which Framework Laptop 16 model are you using? (AMD Ryzen™ 7040 Series)
7840 without dGPU

For a couple days now, I’m unable to successfully boot my openSUSE tumbleweed installation. The boot process fails with

systemd-tmpfile[763]: segfault at 666e6f00666e ip 00007fc7df13b393 sp 00007ffd974447b0 error 4 in libc.so.6[13b393,7fc7df028000+16e000] likely on CPU 0 (core 0, socket 0)
kernel: Code: 90 90 90 90 90 f3 0f 1e fa 90 90 41 57 41 56 41 55 49 89 fd 41 54 55 48 89 f5 53 48 83 ec 08 48 8b 1d 71 4a 0b 00 64 44 8b 23 <8b> 07 83 f8 01 74 27 83 f8 02 75 19 64 44 89 23 48 83 c4 08 31 c0

This is the first in a series of errors. I can get the system to a semi-working state, but polkit is not running and no group names can be shown.
To illustrate ls -l:
-rw-rw-r-- 1 ekkehard 1000 4553965568 4. Sep 19:54 openSUSE-Tumbleweed-DVD-x86_64-Snapshot20240903-Media.iso
-rw-r--r-- 1 root 0 125806993 3. Sep 05:17 systemjournal20240903.txt
I’ve tried booting with systemd.log_level=debug and got the resulting journalctl -xb in a file, but wasn’t able to determine the root cause yet. The system has been running successfully with this kernel for some time prior to the problems occurring.
Trying to execute the boot with systemd.confirm_spawn=true wasn’t successful as the system wouldn’t accept any keyboard input when prompting to allow new processes.

tranquilreed · September 10, 2024, 11:06pm

Sure thing; here it is: openSUSE Paste

I looked at it briefly. It did say lvm[1134]: 1 logical volume(s) in volume group "main-data" now active followed soon by systemd[1]: Found device /dev/disk/by-uuid/fa5024db-d236-46e0-b40e-af70897e1728.

So I presume that lvm is somehow active.

arvidjaar · September 11, 2024, 7:21am

It happens after initrd.

There are no traces of LVM or your /home device in initrd:

Sep 10 18:52:17 localhost systemd[1]: Expecting device /dev/disk/by-uuid/5632b905-1c13-4fe6-ad6f-2829cea25893...
Sep 10 18:52:17 localhost systemd[1]: Expecting device /dev/disk/by-uuid/66AC-EA1A...
Sep 10 18:52:17 localhost systemd[1]: Expecting device /dev/disk/by-uuid/78f303b1-7c27-4f30-8de7-294d988f1b7b...

Those are LUKS container for root, filesystem inside this container and ESP.
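
Pulling these lines out of a long journal by hand is tedious. Here is a small sketch of a filter, assuming the "Expecting device <path>..." message format shown in the excerpt above; expected_devices is a hypothetical helper name.

```shell
#!/bin/sh
# Print the device paths from "Expecting device <path>..." journal lines.
# Reads journal text on stdin. The message format is taken from the excerpt
# above; expected_devices is a hypothetical helper name.
expected_devices() {
    sed -n 's/.*Expecting device \(.*\)\.\.\.$/\1/p'
}

# Demo on the three lines quoted above.
expected_devices <<'EOF'
Sep 10 18:52:17 localhost systemd[1]: Expecting device /dev/disk/by-uuid/5632b905-1c13-4fe6-ad6f-2829cea25893...
Sep 10 18:52:17 localhost systemd[1]: Expecting device /dev/disk/by-uuid/66AC-EA1A...
Sep 10 18:52:17 localhost systemd[1]: Expecting device /dev/disk/by-uuid/78f303b1-7c27-4f30-8de7-294d988f1b7b...
EOF
```

On a running system the same filter could be fed with journalctl -b --full --no-pager. That output covers the whole boot, so the timestamps still have to be read to tell the initrd phase from the rest.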

But the /etc/crypttab you showed has the x-initrd.attach option for the /home device, so dracut is expected to include this line in the initrd, which apparently happens in the bad case.

Good initrd:

-rw-r--r--   1 root     root           90 May 27 08:09 etc/crypttab

Bad initrd

-rw-r--r--   1 root     root          299 Aug  9 08:30 etc/crypttab

Can you show the /etc/crypttab from initrd in both cases? You can extract files from initrd using

lsinitrd /boot/initrd-XXX etc/crypttab

Also, are you sure the /etc/crypttab content is the same in both cases? Maybe the older snapshot simply did not have the x-initrd.attach option?

arvidjaar · September 11, 2024, 7:35am

Or it had older dracut version:

* Tue Jul 02 2024 antonio.feijoo@suse.com
- Update to version 059+suse.628.g20b345b4:
  * feat(crypt): force the inclusion of crypttab entries with x-initrd.attach (bsc#1226529)

tranquilreed · September 11, 2024, 10:50pm

I think we are getting somewhere!

The good and bad initrd:

$ sudo lsinitrd /boot/initrd etc/crypttab # good
[sudo] password for root:
cr_root /dev/disk/by-uuid/78f303b1-7c27-4f30-8de7-294d988f1b7b /.root.key x-initrd.attach
$ sudo lsinitrd /.snapshots/137/snapshot/boot/initrd etc/crypttab # bad
cr_root /dev/disk/by-uuid/78f303b1-7c27-4f30-8de7-294d988f1b7b /.root.key x-initrd.attach
cr_swap /dev/disk/by-uuid/54f7c85a-c2e4-4320-807d-e5b3868b9445 /.root.key x-initrd.attach
cr_main--data-data--lv /dev/disk/by-uuid/fa5024db-d236-46e0-b40e-af70897e1728 none x-initrd.attach,keyfile-timeout=10s

So it appears that I just need to convince dracut to reduce the initrd to just the root device.
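
The comparison above can also be scripted, which helps when the crypttab files are longer. A minimal sketch, assuming each listing has been saved to a file first (for example by redirecting the lsinitrd output); crypttab_names and crypttab_added are hypothetical helper names.

```shell
#!/bin/sh
# Print the mapping names (first field) of a crypttab file, ignoring blank
# lines and comment lines.
crypttab_names() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Print mapping names that appear in the second file but not in the first.
crypttab_added() {
    old_names=$(mktemp) || return 1
    new_names=$(mktemp) || { rm -f "$old_names"; return 1; }
    crypttab_names "$1" | sort > "$old_names"
    crypttab_names "$2" | sort > "$new_names"
    comm -13 "$old_names" "$new_names"
    rm -f "$old_names" "$new_names"
}

# Demo with the two listings above, saved to scratch files.
scratch=$(mktemp -d) || exit 1
cat > "$scratch/good" <<'EOF'
cr_root /dev/disk/by-uuid/78f303b1-7c27-4f30-8de7-294d988f1b7b /.root.key x-initrd.attach
EOF
cat > "$scratch/bad" <<'EOF'
cr_root /dev/disk/by-uuid/78f303b1-7c27-4f30-8de7-294d988f1b7b /.root.key x-initrd.attach
cr_swap /dev/disk/by-uuid/54f7c85a-c2e4-4320-807d-e5b3868b9445 /.root.key x-initrd.attach
cr_main--data-data--lv /dev/disk/by-uuid/fa5024db-d236-46e0-b40e-af70897e1728 none x-initrd.attach,keyfile-timeout=10s
EOF
crypttab_added "$scratch/good" "$scratch/bad"
rm -rf "$scratch"
```

For the two listings above this reports cr_swap and cr_main--data-data--lv as the entries that only the non-booting initrd tries to set up.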

tranquilreed · September 12, 2024, 12:43am

I decided not to fight with dracut or forcing an older version of dracut. I simply removed x-initrd.attach from the last line in /etc/crypttab. It’s not needed anyways: I don’t think there’s a need to unlock my home directory during early boot.

And I’m happy to report that this solution worked! I just ran sudo transactional-update dup and the system boots just fine.
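
For anyone who prefers to script the same edit, here is a sketch of a filter that drops x-initrd.attach from one named entry. It works on text from stdin and never touches /etc/crypttab itself; drop_initrd_attach is a hypothetical helper name, and the edited line is rewritten with single spaces between fields.

```shell
#!/bin/sh
# Remove the x-initrd.attach option from one named crypttab entry.
# Reads crypttab text on stdin and writes the edited text to stdout.
# drop_initrd_attach is a hypothetical helper name used for this sketch.
drop_initrd_attach() {
    awk -v name="$1" '
        $1 == name && NF >= 4 {
            n = split($4, opts, ",")
            kept = ""
            sep = ""
            for (i = 1; i <= n; i++) {
                if (opts[i] != "x-initrd.attach") {
                    kept = kept sep opts[i]
                    sep = ","
                }
            }
            # The options field is optional, so omit it if nothing is left.
            if (kept == "")
                print $1, $2, $3
            else
                print $1, $2, $3, kept
            next
        }
        { print }   # all other lines pass through unchanged
    '
}

# Demo on the crypttab from the first post.
drop_initrd_attach cr_main--data-data--lv <<'EOF'
cr_root  UUID=78f303b1-7c27-4f30-8de7-294d988f1b7b  /.root.key  x-initrd.attach
cr_swap  UUID=54f7c85a-c2e4-4320-807d-e5b3868b9445  /.root.key  x-initrd.attach
cr_main--data-data--lv  UUID=fa5024db-d236-46e0-b40e-af70897e1728 /.root.key  x-initrd.attach,keyfile-timeout=10s
EOF
```

The output is meant to be reviewed (for example with diff against the original) before it replaces the real file. The initrd only picks up the change once it is regenerated, which in my case happened as part of the transactional-update dup run.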

Silancu · September 12, 2024, 2:53pm

Well, I am far from being an expert, but I wonder what would have happened if, in the rescue system, after mounting everything relevant and chrooting, you had rebuilt the initrd. Why didn’t you try it?

tranquilreed · September 14, 2024, 12:32am

If @arvidjaar hadn’t noticed that the initrd had changed between the working and non-working systems, I wouldn’t have known to fix that in the rescue system.

Also working with transactional updates is much more pleasant than the rescue system. I’d rather be in my familiar environment first.

ektus · September 14, 2024, 5:58pm

My case showed similar symptoms, but likely wasn’t related.

I finally found the problem. It was an entry in /etc/nsswitch.conf that led to getent not working (like, at all) and producing lots of subsequent errors.

There was an entry group: [SUCCESS=merge] compat systemd. After altering this one to group: compat [SUCCESS=merge] systemd the problems vanished. I’ve got no idea what went wrong there.
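
A quick way to spot this kind of line is sketched below. It assumes, as the glibc documentation describes it, that an action item such as [SUCCESS=merge] belongs after the service it applies to; misplaced_actions is a hypothetical helper name.

```shell
#!/bin/sh
# Report nsswitch.conf lines where an action item such as [SUCCESS=merge]
# comes directly after the database name, before any service.
# Reads the file text on stdin and prints "<line number>: <line>".
# misplaced_actions is a hypothetical helper name used for this sketch.
misplaced_actions() {
    awk '$1 !~ /^#/ && $1 ~ /:$/ && $2 ~ /^\[/ { print NR ": " $0 }'
}

# Demo: the broken entry followed by the corrected one.
misplaced_actions <<'EOF'
group: [SUCCESS=merge] compat systemd
group: compat [SUCCESS=merge] systemd
EOF
```

Only the first of the two demo lines is reported. On a real system the check would be run as misplaced_actions < /etc/nsswitch.conf.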
