[LEAP 15.2] Unable to boot 5.x kernels after updates installed this week

arcasinky · September 13, 2020, 5:54pm

Hi. This system was upgraded from Leap 15.1 to 15.2 about 4-5 weeks ago. The upgrade itself seemed to go smoothly and I had rebooted the system a number of times since then following updates.

Yesterday, things went sideways. A lot of updates were pushed by the openSUSE update repos this week, some of which set the needs-reboot flag. I think I saw a systemd update, a udev update, maybe a new kernel and several others, each of which would have set the needs-reboot flag. I couldn’t afford to reboot at the time so I deferred rebooting until Saturday.

Saturday I performed the reboot and was left with a machine that hung midway through the kernel bring-up. The keyboard is unresponsive and I’m almost immediately presented with a splash screen with 3 green question marks. Eventually I’m returned to an emergency shell prompt but, again, the keyboard is unresponsive so I can’t log in. I do recall seeing a message indicating that systemd-journal failed to start. (This becomes important below.) Even ALT-SYSRQ-B is unresponsive. My only option seems to be to power off.

Here’s the thing: Since this machine was originally a Leap 15.1 machine, the grub menu still contains a 4.12.14 kernel entry that dates back to 15.1. **This old kernel boots successfully.** However, since journald never started during the failed boot, there’s no log information saved anywhere from the failed boot to guess what’s going wrong.

On a hunch that maybe the 5.3.x kernel itself was broken, I installed the 5.8.x kernel from the Kernel:stable repository. This is where I made a mistake: the 5.8 kernel yields the same mid-boot hang with a dead keyboard, but my zypp.conf was the openSUSE default version, which keeps only 3 kernels, so installing 5.8 caused zypper to delete the 4.12 kernel that DID boot.

So I gave up and re-imaged my hard drive using a bitwise copy that I’d made just prior to performing the 15.1 -> 15.2 upgrade. I booted back into 15.1 and performed the upgrade to 15.2 again. This, again, left me with a kernel that hung midway through the boot if I selected the 5.3 kernel from GRUB. I suppose this should be expected since the latest-available packages were installed during the upgrade to 15.2. So ostensibly, at this point, the new 15.2 system’s packages matched the old broken 15.2 system’s packages and the boot behavior also matched.

Once again, though, the old 4.12.14 kernel left over from 15.1 boots fine. This time I changed zypp.conf to preserve that kernel!
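For reference, the setting involved here is `multiversion.kernels` in /etc/zypp/zypp.conf. Below is a minimal sketch of adding an extra kernel version to that list, assuming the stock line reads something like `multiversion.kernels = latest,latest-1,running`. The `keep_kernel` helper is hypothetical and only illustrates the idea; it is not the exact edit made on this system, and it relies on GNU sed's `-i` option.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel version-release to the
# multiversion.kernels list so that kernel is kept during cleanup.
# Usage: keep_kernel <version-release> [path-to-zypp.conf]
keep_kernel() {
    ver="$1"
    conf="${2:-/etc/zypp/zypp.conf}"
    # Do nothing if the version is already on the uncommented line.
    if grep '^multiversion\.kernels' "$conf" | grep -qF -- "$ver"; then
        return 0
    fi
    # Append ",<version>" to the end of the uncommented line.
    sed -i "s/^\(multiversion\.kernels[[:space:]]*=.*[^[:space:]]\)[[:space:]]*\$/\1,${ver}/" "$conf"
}
```

Run as root from the kernel that still boots, something like `keep_kernel "$(uname -r | sed 's/-default$//')"` should add the running version (this assumes the kernel flavor suffix is `-default`).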

So I’m left scratching my head. 5.3 booted fine until updates were installed this week. Since multiple kernels (5.3 from the official repo and 5.8 from the Kernel:stable repo) hang, it’s probably not a kernel problem. Since 4.12.14 boots okay and appears to be completely usable, it’s probably not a hardware problem.

That suggests that there’s a problem inside the initrd images. But what? I tried manually reverting systemd, udev and dracut to 31.4.1, which rebuilt the initrd images, but the failed boot behavior on 5.3/5.8 persists. I was unable to install earlier versions…when I tried to install systemd- or udev-30.x.x, YaST complained that systemd-mini and udev-mini had unresolved dependencies (even though those packages aren’t installed). Since I didn’t understand the nature of that error (why would I get dependency errors for packages that aren’t even installed?) I didn’t force-install the older versions out of fear of leaving the system in an even worse state.

Throughout this, 4.12.14 remains rock solid. This is curious because its initrd image should have been rebuilt alongside the others when I reverted systemd, etc. So if there’s a problem with the contents of the initrds or with the way dracut is constructing the initrd, I would have expected this to break the 4.12.14 kernel. But no, it seems fine.

I’m open to suggestions about how to proceed. The inability to log into the emergency shell during the failed boot means that I have very little information about what went wrong.

nrickert · September 13, 2020, 6:22pm

When you see that screen with 3 green question marks, you can hit ESC. That will give more information on the screen.

If you like that change, you can make it permanent. I edit the kernel line in the boot settings (YaST Bootloader), and remove the “splash=silent” from there. That way, I always see messages during boot. Maybe those messages will help work out what is going wrong.

arcasinky · September 13, 2020, 6:48pm

Thanks, I wasn’t aware that you could turn off that splash. ESC / Insert / etc doesn’t work because the keyboard is dead at this point. I’ll turn it off and see if I can catch one or more errors.

arcasinky · September 15, 2020, 11:09pm

Turning off the splash didn’t help. Plymouth clears the screen (including any errors that were displayed), prints a couple of status lines and then hangs.

It turns out my mainboard has a serial port header so as a last resort, I bought a $6 connector that plugs into this header and gives the usual 9-pin serial port. I was then able to boot with a serial console enabled. Turns out while the keyboard is dead, the serial console remains very much alive.
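The exact settings aren't given above, so the following is only a sketch of the usual kernel parameters for a serial console. It assumes the header is the first serial port (`ttyS0`) running at 115200 baud, 8 data bits, no parity; both values, and the `serial_console_args` helper name, are assumptions.

```shell
#!/bin/sh
# Print kernel parameters that send console output to both the local
# screen and a serial port. Port and baud rate are assumptions.
serial_console_args() {
    port="${1:-ttyS0}"
    baud="${2:-115200}"
    # The last console= listed becomes /dev/console, so the serial
    # port goes last to receive the boot and emergency-shell output.
    printf 'console=tty0 console=%s,%sn8\n' "$port" "$baud"
}

serial_console_args
```

The printed parameters can be appended to the `linux` line after pressing `e` at the GRUB menu for a one-off boot. The machine on the other end of the cable needs a terminal program set to the same baud rate.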

So… the error…

There are a number of systemd failures, the first of which is a failure to start systemd-journald. “systemctl status systemd-journald.service” indicates that the exit code was “228/SECCOMP”. Another was a failure to start systemd-udev. I didn’t look at the exit code for this one but it wouldn’t surprise me if it’s also SECCOMP-related.
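A rough sketch of how the other failed units could be checked for the same status, assuming `systemctl status` prints the exit information in its usual `status=228/SECCOMP` form. The `seccomp_failures` helper is hypothetical, and the unit names in the loop are assumptions; substitute whatever `systemctl --failed` lists.

```shell
#!/bin/sh
# Read "systemctl status" text on stdin and print only the lines
# that report the 228/SECCOMP exit status.
seccomp_failures() {
    grep -F 'status=228/SECCOMP'
}

# Assumed unit names; adjust to the units that actually failed.
for unit in systemd-journald.service systemd-udevd.service; do
    if systemctl status "$unit" --no-pager 2>/dev/null | seccomp_failures; then
        echo "$unit: exited with 228/SECCOMP"
    fi
done
```

Exit status 228 is systemd's code for a failure while applying a unit's system call filter, which is why the seccomp library is a plausible suspect.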

So…why?

There were a bunch of updates installed the week before the system stopped booting. I do recall seeing updates for the kernel, systemd and udev. And the changelog for libseccomp suggests that the current v2.5.0 was pushed to the repos within the last week.

Okay. So if seccomp stuff changed and things were somehow out of sync, that might explain why systemd stuff is failing with SECCOMP error codes. But I haven’t a clue why nobody else is seeing this. Near as I can tell, everything is up to date and since I try to avoid messing with systemd configuration, my /usr/lib/systemd stuff should be more or less stock.

So is there another mechanism that can lead to SECCOMP errors?

This system is a Ryzen 2700x, btw.

nrickert · September 15, 2020, 11:47pm

You can put “plymouth.enable=0” on the kernel line, to completely turn off plymouth.
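As a minimal sketch of what the edited kernel line ends up looking like, the hypothetical `verbose_cmdline` helper below drops `splash=silent` and `quiet` from a set of kernel parameters and adds `plymouth.enable=0`. The `root=/dev/sda2` value in the example is made up for illustration.

```shell
#!/bin/sh
# Take kernel parameters as arguments and print a version that shows
# boot messages: no splash=silent, no quiet, plymouth disabled.
verbose_cmdline() {
    out=""
    for arg in "$@"; do
        case "$arg" in
            splash=silent|quiet) ;;   # drop the options that hide messages
            plymouth.enable=0) ;;     # added back exactly once below
            *) out="${out:+$out }$arg" ;;
        esac
    done
    printf '%s\n' "${out:+$out }plymouth.enable=0"
}

# Example with a made-up root device.
verbose_cmdline root=/dev/sda2 splash=silent quiet
```

The result can be typed onto the `linux` line from the GRUB editor for a single boot, or entered in the YaST Bootloader kernel parameters to make it permanent.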

I regret that I’m not knowledgeable enough about the other issues.

mbrookhuis · September 16, 2020, 8:44am

I think I am having the same issue, and it has something to do with the shim updates that came with the new kernel. The root file system is not (properly) mounted.

After the update and the first reboot the MOK admin page will appear. Don’t know what to do in that screen other than accept the changes and enter the root password when asked. But then the reboot still fails. Going back to the previous snapshot works and I can continue.

What do we need to do, and how, to get the new shim (UEFI and grub2) working?

Libero · September 16, 2020, 6:26pm

I encountered the same problem as you. After I updated openSUSE Leap 15.2 this week, I got several “failed to start udev kernel device manager” and “failed to start journal service” error messages during boot. Changing the kernel to an older version in grub2 didn’t help.
After a couple of “snapper rollback” runs, I managed to narrow the problem down to the libseccomp2 and libseccomp2-32bit (2.5.0-lp152.88.1) updates.
If I skip these updates, my system boots flawlessly (with kernel 5.3.18-lp152.41 too).

arcasinky · September 16, 2020, 9:02pm

Thanks. I couldn’t use snapper to roll back but reverting libseccomp and friends to v2.4.1 and rebuilding the initrd files does indeed resolve the problem.

Definitely looks like something is out of sync between libseccomp v2.5.0 and systemd.
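For anyone in the same spot, here is a hedged sketch of the revert-and-rebuild steps described above. The package names come from the earlier post in this thread; the `name=version` form is assumed to match any release of 2.4.1 still available in the configured repos, and the lock step is an extra precaution rather than something reported above. The `run` and `revert_libseccomp` helpers are hypothetical. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of running it.
# Set DRY_RUN=0 (as root) to actually execute.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

revert_libseccomp() {
    ver="${1:-2.4.1}"   # version reported as working in this thread
    # --oldpackage lets zypper replace a newer package with an older one.
    run zypper install --oldpackage "libseccomp2=${ver}" "libseccomp2-32bit=${ver}"
    # Lock the packages so the next update does not pull 2.5.0 back in.
    run zypper addlock libseccomp2 libseccomp2-32bit
    # Rebuild every initrd so each one picks up the downgraded library.
    run dracut --force --regenerate-all
}

revert_libseccomp
```

If the 32-bit package isn't installed, drop it from both zypper lines. Once a fixed package is published, `zypper removelock` on the same names undoes the lock.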

mbrookhuis · September 17, 2020, 8:30am

Can confirm. libseccomp2 is causing the problem.

wolfi323 · September 22, 2020, 2:21pm

See https://bugzilla.opensuse.org/show_bug.cgi?id=1176470.

So, do not install the newer libseccomp2 from the “security” repo, better stick to the one from the standard repos…