Tumbleweed failure after upgrade -- udev fork failures and no longer booting

I need help, and I admit I know far too little about OpenSuSE to offer much in the way of information about what my problem might be, at least in an initial post. I apologize for that, but I’ll try to explain what happened.

I am using OpenSuSE Tumbleweed. I had previously used OpenSolaris, and I knew that extremely well (former Sun engineer and kernel developer), but I switched to OpenSuSE (and thus Linux) for new hardware support.

I update once every couple of weeks. Mostly, it goes ok, except for having to recompile the NVIDIA driver manually, which is quite a pain. But it works. (I’d probably upgrade more frequently if this were less of a hassle.)

That is, it did work until today. I did my usual dance of running “zypper up” and waiting while 2GB+ of updates downloaded and installed. There were a few file conflicts between some texinfo packages – but I don’t care much about that. Then I rebooted to single user, and rebuilt the NVIDIA driver, as usual. That worked.

Then I rebooted again. That’s when the horrors started. First, the boot hung, so I pressed “Esc” to get off the three-dots screen and see what it was doing. I saw something about waiting on “dev-md-tank.device” with a “1min 30s” to go. It timed out and gave me “emergency mode.”

I logged in. I have two RAID arrays – a root mirror on /dev/md0 and a RAID5 array of 4 drives called /dev/md/tank. /proc/mdstat said only md0 was there. So, I ran mdadm -A /dev/md/tank. That immediately got the RAID5 array back without complaint. I mounted /tank manually, and everything was there. No problems. I searched around and found documentation about using mdadm to dump state, and I looked through it. There were no problems at all – all drives present, clean, no issues. It doesn’t appear to be an mdadm problem.
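
For anyone retracing this step, here is a minimal sketch of the same check. The helper name list_md_arrays is made up for illustration. It only reads /proc/mdstat; the mdadm commands appear as comments because they need root and the real array.

```shell
#!/bin/sh
# List the md arrays the kernel currently has assembled,
# given mdstat-format text on stdin. (Hypothetical helper.)
list_md_arrays() {
    # Array lines look like "md0 : active raid1 sda1[0] sdb1[1]".
    awk '/^md[^ ]* *:/ { print $1 }'
}

if [ -r /proc/mdstat ]; then
    echo "Assembled arrays:"
    list_md_arrays < /proc/mdstat
else
    echo "/proc/mdstat not readable (md driver not loaded?)"
fi

# Commands from the post, in long form (run as root; not executed here):
#   mdadm --assemble /dev/md/tank    # same as "mdadm -A /dev/md/tank"
#   mdadm --detail /dev/md/tank      # state of the assembled array
```

If md0 is listed but the RAID5 array is not, that matches the situation described above.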

journalctl told me that systemd-udevd is in terrible shape. It is getting tons of fork failures, and is erroring out on all of the commands it’s trying to run as a result. I tried playing with “udevadm control -m” to set a different number of children (I tried high and low numbers), but nothing made any difference. Something seems to be wrong with udevd, but I don’t know what, and try as I might, I cannot locate where any of the configuration information comes from. (/usr/lib/udev looks promising, but I can’t figure out how any of it relates to anything I see.) If only I could delete “dev-md-tank.device” from the system I would have been able to proceed.

I tried rebooting. Hung again. I tried commenting out the mdadm.conf “ARRAY” entry to get rid of it temporarily (it’s just data; I don’t need it to boot) and removing “partitions” from the “DEVICE” line, but the system insisted on timing out on dev-md-tank.device. I googled but found no way to delete that udev entry or disable it or skip it or get around the dependency. It seems like it’s automatic somehow. I tried assembling it and mounting it manually and doing “systemctl default” to continue the boot. That hung because of something called “Plymouth.”
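
One thing that may be worth checking here, as an assumption the post does not confirm: if /tank is listed in /etc/fstab, systemd generates the mount unit from that entry, and the mount then waits for the device unit (dev-md-tank.device corresponds to /dev/md/tank). The "nofail" mount option, documented in systemd.mount(5), lets the boot continue when the device is missing. The sketch below only reports whether an fstab entry carries that option. The helper name and the /tank mount point are illustrative.

```shell
#!/bin/sh
# Report whether the fstab entry for a mount point has "nofail".
# $1 = mount point; fstab-format text is read from stdin.
check_fstab_entry() {
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp {
        if ($4 ~ /(^|,)nofail(,|$)/) print "nofail set: " $0
        else                         print "nofail missing: " $0
    }'
}

if [ -r /etc/fstab ]; then
    check_fstab_entry /tank < /etc/fstab
fi

# To see what is waiting on the device unit (read-only query):
#   systemctl list-dependencies --reverse dev-md-tank.device
```

If the entry is reported as "nofail missing", adding nofail to its options field is one way to keep a data-only array from blocking the boot.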

Plymouth is dropping core and I don’t know why. I read around the Internet some more, and like the complete idiot I am, I found a web page saying I should run “dracut.” That was a horrible, horrible mistake. Now it doesn’t boot at all. Just a cursor in the upper left corner and that’s it.

I got the system to boot by manually entering kernel and initrd 4.4.3-1 instead of the new kernel 4.5.0-3. Now the system is up (sort of; Xorg is sick, hangs at 100% CPU, and I can’t log in), but I have no idea how to edit the grub menu to force it to stay on the semi-working 4.4.3 version and avoid the completely broken 4.5.0 version, so reboots are perilous. I tried reinstalling “kernel-default” with zypper in the hope that this would rebuild whatever “dracut” broke, but it didn’t. Boots into 4.5.0 just hang – “Esc” from the three-dot screen just gives me a cursor and nothing else.
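
On pinning the older kernel: a rough sketch, assuming the generated configuration lives at /boot/grub2/grub.cfg and that /etc/default/grub contains GRUB_DEFAULT=saved (neither is stated in the post). The helper below only lists the menu titles. The grub2-set-default call is left as a comment because the titles are placeholders that must be copied from the real list.

```shell
#!/bin/sh
# Print the titles of all menuentry/submenu lines in grub.cfg-format
# text read from stdin. (Hypothetical helper; titles are single-quoted.)
list_grub_entries() {
    awk -F"'" '/^[ \t]*(menuentry|submenu) / { print $2 }'
}

cfg=/boot/grub2/grub.cfg   # assumed location
if [ -r "$cfg" ]; then
    list_grub_entries < "$cfg"
else
    echo "cannot read $cfg (try as root)"
fi

# Then, as root, pick the 4.4.3 entry by its titles (placeholders shown):
#   grub2-set-default 'SUBMENU TITLE>ENTRY TITLE'
```

An entry inside a submenu is addressed as the submenu title, a ">" and the entry title, which is why both kinds of line are listed.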

I’m stuck. Before I migrate back to one of the OpenSolaris distributions, where I feel like I know what I’m doing (and have the option of downgrading!), what should I try?

journalctl output from failed 4.5.0 kernel boot:

http://www.workingcode.com/journalctl-1.txt

and output from reversion back to 4.4.3:

http://www.workingcode.com/journalctl-2.txt

You may want to try Leap, which is not a rolling distro like Tumbleweed. Tumbleweed is in constant flux and as such is more likely to run into problems.

You can reinstall using the latest released media and do an upgrade, which will preserve your programs and your home.

openSUSE is packaged on the assumption that people use “zypper dup” to update between distributions, and each Tumbleweed release is considered its own distribution in this respect. While “zypper up” mostly works, there can be corner cases; it makes sense to run “zypper dup” every now and then to make sure everything is cleaned up.
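
As a small sketch of that routine: the wrapper name tw_update and the DRY_RUN switch are made up for illustration. Only “zypper refresh” and “zypper dup” are real commands. By default it prints what it would run; set DRY_RUN=0 as root to execute.

```shell
#!/bin/sh
# Refresh repositories, then do a full distribution upgrade.
# DRY_RUN=1 (the default here) only prints the commands.
tw_update() {
    for cmd in "zypper refresh" "zypper dup"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: $cmd"
        else
            $cmd || return 1   # stop if a step fails
        fi
    done
}

tw_update
```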

journalctl told me that systemd-udevd is in terrible shape. It is getting tons of fork failures, and is erroring out on all of the commands it’s trying to run as a result…

journalctl output from failed 4.5.0 kernel boot:

http://www.workingcode.com/journalctl-1.txt

and output from reversion back to 4.4.3:

http://www.workingcode.com/journalctl-2.txt

Both have tons of “fork failed: Resource temporarily unavailable”, so there is no difference in this respect. You just hit different problems, that’s all. Try disabling apparmor - pass “apparmor=0” on the kernel command line. I do not really see anything else that can be responsible for that.
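
For a one-off test, the parameter can be added from the GRUB menu: press “e” on the entry, append apparmor=0 to the line that starts with “linux”, and boot it. A minimal sketch for confirming afterwards that the running kernel actually received it (the helper name has_param is illustrative):

```shell
#!/bin/sh
# Succeed if parameter $1 appears as a whole word in command line $2.
has_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null || echo "")
if has_param "apparmor=0" "$cmdline"; then
    echo "apparmor=0 is on the kernel command line for this boot"
else
    echo "apparmor=0 is not on the kernel command line"
fi
```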

OK; thanks. I had no idea that this distinction existed. My reading of the documentation suggested that “dist-upgrade” was for switching distributions or doing something that was otherwise unusual. Assuming I can get the system working right again, I’ll switch to using “dup” every time.

Both have tons of “fork failed: Resource temporarily unavailable”, so there is no difference in this respect. You just hit different problems, that’s all. Try disabling apparmor - pass “apparmor=0” on the kernel command line. I do not really see anything else that can be responsible for that.

Big difference! In one case, it’s bootable and most of the system seems to run (I’ve lost the console – keyboard and mouse are completely dead; no lights at all, and screen is dark – but otherwise it “works”), and in the other case it’s just a brick.

And the lack of any way to disable an unwanted “.device” entry is absolutely infuriating. But, like SMF before it, I’ve become used to infuriation with systemd. There are many things it does that I don’t understand. :-/

OK, yes, I’ll try disabling apparmor. I never wanted it in the first place; it just came with the system and I couldn’t find any reasonably clear documentation on how to get rid of it completely. It seems pretty invasive, and doesn’t really work as well as Least Privilege.

Thanks for the note. Yes, I do feel now like I’ve made a mistake with Tumbleweed. I spent quite a while trying to evaluate which of the two fit better with my goals, and, frankly, I think I failed at that task.

I’ll plan that as a longer-term strategy.

To avoid misunderstanding - I did not mean to imply that this is the solution. What you see is obviously a bug; disabling apparmor was suggested as one step in troubleshooting it. If apparmor turns out to be the root cause, the next step would be to find out what is different in your case (because TW obviously works for a lot of users, so something must be different). But we need to start somewhere.
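
Another data point that may be worth collecting, offered as a sketch rather than a diagnosis: “Resource temporarily unavailable” is the EAGAIN error, which fork() returns when a limit on the number of processes or threads is hit. The script below prints the usual system-wide and per-user limits. The helper name read_limit is illustrative, and nothing here is specific to this bug.

```shell
#!/bin/sh
# Print the contents of a limit file, or "n/a" if it is not readable.
read_limit() {
    if [ -r "$1" ]; then cat "$1"; else echo "n/a"; fi
}

echo "kernel.threads-max:     $(read_limit /proc/sys/kernel/threads-max)"
echo "kernel.pid_max:         $(read_limit /proc/sys/kernel/pid_max)"
echo "per-user process limit: $(ulimit -u 2>/dev/null || echo n/a)"
echo "processes right now:    $(ls -d /proc/[0-9]* 2>/dev/null | wc -l)"

# On systemd versions that have per-unit task limits, this may also help:
#   systemctl show -p TasksMax systemd-udevd.service
```

Comparing these numbers between the 4.4.3 and 4.5.0 boots could show whether a limit changed with the upgrade.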