zypper dup breaks root/sudo and reboot fails at switching root

I posted about this over on the openSUSE Reddit, but I tried again today and am getting similar results. I’ll outline what has happened so far and then update this from my phone as I go, because I know that once I reboot, the only way to get back into a working system is to go into Maintenance Mode and use snapper to undo the update.

  1. Perform a zypper dup (nothing out of the ordinary).
  2. Wait a few minutes, then attempt to use sudo for something; the password will fail.
  3. Try opening YaST and it’ll complain that user ‘root’ is not found.
  4. Reboot.
  5. Startup will commence but fail at switch root (as far as I can tell).

To recover:

  1. Boot into Maintenance Mode.
  2. **snapper list** (get the numbers)
  3. snapper -v undochange xx…xx
  4. Reboot.
  5. The default option will no longer work; go to Advanced and select the previous kernel.
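
The recovery steps above can be sketched as a small script. This is only a sketch: `PRE` and `POST` are hypothetical placeholders for the snapshot numbers that `snapper list` shows around the failed update, and the `run` helper and `DRY_RUN` switch are my own additions so that nothing is executed until you have checked the printed commands.

```shell
#!/bin/sh
# Sketch of the rollback from Maintenance Mode (run as root).
# PRE and POST are hypothetical placeholders: take the real pre/post
# snapshot numbers of the failed zypper dup from `snapper list`.
# DRY_RUN=1 (the default here) only prints each command.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

undo_update() {
    pre=$1
    post=$2
    # Show the snapshots so the pre/post pair can be double-checked.
    run snapper list
    # Revert the files changed between the two snapshots.
    run snapper -v undochange "${pre}..${post}"
}

undo_update "${PRE:-100}" "${POST:-101}"
```

With `DRY_RUN=0 PRE=… POST=…` set in the environment the same script runs the commands for real; the reboot is left as a manual step.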

System Information:
Memory: 8GB
Processor: Intel Core i7-3687U
Graphics: Intel Ivybridge Mobile

If there’s any other information I can provide, please let me know. I thank anybody in advance for your help.

Some details please.
What version of Tumbleweed are you running (/etc/os-release), and what version (or date) was your unsuccessful ‘zypper dup’? What versions of ‘ucode-intel’ and ‘kernel-firmware’ are installed, and what repositories do you have enabled? Also, what is the make and model of the machine?

 >  zypper lr -d -E
 >  cat /etc/os-release
 >  rpm -q ucode-intel
 >  rpm -qa kernel-*

No problem. Thanks for helping me dig in to what I’ve got going on here.

Repos:

~> zypper lr -d -E
Repository priorities are without effect. All enabled repositories share the same priority.

# | Alias              | Name                        | Enabled | GPG Check | Refresh | Priority | Type   | URI                                                   | Service
--+--------------------+-----------------------------+---------+-----------+---------+----------+--------+-------------------------------------------------------+--------
1 | Visual Studio Code | Visual Studio Code          | Yes     | (r ) Yes  | No      |   99     | rpm-md | https://packages.microsoft.com/yumrepos/vscode        |        
4 | repo-non-oss       | openSUSE-Tumbleweed-Non-Oss | Yes     | (r ) Yes  | Yes     |   99     | yast2  | http://download.opensuse.org/tumbleweed/repo/non-oss/ |        
5 | repo-oss           | openSUSE-Tumbleweed-Oss     | Yes     | (r ) Yes  | Yes     |   99     | yast2  | http://download.opensuse.org/tumbleweed/repo/oss/     |        
7 | repo-update        | openSUSE-Tumbleweed-Update  | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/update/tumbleweed/       |

OS Version:
This is trickier. I use TW as my daily driver and, since I can easily back out this change, I’ve rolled back via Snapper and am on my last known working state. Right now, my os-release reads as below, but the problem was reproducible on 20180127 and is still reproducible on 20180128:


~> cat /etc/os-release 
NAME="openSUSE Tumbleweed"
# VERSION="20180125 "
ID=opensuse
ID_LIKE="suse"
VERSION_ID="20180125"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20180125"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
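
Since the snapshot number is what matters when comparing a working and a broken state, here is a minimal sketch for pulling just that value out. The function name `snapshot_version` is made up for illustration; it relies only on the fact that os-release is a list of shell-compatible variable assignments.

```shell
#!/bin/sh
# Print VERSION_ID from an os-release style file (default: /etc/os-release).
# The file consists of shell-compatible assignments, so it can be sourced;
# the subshell keeps its variables out of the calling shell.
snapshot_version() {
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$VERSION_ID" )
}

# Usage: snapshot_version            # reads /etc/os-release
#        snapshot_version /path/to/other/os-release
```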

Intel:
Again, the output below is current system. 20180128 has a ucode-intel upgrade to 20171117-4.1.

~> rpm -q ucode-intel
ucode-intel-20171117-3.1.x86_64

Kernel:
Again, on my current system. Upgrading to 20180128 has kernel-default of 4.14.15-1.6 available for me.

~> rpm -qa kernel-*
kernel-default-4.14.14-1.7.x86_64
kernel-default-4.14.12-1.5.x86_64
kernel-firmware-20180104-1.2.noarch

Hardware:
Dell XPS13 (L321X), 8GB, Intel Core i7-3687U, Intel Ivybridge Mobile graphics

If at any point you need me to return to full broken state to get information, just give me about 10-30 minutes (depending on if I’m at home or work; vastly different internet connections).

Thank you!

I just tried upgrading again now that TW has the 20180129 snapshot in there and the behavior is the same.

I locked ucode-intel to its current version and upgraded, and it still fails.

Do you maybe have installation of recommended packages disabled?
Though I don’t think that’s the problem, otherwise there would have been a few more reports of this problem (judging from the Mesa-dri “issue”).

Btw, “switching root” has nothing to do with “user root”, the former is when the system is switched to the installed / partition from the initrd.

If switching to root fails, somehow the “driver” for the / partition may be missing in the initrd.
So, any special partition setup, like LVM or the like?

Wouldn’t explain the missing “user root” though…

Grasping at straws, do you have a separate /boot partition that might be full?
But then you shouldn’t even be able to use snapshots/revert to a previous snapshot, unless that has changed recently…
Maybe your / partition is too full?

I believe I may be suffering from the same problem. After a recent update in Tumbleweed, the system boots to Grub, then loads the initial ramdisk, but just after the kernel prints ‘switching root’ it hangs; it then tries to mount root and all other btrfs subvolumes, but times out and offers rescue mode. In journalctl there is a line saying that initializing according to the udev database failed. Trying to mount the subvolumes manually hangs the system forever.

I can, however, boot a live disk and mount the subvolumes from there. They report no errors in the logs, scrub turns out fine, and SMART shows no hardware failure. To me it looks like the automounting of the partition/subvolume just fails after Grub. Grub, I believe, gets the location of root from the parameters it holds in its own config, or from the kernel when it is compiled in there. But I don’t know what happens after that. Is it that systemd automounts the devices with the help of udev?

Sorry if this isn’t the same issue.

EDIT: here is another person reporting it, and the screenshots look exactly like what I have now.
https://forums.opensuse.org/showthread.php/529380-failure-to-load-or-mount-certain-aspects-of-the-file-system-on-boot

I have not explicitly defined that in {zypp,zypper}.conf so it is using Zypper defaults, which I believe is to install recommended packages.

It is an encrypted installation (LVM+LUKS) with btrfs, set up through the Tumbleweed installer.

Yeah, that one is really weird to me. Never seen that happen before.

The /boot is separate but it isn’t close to being full, at least not looking at it through df.


~> df -h /boot
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       403M   96M  264M  27% /boot

Only 50% showing, but that’s because df is only seeing 40GB even though I extended the LV to the rest of the disk and Partitioner shows the rest of the disk. I’m guessing there’s some btrfs magic I need to do to expand to the rest of the LV, though. Just haven’t gotten around to it.


~> df -h /
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/system-root   40G   25G   16G  62% /

~> sudo lvdisplay
[sudo] password for root: 
  --- Logical volume ---
  LV Path                /dev/system/root
  LV Name                root
  VG Name                system
  LV UUID                xcoUgu-CkZp-z1QB-HICy-0QSZ-BQMc-uFeXRZ
  LV Write Access        read/write
  LV Creation host, time voyager, 2017-12-27 14:23:38 -0700
  LV Status              available
  # open                 1
  LV Size                468.34 GiB
  Current LE             119896
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     1024
  Block device           254:1

Thanks for hopping on, wolfi. I’ve read enough on here to be worried when you throw out things like “grasping at straws” and the confusion on missing root bit. I just assume you’ve seen everything.

Just making the LVM volume bigger does not expand the file system; you have to tell the FS to expand.

man btrfs-filesystem

for details
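
As a rough sketch of what that man page describes: a mounted btrfs file system can be grown to fill its underlying device with `btrfs filesystem resize max`. The `run` helper and `DRY_RUN` switch below are my own additions for illustration, so the commands are only printed until you opt in; a real run needs root.

```shell
#!/bin/sh
# Grow a mounted btrfs file system to fill its (already enlarged) LV.
# DRY_RUN=1 (the default here) only prints the commands; run as root
# with DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

grow_btrfs() {
    mnt=${1:-/}
    # "max" grows the file system to the full size of the device.
    run btrfs filesystem resize max "$mnt"
    # Confirm that df now reports the larger size.
    run df -h "$mnt"
}

grow_btrfs /
```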

I did immediately after I posted. I tried to update my post but was past my 10 minute limit. Just a stupid step I forgot, that’s all. :) Now it’s all good.


~> df -h /
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/system-root  469G   27G  442G   6% /

Based on what pylkko saw in the journal (udev), I held back udev, which systemd needed, so I held back those, too. The packages I held back (since they were dependent on each other) were

  • udev
  • systemd
  • systemd-bash-completion
  • systemd-logger
  • systemd-sysvinit

…and I was able to upgrade and boot.
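
For anyone wanting to try the same workaround, here is a minimal sketch using package locks. That the packages were held back with `zypper addlock` is an assumption on my part (the post does not say which mechanism was used), and the `run` helper with its `DRY_RUN` switch is only there so the commands are printed rather than executed by default.

```shell
#!/bin/sh
# Hold back the listed packages, then upgrade the rest of the system.
# Assumption: holding back is done with zypper package locks.
# DRY_RUN=1 (the default here) only prints the commands; run as root
# with DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

hold_and_upgrade() {
    # Lock every package given as an argument at its installed version.
    run zypper addlock "$@"
    # Upgrade everything that is not locked.
    run zypper dup
}

hold_and_upgrade udev systemd systemd-bash-completion systemd-logger systemd-sysvinit

# Once a fixed snapshot is out, `zypper removelock <names>` releases the locks.
```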


~> cat /etc/os-release 
NAME="openSUSE Tumbleweed"
# VERSION="20180129 "
ID=opensuse
ID_LIKE="suse"
VERSION_ID="20180129"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20180129"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"

Yeah it’s a really anal timeout since this is not a financial board. They could set it to 2 days without enabling submarine spammers to change posts months later.

That’s not the point: you have 5 min to change your post before it gets sent out to those who get the board via email.

And changing your post after it may have been read and answered to might create confusing threads. That is why, when you want to add additional information, you simply add a new post.
The five minutes is for those hasty ones that send off their posts before having checked it for typos. I assume it is clear that it is better to first think about how you best post your information before typing, then type, then check, then rethink and then when you are convinced that this is a useful and understandable post click on the Post Message button.

Right.
I just asked because adding system users has been split out into several packages which need to be installed to get that user added to the system. Not installing recommended packages might cause some users “missing”, but OTOH the necessary packages probably are required anyway.
Not really relevant for root anyway though, I think.

It is an encrypted installation (LVM+LUKS) with btrfs, set up through the Tumbleweed installer.

Ok, so probably something goes wrong when mounting it.

Is it accessible when you break the boot in the initrd via the rd.break boot option? (on /sysroot)
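
A quick way to answer that from the rd.break shell might look like the sketch below. It is only a heuristic of my own, not an official check: it assumes a root file system that is really mounted will have `etc` and `usr` directories under /sysroot, and the function name is made up.

```shell
#!/bin/sh
# Heuristic check, meant for the shell you get with rd.break:
# treat the given directory (default: /sysroot) as a mounted root
# file system if it contains the usual etc and usr directories.
sysroot_looks_mounted() {
    root=${1:-/sysroot}
    [ -d "$root/etc" ] && [ -d "$root/usr" ]
}

if sysroot_looks_mounted; then
    echo "/sysroot looks like a mounted root file system"
else
    echo "/sysroot is empty or not mounted"
fi
```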

The /boot is separate but it isn’t close to being full, at least not looking at it through df.

Ok.
Still I would try to recreate the initrd, with “sudo mkinitrd”.
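
After recreating the initrd it may also be worth checking that it contains the drivers needed for the / partition. The sketch below is an assumption-laden helper, not a definitive test: the function name is made up, the module list (dm-crypt, dm-mod, btrfs) is my guess for an LVM+LUKS+btrfs setup, and it simply searches an `lsinitrd` listing for matching `.ko` files.

```shell
#!/bin/sh
# Read an initrd file listing on stdin and print every module name
# from the arguments that has no matching "<name>.ko" file in it.
# Prints nothing when all requested modules are present.
missing_modules() {
    listing=$(cat)
    for m in "$@"; do
        printf '%s\n' "$listing" | grep -q -- "/$m\.ko" || printf '%s\n' "$m"
    done
}

# Usage (the module names are an assumption for LVM+LUKS+btrfs):
#   lsinitrd | missing_modules dm-crypt dm-mod btrfs
```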

Thanks for hopping on, wolfi. I’ve read enough on here to be worried when you throw out things like “grasping at straws” and the confusion on missing root bit. I just assume you’ve seen everything.

Not really… ;)
And I have to admit that I have almost no experience with LVM.

OK, that narrows it down a bit at least…

I suppose you have the previous version, i.e. the one before the latest update?
Here is the changelog:
https://build.opensuse.org/request/show/570497

Doesn’t really sound like it would cause your problem, maybe it’s more related to updating udev or systemd per se.
Maybe the LVM somewhat stops working during the update (could explain the missing root user), and therefore not all changes can be written (e.g. the new initrd may be corrupted/truncated) or something.
That is just guessing though, maybe the best course of action would be a bug report (if it is reproducible).

Like, that should be mentioned somewhere! Interesting.

That’s not the topic of this thread though, and is rather irrelevant here.

I agree that it doesn’t really sound like it but I can’t argue with the results, either. I think what you mention below feels quite plausible.

This sounds/feels like it could be a thing. I’ve honestly never seen an update break root in this way in my time using linux.

I do have a bug report I filed but I filed it under Maintenance because the error was presenting during upgrade. If we’re talking udev/systemd, perhaps I should change it to Basesystem.

In the openSUSE Forums Terms and Conditions (that you should have read before you signed up as a forum member and that you can revisit by clicking on the link at the bottom of almost every page in these forums), it says:

Once you submit your post in these forums, you have 10 minutes to edit it. After 10 minutes, if you need to, you should post a reply with any corrective information. Why? Two reasons. First, the NNTP protocol doesn’t support editing of posts and so edits won’t transit our gateway unless done before the cron job runs the gateway to sync the messages. Second, editing a post after it has replies could invalidate those replies because the original information changed. Posting a follow up reply with additional/changed information allows any previous reply to stay in context.

And as wolfi323 already said, when you have questions about the forums, there is a sub-forum for that. So please start a separate thread for each subject, in the correct sub-forum. That is another way to keep threads understandable and efficient.