No GUI after reboot, no obvious errors (openSuSE 13.2, nvidia drivers)

[Cross-posted: https://unix.stackexchange.com/questions/273497/no-gui-after-reboot-no-obvious-errors-opensuse-13-2-nvidia-drivers]

Short version: Rebooted for the first time in months, following daily updates; proprietary Nvidia drivers; no GUI (KDE); no smoking gun (so far) in logs.

Specs:

  • openSuSE 13.2 (64-bit), fully up to date
  • kernel 3.16.7-35-desktop
  • Nvidia GeForce 970

Long version:
I have a desktop computer with an Nvidia GeForce 970 video card, and two monitors hooked up to it, one via DVI and one via HDMI.
I am running openSuSE 13.2, which I update almost daily, currently at kernel 3.16.7-35-desktop, and use the proprietary Nvidia drivers, currently at 361.28-33.1.x86_64, for the sake of games. That is the latest version of the drivers available from the official repo for openSuSE 13.2, although v361.42 is available for direct download and installation “the hard way”.
Kernel and video drivers had been updated at least once since the last time I rebooted (or even logged out), which was at least a few months ago.

Now I have rebooted, and cannot get back into the GUI (KDE).
The console output from the boot process appears on the DVI monitor, which is where Grub2 normally runs; the HDMI monitor, which is my primary display, now gets no signal at all. The output is not perfectly consistent in where it stops: sometimes “Reached target Graphical Interface”, sometimes “Starting Command Scheduler…”, sometimes “Started SuSEfirewall2 phase 2”. But it is always after “Started X Display Manager”.
CTRL-ALT-F7 shows a completely blank screen. (1 through 6 are normal console terminals, and 8 shows just a flashing underscore.)

I have force-reinstalled the four Nvidia driver packages (G04) to make sure that they are compiled for my current kernel.
Interestingly, some of the package names are 361.28.k3.16.6_2-33.1.x86_64, which suggests that they are meant for kernel v3.16.6-2, whereas I am running 3.16.7-35. And compiling for 3.16.6-2 fails, with files-not-found /lib/modules/3.16.6-2-desktop/modules.{order|builtin}, although otherwise that folder looks the same as the modules for 3.16.7-35-desktop and 3.16.7-32-desktop.

journalctl --full -b
shows messages from kdm indicating a successful start, e.g. “plymouth is active on VT7, reusing for :0”
and “kdeinit5: opened connection to :0”.
The only messages I can find in the boot logs that might be a cause for concern are:

  • NVRM: Your system is not currently configured to drive a VGA console on the primary VGA device. The NVIDIA Linux graphics driver requires the use of a text-mode VGA console. Use of other console drivers including, but not limited to, vesafb, may result in corruption and stability problems, and is not supported.
  • Registry: Xlib: extension “XEVIE” missing on display :0
  • QXcbConnection: XCB error: 148 (Unknown)

/var/log/Xorg.0.log
has only one message, right at the end of the file, that looked in any way suspicious to me (but I’m no expert):
NVIDIA(0): ACPI: failed to connect to the ACPI event daemon; the daemon may not be running or the “AcpidSocketPath” X configuration option may not be set correctly. When the ACPI daemon is available, the NVIDIA X display driver will try to use it to receive ACPI event notifications.
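
For anyone repeating this check, a rough sketch of how to pull only the error and warning lines out of an X log instead of reading it end to end. The function name `xorg_problems` is made up for illustration; it assumes the usual `(EE)` and `(WW)` markers that the X server puts on such lines, and it skips the marker legend near the top of the log.

```shell
#!/bin/sh
# xorg_problems is a hypothetical helper name, not an existing tool.
# Prints lines marked (EE) (error) or (WW) (warning) from an X server log.
xorg_problems() {
    log="${1:-/var/log/Xorg.0.log}"
    # -F: fixed strings, so the parentheses are matched literally.
    # The second grep drops the legend line that explains the markers.
    grep -F -e '(EE)' -e '(WW)' "$log" | grep -F -v '(NI) not implemented'
}

# Usage (not run automatically):
#   xorg_problems /var/log/Xorg.0.log
```

An empty result only means the X server itself logged no errors or warnings; it says nothing about whether the display manager or the desktop session started properly.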

Booting into Windows, everything works as expected, so I am ruling out an issue with the video card or HDMI monitor.

I have not yet tried uninstalling the Nvidia drivers and switching to nouveau, nor manually updating the Nvidia drivers to 361.42.

I can get into the GUI as normal by rolling the system back to an earlier snapper snapshot, but would rather figure out the current issue than potentially lose months of files.

I am willing to dist-upgrade the system to Leap 42.1, but not without good reason to think that doing so would fix the GUI issue. (Interestingly, the version of the Nvidia drivers from the official repo is the same for 42.1 as it is for 13.2: 361.28.)

So: suggestions on how to fix this, or where to look next?

The kernel flavour must match, and all packages must have the same version number. In YaST, use the version tab to change them if needed.

You should remove and then reinstall, not force-reinstall.

Hi, a few hints from what I saw in a similar situation when the kernel got updated after installation of the nvidia packages.

Drivers version 361.28 are OK, no need to go for 361.42 (unless you are an extreme gamer… and willing to bear the risks :wink: )
Check that nvidia-gfxG04-kmp-desktop-361.28_k3.16.6_2-33.1.x86_64.rpm is installed: as gogalthorp advised, kernel flavor must match.
It should still work with kernel 3.16.7-35-desktop, but please check that at least a symlink to the nvidia.ko module or the module itself is available at /lib/modules/3.16.7-35-desktop/updates or at /lib/modules/3.16.7-35-desktop/weak-updates/updates. If not so, uninstall and reinstall might help, as gogalthorp advised.
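
A minimal sketch of that check, for convenience. The helper name `list_nvidia_modules` is hypothetical, and the second argument (the modules root) exists only so the function can be tried against a test directory; by default it looks under /lib/modules for the running kernel.

```shell
#!/bin/sh
# list_nvidia_modules is a hypothetical helper name.
# Lists nvidia*.ko files or symlinks found directly in the two directories
# mentioned above, for a given kernel release (default: the running kernel).
list_nvidia_modules() {
    kver="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"   # assumption: override only for testing
    for dir in "$root/$kver/updates" "$root/$kver/weak-updates/updates"; do
        [ -d "$dir" ] || continue
        # -maxdepth 1: do not descend; match regular files and symlinks
        find "$dir" -maxdepth 1 -name 'nvidia*.ko' \( -type f -o -type l \)
    done
}

# Usage (not run automatically):
#   list_nvidia_modules 3.16.7-35-desktop
```

If nothing is printed, neither directory holds an nvidia module for that kernel. `modinfo -F filename nvidia` is another way to see which file, if any, modprobe would resolve for the running kernel.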
After reinstalling, running the following as superuser

dracut --host-only --force

to rebuild the initrd might help.

Per @gogalthorp’s suggestion, I tried uninstalling and reinstalling these packages from the official repo: nvidia-computeG04 nvidia-gfxG04-kmp-desktop nvidia-glG04 x11-video-nvidiaG04
That did get me a KDE GUI, but only on one monitor, and with a wildly wrong resolution.
When I then ran the nvidia settings application, to tell it about the second monitor and correct resolutions, it claimed that I didn’t have any nvidia drivers.
It prompted to install them, and I let it go ahead (even though they were already installed).
On reboot, I again had no GUI at all.
That was when I manually installed v361.42.
And that got me my usual GUI, on both monitors, with the correct resolutions.
So I am now unblocked.

But I still have questions, as I would like to understand this business better.

  1. So kernel “flavor” refers only to “default/desktop/pae”, then? (It was definitely the desktop versions I had, of kernel and all driver packages.) I had thought that the drivers also needed to be compiled against the actual kernel version…

  2. Any idea what it was that had broken the linkage between my kernel and the nvidia drivers? I was under the impression that driver updates always accompanied kernel updates.

  3. /lib/modules/3.16.7-35-desktop/weak-updates/updates/ is empty. But /lib/modules/3.16.7-35-desktop/kernel/drivers/video/ does contain nvidia.ko (and -modeset and -uvm) - the actual files, not symlinks - with a timestamp appropriate to when I was doing it. I assume that’s just a difference between manual installation and package installation?
    (Interestingly, the modules directory for the -32 kernel does have symlinks in the updates directory for nvidia-modeset.ko and -uvm ONLY, and they point to the updates directory for the 3.16.6-2 kernel, but that directory doesn’t exist. Judging by the timestamps, those symlinks were created when I uninstalled/reinstalled 361.28 from the repo.)

  4. What is the recommended procedure for switching from the manual installation back to the repo? (All else being equal, I would rather be on the repo; but at this point, I think I’ll wait until the repo catches up with or passes 361.42.) I assume I’ll run /usr/bin/nvidia-uninstall, and then zypper install the driver packages? Is a reboot necessary in between?

Thanks a lot!

Hi, nice to know you are back to business and willing to understand what this mess is all about :wink:
I’m no video guru, so don’t take my words at face value; anyway my understanding is as follows.

That did get me a KDE GUI, but only on one monitor, and with a wildly wrong resolution.

The Nvidia driver apparently didn’t engage; what you saw was probably the vesa framebuffer or the nouveau driver acting as a fallback.
Either the nvidia.ko was not included in the initrd at boot time, or it was in the wrong place, or it was not compiled at all.
Your manual install had better chances, apparently.

  1. So kernel “flavor” refers only to “default/desktop/pae”, then? (It was definitely the desktop versions I had, of kernel and all driver packages.) I had thought that the drivers also needed to be compiled against the actual kernel version…

Yes, “flavour” refers only to “desktop” and the like, and yes driver (modules) should be compiled against the running kernel.
But modules compiled for, say, 3.16.6 should work for any 3.16.x and a simple symlink to the right directory in the new kernel tree usually is enough for a minor version update.
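
To see which kernel a given kmp package was built against, the version can be read out of the package name. A small sketch, assuming the naming pattern seen in this thread (`<driver>_k<kernel version with "-" written as "_">-<release>`); the function name `kmp_built_for` is made up for illustration.

```shell
#!/bin/sh
# kmp_built_for is a hypothetical helper name.
# Extracts the kernel version embedded in a kmp package name, assuming the
# "..._k<kernel>-<release>" pattern seen in this thread.
kmp_built_for() {
    # Capture the digits/dots/underscores after "_k" up to the next "-",
    # then turn the underscore back into the kernel's "-".
    printf '%s\n' "$1" \
        | sed -n 's/.*_k\([0-9][0-9._]*\)-.*/\1/p' \
        | tr '_' '-'
}

kmp_built_for nvidia-gfxG04-kmp-desktop-361.28_k3.16.6_2-33.1.x86_64
# prints 3.16.6-2
```

In practice the argument would come from `rpm -q nvidia-gfxG04-kmp-desktop`, and the result can be compared by eye with `uname -r`. A mismatch in the last digits is not necessarily a problem, for the reason given above.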

  2. Any idea what it was that had broken the linkage between my kernel and the nvidia drivers? I was under the impression that driver updates always accompanied kernel updates.

Maybe the problem with the repo package is that it was never compiled against 3.16.6 in your system, possibly crippling the installer.
Or something is broken in the packaging of nvidia-gfxG04-kmp-desktop-361.28_k3.16.6_2-33.1.x86_64.rpm
When a kernel update arrives, external modules are rebuilt automatically only if the “dkms” package is installed and configured and the kernel-devel packages matching the new kernel are installed as well. Otherwise you have to reinstall the driver packages manually while running the updated kernel; in other words: install new kernel > reboot > reinstall drivers > maybe rebuild the initrd.

  3. /lib/modules/3.16.7-35-desktop/weak-updates/updates/ is empty. But /lib/modules/3.16.7-35-desktop/kernel/drivers/video/ does contain nvidia.ko …

I cannot check since I wiped my 13.2 install long ago, sorry.

  4. What is the recommended procedure for switching from the manual installation back to the repo?

Your assumption seems reasonable, but I don’t know for sure; better wait for a video guru to answer that :wink:

The general rule is: remove, then install the new version.

I’m in about the same boat as the OP, with no GUI after boot. I arrived here by a slightly different route; I’ve been away from my desktop for a while (~3 months) so when I got back I applied a lot of updates. First did online update in YAST, then let the update widget do the rest, including Packman and the NVIDIA drivers. Now when I boot my main monitor hangs just after ‘Starting X server’. In /var/log/kdm.log I get the following snippet:

(==) Using system config directory "/usr/share/X11/xorg.conf.d"
modprobe: ERROR: could not find module by name='nvidia'
modprobe: ERROR: could not insert 'nvidia': Function not implemented
modprobe: ERROR: could not insert 'nvidia_uvm': Unknown symbol in module, or unknown parameter (see dmesg)
mknod: missing operand after '0'
Try 'mknod --help' for more information.
(EE) 
Fatal server error:
(EE) no screens found(EE) 
(EE) 

journalctl -b | grep -i nvidia does show a lot of undefined symbols from nvidia_uvm.
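
To boil that output down to just the missing symbol names, something like the following might help. It is only a sketch: `unknown_symbols` is a made-up name, and it assumes the kernel's usual `Unknown symbol NAME` wording in the log lines.

```shell
#!/bin/sh
# unknown_symbols is a hypothetical helper name.
# Reads kernel log text on stdin and prints each unresolved symbol once,
# assuming lines of the form "...: Unknown symbol NAME (err ...)".
unknown_symbols() {
    sed -n 's/.*Unknown symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Usage (not run automatically):
#   journalctl -k -b | unknown_symbols
```

If the names that come out are all symbols that nvidia.ko would normally provide, that fits with nvidia_uvm failing only because nvidia itself could not be loaded.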

I’ve removed and reinstalled the nvidia packages several times, and I have the nvidia.ko, nvidia-modeset.ko, and nvidia-uvm.ko modules in /lib/3.16.6-desktop/updates. In /lib/3.16.7-35-desktop/weak-updates/updates I get symlinks to the nvidia-uvm.ko and nvidia-modeset.ko files, but NOT the nvidia.ko.

Adding the symlink, running dracut and rebooting doesn’t help.
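
A symlink in weak-updates only helps if its target actually exists, so it may be worth checking for dangling links before rebuilding the initrd. A minimal sketch; the function name `dangling_links` is hypothetical, and the directory in the usage line is just the one discussed earlier in the thread.

```shell
#!/bin/sh
# dangling_links is a hypothetical helper name.
# Prints every symlink directly inside the given directory whose target
# does not exist, together with that target.
dangling_links() {
    dir="$1"
    [ -d "$dir" ] || return 0
    for link in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: following it leads nowhere
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s -> %s\n' "$link" "$(readlink "$link")"
        fi
    done
}

# Usage (not run automatically):
#   dangling_links /lib/modules/3.16.7-35-desktop/weak-updates/updates
```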

Any other suggestions, or should I follow the OP and fall back to the hard way?

Remove and reinstall the NVIDIA driver. This can be done by booting to a terminal and running YaST, or via zypper.

Done that several times (login on Alt-Ctl-F1, zypper remove, reboot, and install with YAST), same results.

Another thing I remembered - when this first happened, I was getting permission errors on /dev/dri/card0. I added group ‘video’ to user ‘kdm’ and those errors went away.

Check whether you are out of space on root. Are you using Btrfs? How large is the root partition? Is snapper running?
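
A quick way to answer the first of those questions from a console, as a rough sketch. The function name `root_usage_pct` is made up; the Btrfs and snapper commands in the comments are the standard ones but are not run here.

```shell
#!/bin/sh
# root_usage_pct is a hypothetical helper name.
# Prints how full a filesystem is, as a bare number (percent used).
root_usage_pct() {
    # df -P: POSIX format, one line per filesystem; column 5 looks like "62%"
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

echo "root is $(root_usage_pct /)% full"

# On Btrfs, plain df can be misleading because snapshots share space.
# These give a fuller picture (run as root):
#   btrfs filesystem df /
#   snapper list
```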

df says / was only 62% full, but I deleted a couple of snapshots anyway. However, when I removed and re-installed the nVidia drivers, I got 361.42, so the repo was updated since last night.

One of those two actions solved my problem, and I don’t think I care which one.