Problems adding nvidia proprietary drivers to initrd file

I suppose you could try removing Plymouth from the initrd, so that it only starts later, when the system switches to the real root filesystem (/).

This is what the nvidia rpms do:

omit_dracutmodules+=" plymouth "
add_drivers+=" nvidia nvidia-drm nvidia-modeset nvidia-uvm "

(but I have no idea whether plymouth works with them installed now or not)

By adding force_drivers+= to my dracut nvidia.conf file, the nVidia drivers appear to get loaded earlier, but reading the journal I noticed that they’re tainting the kernel because they’re not in the kernel subdirectory. When I was at the shell, though, I saw they were loaded. dmesg shows:

[    5.092835] nvidia: loading out-of-tree module taints kernel.
[    5.107163] nvidia: module license 'NVIDIA' taints kernel.
[    5.121416] Disabling lock debugging due to kernel taint

I think I should probably fix that before continuing any further. I have to figure out how.

The nvidia driver taints the kernel because it is a third-party module (with an incompatible license, even), not part of the kernel itself.
You can’t really fix that.

Shouldn’t make a difference though, except that bug reports against the kernel may get rejected… :wink:
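If you want to see exactly which taint flags are set, the kernel exposes them as a bitmask in /proc/sys/kernel/tainted. The sketch below decodes that number into the single-letter flags from the kernel's tainted-kernels documentation. The function name decode_taint is made up for this example, and it only handles bits 0 to 17 (newer kernels define more). With the nvidia module loaded you would expect at least P (proprietary license) and O (out-of-tree module), matching the two dmesg lines above.

```shell
#!/bin/sh
# Decode a /proc/sys/kernel/tainted value into single-letter taint flags.
# Letter N of the string corresponds to bit N-1 (bit 0 = P, bit 12 = O, bit 13 = E).
# Only bits 0-17 are handled; newer kernels define additional bits.
decode_taint() {
    value=$1
    letters="PFSRMBUDAWCIOELKXT"
    out=""
    i=0
    while [ "$i" -lt 18 ]; do
        # Test bit i of the bitmask.
        if [ $(( (value >> i) & 1 )) -eq 1 ]; then
            out="$out$(printf '%s' "$letters" | cut -c $((i + 1)))"
        fi
        i=$((i + 1))
    done
    # Print "none" for an untainted kernel.
    printf '%s\n' "${out:-none}"
}
```

Usage: `decode_taint "$(cat /proc/sys/kernel/tainted)"`.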

They’re being installed in the /lib/modules/$(uname -r)/updates directory because of DKMS. If I disable the DKMS option, they get installed in the proper location.

I read that it could disable certain API functions and thought that was possibly a problem. Thanks for the clarification.

So, I came to the same conclusion as you: disable Plymouth in the initrd and wait for the real Plymouth to start. I went about it by modifying /etc/default/grub and adding rd.plymouth=0 to GRUB_CMDLINE_LINUX_DEFAULT.

Then I ran grub2-mkconfig and dracut again.

I don’t see where dracut reads those values though. I might try it the way the RPM does it, although the RPM didn’t show a graphical boot screen either.
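For what it’s worth, rd.plymouth=0 isn’t read from a dracut configuration file: it is a kernel command-line option that the initramfs parses at boot (dracut.cmdline(7) lists it as disabling Plymouth only inside the initramfs). A sketch of the /etc/default/grub change follows, where `<existing options>` is a placeholder for whatever the line already contains. Since the option lives on the command line, regenerating grub.cfg should be enough on its own. By contrast, plymouth.enable=0 disables Plymouth for the whole boot.

```shell
# /etc/default/grub is sourced as shell by grub2-mkconfig.
# "<existing options>" is a placeholder: keep what is already there.
GRUB_CMDLINE_LINUX_DEFAULT="<existing options> rd.plymouth=0"
# Then regenerate the menu:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
# After rebooting, confirm the option actually reached the kernel:
#   cat /proc/cmdline
```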

I see this in the journal:


Feb 03 16:09:53 eugene kernel: DMAR: Forcing write-buffer flush capability
Feb 03 16:09:53 eugene kernel: DMAR: Disabling IOMMU for graphics on this chipset

My board doesn’t have built-in video, and my old Intel(R) Core™2 Extreme CPU X9770 doesn’t support built-in video either. My understanding is that IOMMU is disabled for my nVidia card, but I don’t think that’s causing any issues. As I understand it, that only matters when running a VM with passthrough. I do use passthrough, but on my actual server, an HP DL380 Gen9 running CentOS 7, not on this openSUSE system.

I see a few other things in the journal which make me wonder if they’re causing problems or not.

Because you guys are more knowledgeable about Linux than I am, and my Google searches haven’t turned up anything suggesting these are related to my problem, I’ll just post them here for you to see:


Feb 03 16:09:53 eugene kernel: x86: Booting SMP configuration:
Feb 03 16:09:53 eugene kernel: .... node  #0, CPUs:      #1
Feb 03 16:09:53 eugene kernel: TSC synchronization [CPU#0 -> CPU#1]:
Feb 03 16:09:53 eugene kernel: Measured 2150946 cycles TSC warp between CPUs, turning off TSC clock.
Feb 03 16:09:53 eugene kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed
Feb 03 16:09:53 eugene kernel:  #2 #3
Feb 03 16:09:53 eugene kernel: smp: Brought up 1 node, 4 CPUs
Feb 03 16:09:53 eugene kernel: smpboot: Total of 4 processors activated (25600.20 BogoMIPS)
...
Feb 03 16:09:53 eugene kernel: ACPI: bus type PCI registered
Feb 03 16:09:53 eugene kernel: acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
Feb 03 16:09:53 eugene kernel: PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xe8000000-0xebffffff] (base 0xe8000000)
Feb 03 16:09:53 eugene kernel: PCI: MMCONFIG at [mem 0xe8000000-0xebffffff] reserved in E820
Feb 03 16:09:53 eugene kernel: pmd_set_huge: Cannot satisfy [mem 0xe8000000-0xe8200000] with a huge-page mapping due to MTRR override.
Feb 03 16:09:53 eugene kernel: PCI: Using configuration type 1 for base access
Feb 03 16:09:53 eugene kernel: HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
...
Feb 03 16:09:53 eugene kernel: acpi PNP0A03:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
Feb 03 16:09:53 eugene kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
...
Feb 03 16:09:53 eugene kernel: DMAR: Forcing write-buffer flush capability
Feb 03 16:09:53 eugene kernel: DMAR: Disabling IOMMU for graphics on this chipset
...
Feb 03 16:09:53 eugene kernel: pci 0000:00:1f.0: quirk: [io  0x0400-0x047f] claimed by ICH6 ACPI/GPIO/TCO
Feb 03 16:09:53 eugene kernel: pci 0000:00:1f.0: quirk: [io  0x0480-0x04bf] claimed by ICH6 GPIO
...
Feb 03 16:09:53 eugene kernel: pci 0000:03:00.0: disabling ASPM on pre-1.1 PCIe device.  You can enable it with 'pcie_aspm=force'
...
Feb 03 16:09:53 eugene kernel: pci 0000:01:00.0: vgaarb: setting as boot VGA device
Feb 03 16:09:53 eugene kernel: pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
Feb 03 16:09:53 eugene kernel: pci 0000:01:00.0: vgaarb: bridge control possible
Feb 03 16:09:53 eugene kernel: vgaarb: loaded
...
Feb 03 16:09:53 eugene kernel: pci 0000:01:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
Feb 03 16:09:53 eugene kernel: pci 0000:03:00.0: async suspend disabled to avoid multi-function power-on ordering issue
Feb 03 16:09:53 eugene kernel: pci 0000:03:00.1: async suspend disabled to avoid multi-function power-on ordering issue
...
<after initrd loads>
Feb 03 16:09:53 eugene kernel: vesafb: mode is 1920x1080x32, linelength=7680, pages=0
Feb 03 16:09:53 eugene kernel: vesafb: scrolling: redraw
Feb 03 16:09:53 eugene kernel: vesafb: Truecolor: size=8:8:8:8, shift=24:16:8:0
Feb 03 16:09:53 eugene kernel: vesafb: framebuffer at 0xe7000000, mapped to 0xffff994441800000, using 8128k, total 8128k
Feb 03 16:09:53 eugene kernel: Console: switching to colour frame buffer device 240x67
Feb 03 16:09:53 eugene kernel: fb0: VESA VGA frame buffer device
...
Feb 03 16:09:53 eugene systemd-sysctl[180]: Couldn't write '0' to 'dev/cdrom/autoclose', ignoring: No such file or directory
Feb 03 16:09:53 eugene systemd-vconsole-setup[173]: /usr/bin/setfont failed with error code 71.
Feb 03 16:09:53 eugene systemd-vconsole-setup[173]: PIO_UNIMAPCLR: Input/output error
Feb 03 16:09:53 eugene systemd-vconsole-setup[173]: Setting fonts failed with a "system error", ignoring.
Feb 03 16:09:53 eugene dracut-cmdline[189]: dracut- dracut-044-10.2 dracut-044-10.2
...
Feb 03 16:09:53 eugene dracut-cmdline[189]: Using kernel command line parameters: rd.driver.pre=nvidia rd.driver.pre=nvidia_modeset rd.driver.pre=nvidia_uvm rd.driver.pre=nvidia_drm resume=UUID=ac985b5e-fb17-442f-b6ac-24c1c53e5bff root=UUID=3c6e7faf-093a-49df-83db-ca247620f093 rootfstype=ext4 rootflags=rw,relatime,data=ordered BOOT_IMAGE=/vmlinuz-4.14.15-1-default root=UUID=3c6e7faf-093a-49df-83db-ca247620f093 resume=/dev/sda2 splash quiet showopts nvidia-drm.modeset=1
Feb 03 16:09:53 eugene kernel: ipmi message handler version 39.2
Feb 03 16:09:53 eugene kernel: ipmi device interface
Feb 03 16:09:53 eugene kernel: nvidia: loading out-of-tree module taints kernel.
Feb 03 16:09:53 eugene kernel: nvidia: module license 'NVIDIA' taints kernel.
Feb 03 16:09:53 eugene kernel: Disabling lock debugging due to kernel taint
Feb 03 16:09:53 eugene kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 247
Feb 03 16:09:53 eugene kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
Feb 03 16:09:53 eugene kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.25  Wed Jan 24 20:02:43 PST 2018 (using threaded interrupts)
Feb 03 16:09:53 eugene kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.25  Wed Jan 24 19:29:37 PST 2018
Feb 03 16:09:53 eugene kernel: nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 246
Feb 03 16:09:53 eugene kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Feb 03 16:09:54 eugene kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Feb 03 16:09:54 eugene kernel: caller _nv001170rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
Feb 03 16:09:54 eugene kernel: nvidia-modeset: Allocated GPU:0 (GPU-840f2135-d35a-2aa8-4af4-02d87a8f3fc8) @ PCI:0000:01:00.0
Feb 03 16:09:54 eugene kernel: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
Feb 03 16:09:54 eugene kernel: [drm] No driver support for vblank timestamp query.
Feb 03 16:09:54 eugene kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Feb 03 16:09:54 eugene systemd[1]: Started dracut pre-udev hook.
-- Subject: Unit dracut-pre-udev.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit dracut-pre-udev.service has finished starting up.
-- 
-- The start-up result is done.
...
Feb 03 16:09:54 eugene systemd[1]: Started Show Plymouth Boot Screen.
-- Subject: Unit plymouth-start.service has finished start-up
...
Feb 03 16:10:00 eugene kernel: audit: type=1400 audit(1517692200.420:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="ping" pid=446 comm="apparmor_parser"
Feb 03 16:10:00 eugene kernel: audit: type=1400 audit(1517692200.708:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="klogd" pid=456 comm="apparmor_parser"
Feb 03 16:10:00 eugene systemd-udevd[447]: invalid key/value pair in file /etc/udev/rules.d/99-usb-serial.rules on line 1, starting at character 42 (',')
Feb 03 16:10:00 eugene systemd-udevd[447]: invalid key/value pair in file /etc/udev/rules.d/99-usb-serial.rules on line 2, starting at character 25 (',')
Feb 03 16:10:00 eugene systemd[1]: Started udev Kernel Device Manager.
...
Feb 03 16:10:01 eugene kernel: ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042F conflicts with OpRegion 0x000000000000042C-0x000000000000042D (\GP2C) (20170728/utaddress-247)
Feb 03 16:10:01 eugene kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Feb 03 16:10:01 eugene kernel: lpc_ich: Resource conflict(s) found affecting gpio_ich
...
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sdb [SAT], WDC WD2002FAEX-007BA0, S/N:WD-WMAY02244308, WWN:5-0014ee-6012656ea, FW:05.01D05, 2.00 TB
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sdb [SAT], found in smartd database: Western Digital Caviar Black
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD2002FAEX_007BA0-WD_WMAY02244308.ata.state
Feb 03 16:10:06 eugene smartd[1829]: Monitoring 2 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sda [SAT], 16 Currently unreadable (pending) sectors
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sda [SAT], 16 Offline uncorrectable sectors
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 55 to 56
Feb 03 16:10:06 eugene smartd[1829]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 45 to 44

It looks like my main hard drive might be dying, which is no good.
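To keep an eye on the two smartd warnings above for /dev/sda, the relevant numbers are SMART attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable) in the output of `smartctl -A /dev/sda` (smartmontools; usually needs root). Here is a small filter, with a made-up function name, that pulls out just those raw values. It assumes the usual smartctl attribute table, where the attribute ID is the first column and the raw value is the last.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the name and raw value of
# attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable).
# Non-zero values mean sectors the drive could not read reliably.
bad_sector_counts() {
    awk '$1 == 197 || $1 == 198 { print $2, $NF }'
}
```

Usage: `smartctl -A /dev/sda | bad_sector_counts`.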

But this also shows what you said: the nvidia drivers are being loaded in the same second that Plymouth is started. I find it hard to believe I couldn’t somehow insert a short pause there, maybe a second or two, either by writing a udev rule that executes a script or by modifying the systemd plymouth-start.service file. Or plymouth-start.service could somehow call an nVidia API (or whatever we call them in Linux), wait until that call succeeds, and only then continue starting Plymouth.

Granted, this would be a specific fix just for my system, but it’d make me happy…
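The “short pause” idea could be sketched as a systemd drop-in for plymouth-start.service. The file names below are hypothetical, and two seconds is just the figure from the paragraph above. Because Plymouth starts from the initrd, the drop-in also has to end up inside the initramfs. dracut’s install_items option is one way to request that, but I haven’t verified that the unit in the initrd actually picks the drop-in up, so check with lsinitrd after rebuilding. This is a timing workaround, not a fix.

```
# /etc/systemd/system/plymouth-start.service.d/10-delay.conf  (hypothetical name)
[Service]
# Give udev a moment to finish setting up the nvidia DRM device before
# Plymouth picks a renderer. Assumes sleep exists at /bin/sleep inside the
# initramfs; the delay length is a guess.
ExecStartPre=/bin/sleep 2
```

```shell
# /etc/dracut.conf.d/99-plymouth-delay.conf  (hypothetical name)
# Copy the drop-in into the initramfs; keep the spaces inside the quotes.
install_items+=" /etc/systemd/system/plymouth-start.service.d/10-delay.conf "
# Rebuild and check:  dracut -f && lsinitrd | grep plymouth-start
```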

Hard drive died.

It was a 3TB Seagate. Never had good luck with Seagates. Got a WD Gold ordered, 4TB, Enterprise class. Once it arrives, I’ll install an OS and then try to recover some data. Had /home on its own partition, and seeing how grub wouldn’t even execute, I’m hoping it was the beginning of the hard drive that got damaged and not so much the later sections.

Unfortunately, I’m very sick, lost a lot of weight real quick like, don’t have much energy, and haven’t made a backup in a while. I was working on some stuff and like an idiot, just kept putting off the backup until I got something a bit more stable. Hopefully I can recover it. Until then, we’ll have to put this on pause.

Just wanted to give an update. I put CentOS 7 on, and I find it odd that, if this is a race condition, other distributions don’t seem to have any trouble showing Plymouth. I believe even non-Tumbleweed openSUSE shows the Plymouth boot splash screen correctly.

Not sure it’s really a race condition that’s causing this problem.

Plymouth looks fine with the nouveau driver.

Have you looked at Chapter 33, “Direct Rendering Manager Kernel Modesetting (DRM KMS)”?

The arch wiki says:

nvidia 364.16 adds support for DRM kernel mode setting. To enable this feature, add the nvidia-drm.modeset=1 kernel parameter, and add nvidia, nvidia_modeset, nvidia_uvm and nvidia_drm to the MODULES list of your initramfs configuration.

I never got it working; I’m not sure how to add those things (e.g. nvidia_uvm).

I disabled plymouth with [plymouth.enable=0](https://en.opensuse.org/SDB:NVIDIA_the_hard_way#Kernel_mode-setting) and actually like seeing that it’s “wicked m” doing something
rather than a generic “something is happening” screen.

That said, I’m still interested in using a virtual console with a higher resolution.
Let us know if you ever figure it out. :-]

I can reliably reproduce this problem if I start the VM with cold caches (after “echo 3 > /proc/sys/vm/drop_caches”). Comparing the Plymouth traces from the working and non-working cases, I get:

text splash (non-working):

[ply-device-manager.c:815]                      create_devices_from_udev:Timeout elapsed, looking for devices from udev
[ply-device-manager.c:302]                  create_devices_for_subsystem:creating objects for drm devices
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/pci0000:00/0000:00:01.0/drm/card0
[ply-device-manager.c:342]                  create_devices_for_subsystem:it's not initialized
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/pci0000:00/0000:00:01.0/drm/card0/card0-Virtual-1
[ply-device-manager.c:342]                  create_devices_for_subsystem:it's not initialized
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/virtual/drm/ttm
[ply-device-manager.c:342]                  create_devices_for_subsystem:it's not initialized
[ply-device-manager.c:302]                  create_devices_for_subsystem:creating objects for frame buffer devices
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/pci0000:00/0000:00:01.0/graphics/fb0
[ply-device-manager.c:342]                  create_devices_for_subsystem:it's not initialized
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/virtual/graphics/fbcon
[ply-device-manager.c:342]                  create_devices_for_subsystem:it's not initialized
[ply-device-manager.c:823]                      create_devices_from_udev:Creating non-graphical devices, since there's no suitable graphics hardware
...
[ply-device-manager.c:365]                                 on_udev_event:got add event for device card0
[ply-device-manager.c:377]                                 on_udev_event:ignoring since we're already using text splash for local console
[ply-device-manager.c:365]                                 on_udev_event:got add event for device fb0
[ply-device-manager.c:381]                                 on_udev_event:ignoring since we only handle subsystem graphics devices after timeout
[ply-device-manager.c:365]                                 on_udev_event:got add event for device card0-Virtual-1
[ply-device-manager.c:377]                                 on_udev_event:ignoring since we're already using text splash for local console

Frame-buffer splash (working):

[ply-device-manager.c:815]                      create_devices_from_udev:Timeout elapsed, looking for devices from udev
[ply-device-manager.c:302]                  create_devices_for_subsystem:creating objects for drm devices
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/pci0000:00/0000:00:01.0/drm/card0
[ply-device-manager.c:326]                  create_devices_for_subsystem:device is initialized
[ply-device-manager.c:335]                  create_devices_for_subsystem:found node /dev/dri/card0
[ply-device-manager.c:229]                create_devices_for_udev_device:device subsystem is drm
[ply-device-manager.c:244]                create_devices_for_udev_device:found DRM device /dev/dri/card0
[ply-device-manager.c:247]                create_devices_for_udev_device:forcing use of framebuffer for cirrusdrmfb
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/pci0000:00/0000:00:01.0/drm/card0/card0-Virtual-1
[ply-device-manager.c:326]                  create_devices_for_subsystem:device is initialized
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/virtual/drm/ttm
[ply-device-manager.c:342]                  create_devices_for_subsystem:it's not initialized
[ply-device-manager.c:302]                  create_devices_for_subsystem:creating objects for frame buffer devices
[ply-device-manager.c:319]                  create_devices_for_subsystem:found device /sys/devices/pci0000:00/0000:00:01.0/graphics/fb0
[ply-device-manager.c:326]                  create_devices_for_subsystem:device is initialized
[ply-device-manager.c:335]                  create_devices_for_subsystem:found node /dev/fb0
[ply-device-manager.c:229]                create_devices_for_udev_device:device subsystem is graphics
[ply-device-manager.c:251]                create_devices_for_udev_device:found frame buffer device /dev/fb0
[ply-device-manager.c:699] create_devices_for_terminal_and_renderer_type:creating devices for /dev/fb0 (renderer type: 2) (terminal: /dev/tty7)

So here is what happens: Plymouth enumerates the graphics devices, but udev has not yet finished processing them, so Plymouth ignores them and falls back to text mode. Later, when udev completes initialization, it sends an event, but Plymouth ignores it because it has already selected a splash backend.

This is clearly a bug in Plymouth. It should wait for udev to finish initialization.

I submitted Plymouth bug https://bugs.freedesktop.org/show_bug.cgi?id=105141
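Given that diagnosis, a variation on the fixed-delay workaround would be to make Plymouth wait until udev has drained its event queue, instead of sleeping for an arbitrary time; `udevadm settle` performs that kind of wait. This drop-in is an untested sketch. The file name is hypothetical, the udevadm path may differ on your system (check `command -v udevadm`), and the drop-in still has to be copied into the initramfs (for example with dracut’s install_items). Whether the udev coldplug events have already been queued when plymouth-start.service runs depends on the unit ordering, so it may not help in every case.

```
# /etc/systemd/system/plymouth-start.service.d/20-udev-settle.conf  (hypothetical)
[Service]
# Wait at most 10 s for udev to finish processing queued events, so the DRM
# device is marked initialized before Plymouth enumerates graphics devices.
# The leading "-" tells systemd to ignore a non-zero exit (e.g. a timeout).
ExecStartPre=-/usr/bin/udevadm settle --timeout=10
```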

https://www.novell.com/support/kb/doc.php?id=3111917

So in /etc/sysconfig/kernel

INITRD_MODULES="nvidia nvidia_modeset nvidia_uvm nvidia_drm"

then

mkinitrd

and

nvidia-drm.modeset=1

in YaST2 -> Boot Loader -> Kernel Parameters or /etc/default/grub

Does that seem correct?

Yes, I did that. If you read through the posts, you’ll see where I created an nvidia.conf dracut configuration file in /etc/dracut.conf.d. That file makes sure the nVidia modules get loaded in the initramfs.

There are a few ways to add module parameters like nvidia-drm.modeset=1. I created a special modprobe conf file, and my dracut nvidia.conf file includes it, which makes sure the modules get loaded with the proper parameters.
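A sketch of that two-file arrangement, with hypothetical file names: a modprobe.d file that sets the nvidia_drm parameter, and a dracut configuration that forces the modules into the initramfs and copies the modprobe file along with them. force_drivers and install_items are standard dracut.conf options; only the file names are examples.

```
# /etc/modprobe.d/50-nvidia-drm.conf  (hypothetical name)
# Same effect as nvidia-drm.modeset=1 on the kernel command line.
options nvidia_drm modeset=1
```

```shell
# /etc/dracut.conf.d/nvidia.conf
# Keep spaces inside the quotes so += appends cleanly to other config files.
force_drivers+=" nvidia nvidia_modeset nvidia_uvm nvidia_drm "
install_items+=" /etc/modprobe.d/50-nvidia-drm.conf "
# Rebuild with: dracut -f
```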

I added them using the modprobe command, e.g.


modprobe nvidia
modprobe nvidia_uvm
modprobe nvidia_drm
modprobe nvidia_modeset

mkinitrd

# lsmod | grep nvidia
nvidia_uvm            806912  0 
nvidia_drm             49152  2 
nvidia_modeset       1097728  3 nvidia_drm
nvidia              14344192  85 nvidia_modeset,nvidia_uvm
drm_kms_helper        155648  2 i915,nvidia_drm
drm                   397312  6 i915,drm_kms_helper,nvidia_drm
ipmi_msghandler        53248  1 nvidia

I guess that means it worked.
I also added nvidia-drm.modeset=1 to the boot options, but it didn’t seem to have the desired effect.

Leap 42.3

Thank you for submitting a bug. I put a 3-second delay in the systemd service file for plymouth-start (or whatever it’s called), and that should have been more than enough time for the udev rules to finish. I was working on adding more time (15 seconds), but I can’t remember whether I actually did that before the hard drive died. Anyway, thanks for submitting the bug report and showing us the trace.

Regardless of any plymouth bug (I uninstalled it)
the virtual terminal resolution is still very low.

I tried tweaking various settings found here:

https://askubuntu.com/questions/18444/how-do-i-increase-console-mode-resolution

https://unix.stackexchange.com/questions/17027/how-to-set-the-resolution-in-text-consoles-troubleshoot-when-any-vga-fail

No luck so far.

Actually after a reboot I see:

# lsmod | grep nvidia
nvidia_drm             49152  2 
nvidia_modeset       1097728  3 nvidia_drm
nvidia              14344192  84 nvidia_modeset
ipmi_msghandler        53248  1 nvidia
drm_kms_helper        155648  2 i915,nvidia_drm
drm                   397312  6 i915,drm_kms_helper,nvidia_drm

The nvidia_uvm is missing.
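As far as I know, nvidia_uvm is the module CUDA uses for unified memory, and nothing requests it at boot unless a CUDA program needs it. So it typically only shows up in lsmod after a manual modprobe (as in the earlier output) or once CUDA loads it. If you do want it loaded on every boot, systemd-modules-load reads module names from /etc/modules-load.d/. The file name below is hypothetical:

```
# /etc/modules-load.d/nvidia-uvm.conf  (hypothetical name)
# One module name per line; loaded by systemd-modules-load.service at boot.
nvidia_uvm
```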

I got it! :smiley:

In /etc/default/grub I changed

GRUB_GFXMODE="auto"

to

GRUB_GFXMODE=1920x1080x32
GRUB_GFXPAYLOAD_LINUX=keep

then ran

grub2-mkconfig -o /boot/grub2/grub.cfg

I think my problem before was possibly choosing a resolution not listed by the GRUB command vbeinfo.

Now I have full HD grub2 and virtual console. :cool:
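GRUB_GFXMODE also accepts a comma-separated list of modes that GRUB tries in order, so one way to avoid picking an unsupported resolution is to keep auto as a fallback. A sketch; the first mode should be one that vbeinfo (or videoinfo on newer GRUB builds) actually lists for your card:

```shell
# /etc/default/grub
# Try full HD first; fall back to GRUB's own choice if that mode is unavailable.
GRUB_GFXMODE="1920x1080x32,auto"
# Hand the same mode over to the kernel so the console keeps it.
GRUB_GFXPAYLOAD_LINUX=keep
# Regenerate: grub2-mkconfig -o /boot/grub2/grub.cfg
```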

It seems like that can be more easily accomplished with:
YaST2 > Boot Loader > Kernel Parameters > Console resolution

Either way, it seems that the nvidia modules don’t need to be part of the initial ramdisk for that to work.

However, I did verify that adding

force_drivers+=" nvidia nvidia_drm nvidia_modeset nvidia_uvm "

to /etc/dracut.conf.d/01-dist.conf
then running dracut -f does add them:

lsinitrd | grep nvidia

nvidia-drm.modeset=1
in the kernel parameters seems to work according to

cat /sys/module/nvidia_drm/parameters/modeset
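Wrapping that last check in a small function makes it easy to re-run after each change. This is a sketch: the function name is made up, the default path is the sysfs file from the command above, and that file may only be readable by root. It prints one of three fixed messages.

```shell
#!/bin/sh
# Report whether nvidia-drm kernel mode setting is active.
# The default path exists only while nvidia_drm is loaded; an alternative
# path can be passed as the first argument.
modeset_state() {
    param=${1:-/sys/module/nvidia_drm/parameters/modeset}
    if [ ! -r "$param" ]; then
        echo "nvidia_drm parameter not readable"
        return 1
    fi
    # Boolean module parameters read back as Y or N.
    case "$(cat "$param")" in
        Y) echo "modeset enabled" ;;
        *) echo "modeset disabled" ;;
    esac
}
```

Usage (as root): `modeset_state`.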