openSUSE CUDA Repo broken

The official openSUSE CUDA repo (cuda-opensuse15-x86_64) seems to be broken.

The recommended cuda-cloud-opengpu package was just updated to 590.44.01-1, but its direct, mandatory dependency nvidia-open-driver-G07-signed-kmp = 590.44.01 doesn’t exist in the repo.

I only noticed this because it broke my MicroOS server when it did the automatic nightly transactional-update. And it remained broken for most of a day since I didn’t figure it out until hours later.

Anyone know how to get transactional-update to not install broken packages so this doesn’t happen again?

@aaravchen that’s the official NVIDIA CUDA repo; you might need to jump on their forums and ask there… AFAIK openSUSE has no input into that repository.

No it’s not. openSUSE now packages the CUDA packages and NVIDIA drivers in a separate openSUSE CUDA repo, distinct from NVIDIA’s official upstream ones. See SDB:NVIDIA drivers - openSUSE Wiki

Ironically, that (unannounced?) change was the thing that broke MicroOS last month.

Never mind, I’m an idiot who can’t read a URL

…comment edited…

I should give details.

The Opensuse-MicroOS-NVIDIA package installs a new repo that gets named repo-non-free. I thought that was an openSUSE-maintained repo, but reading the URL, it’s actually an NVIDIA one, different from the one you need to add manually for CUDA.

Apparently NVIDIA, not openSUSE as I thought, created the cuda-cloud-opengpu package last month. It’s a meta package for all the dependencies you need to use CUDA, with matching versions across them all. Before that you had to manually install a half dozen packages independently while making sure they were all the same version. And then NVIDIA would update some of them out of sync with the others, so that was always fun to deal with.

The other existing issue is that MicroOS and transactional-update don’t support dkms or akmod drivers, so you can only use prebuilt drivers with them. Most notably, that includes the NVIDIA drivers, which are now almost exclusively distributed as dkms drivers since the nvidia-open drivers became available. And the downside of prebuilt drivers is that they have to target a specific kernel version, rather than supporting a range and automatically rebuilding for whichever kernel they’re used with.

So it turns out NVIDIA released a new version of cuda-cloud-opengpu in the cuda-opensuse15-x86_64 repo (which they own), but the prebuilt driver dependency it requires, which is also supposed to be supplied by the same repo, is missing. Unfortunately, for an automatic transactional-update you have to pass -y to zypper so it automatically confirms the updates in the dup sub-command, but that same -y also confirms that you want to continue installing a package with broken dependencies. zypper simply has no option to differentiate between the two or prevent the latter.
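For anyone curious, the unattended run boils down to something like this (a sketch of the equivalent commands, not the exact contents of the systemd units on my box):

```shell
# The nightly timer effectively runs a non-interactive dist-upgrade
# inside a new snapshot (illustrative; check your own systemd units):
transactional-update --non-interactive dup

# which in turn drives zypper roughly like:
zypper --non-interactive dup
# --non-interactive (the -y behavior) auto-answers *every* prompt,
# including "continue despite broken dependencies?" - there is no flag
# that auto-confirms the upgrade but still aborts on dependency errors.
```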

So I got a broken install of the updated cuda-cloud-opengpu package that was missing the driver.

While fixing this, I discovered that repo-non-free (installed by an openSUSE-maintained package, but pointing to an NVIDIA-owned repo) also contains cuda-cloud-opengpu, but not the newer version with the missing dependency. However, repo-non-free is missing virtually all of the CUDA dependencies of that meta package, so you still have to have cuda-opensuse15-x86_64 installed and enabled for the dependencies to be available. So, ironically, you’re best off installing cuda-cloud-opengpu from repo-non-free (which has none of its dependencies), so that updates without vendor switching only move to versions whose dependencies all exist, while you still need cuda-opensuse15-x86_64 to supply those dependencies: the very repo that will try to hand you the broken cuda-cloud-opengpu meta package.
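In zypper terms, the workaround looks roughly like this (repo aliases are the ones on my system and may differ on yours):

```shell
# Install the meta package only from the repo that carries known-good
# versions, while leaving the CUDA repo enabled so dependencies resolve:
zypper install --from repo-non-free cuda-cloud-opengpu

# Optionally raise repo-non-free's priority (lower number = preferred)
# so a plain "zypper dup" keeps taking the meta package from it:
zypper modifyrepo --priority 90 repo-non-free
```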

Keeping the fun going, openSUSE also released the new 6.18 kernel as kernel-default. Since transactional-update can only use prebuilt drivers, and NVIDIA is the only source of the NVIDIA drivers, NVIDIA has to release a new prebuilt driver for the 6.18 kernel. They lag on that, so when cuda-cloud-opengpu pulls in the NVIDIA driver as a dependency, it installs the latest one available, which still targets the 6.17 kernel.
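The mismatch is visible before rebooting, by comparing the kernel going into the snapshot against what the prebuilt driver was built for (package names are from my G07 setup; adjust for your driver generation):

```shell
# Kernel version landing in the snapshot:
rpm -q kernel-default

# Prebuilt (signed KMP) NVIDIA driver version:
rpm -q nvidia-open-driver-G07-signed-kmp-default

# A KMP encodes its kernel compatibility as RPM dependencies
# (kernel/ksym requires), so the mismatch shows up here:
rpm -q --requires nvidia-open-driver-G07-signed-kmp-default | grep -i -e kernel -e ksym
```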


What this means is I got the double whammy, and openSUSE hasn’t stabilized the use of CUDA like I thought they had. Other distros have had to implement package manager plugins (e.g. dnf-nvidia-plugin) to get the proper version restrictions between kernel, drivers, and libraries, but then again they also all support either dkms or akmods now (even the immutable ones).

My solution ended up needing to be a sledgehammer, since this is a largely unattended server that I want getting automatic updates, so it’s largely set-it-and-forget-it. I had to add a health-checker script (a service which is never linked to or mentioned by name in any transactional-update documentation, BTW) that fails the boot if the nvidia0 device node is missing, so that it rolls back every nightly automatic transactional-update upgrade that breaks the driver stack, until there is a kernel + driver + cuda-cloud-opengpu combination that actually works together properly.

@aaravchen Why not run workloads in containers so it’s independent of the OS? Then you should not need any host drivers…

@aaravchen Or just roll back to your last known good snapshot, and stay on it, until NVidia updates their driver package.

I am exclusively running workloads in containers. But in order to get GPU access in a container, you have to have the drivers on the host (since containers share the kernel with the host). Whether you use the older NVIDIA runc-wrapped-and-injected method or the newer standardized CDI injection method, both require the drivers themselves, plus some basic host utilities for managing the injection (i.e. the runc wrapper, or the CDI description file generator and driver-matched interface libraries).

That’s what I ended up having to do.
My wall-of-text description ended with an explanation that this seems to be the only path.

There’s no way to tell zypper to install what I told it to without also telling it to go ahead and install broken packages. The -y option confirms both or neither. So I can’t have the automatically run transactional-update avoid package breakage when upstream is in a bad state (NVIDIA’s bad cuda-cloud-opengpu meta package in one of their two repos).

There doesn’t seem to be any way to tell zypper not to upgrade the kernel if the NVIDIA driver doesn’t support that kernel yet, short of writing a custom zypper plugin.

So I added a script to the health-checker that verifies the device node is present. The health-checker is the (mostly undocumented) boot-time check for a failed new state, and is what transactional-update actually relies on for the automatic rollback behavior.
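The check itself is tiny; the under-documented part is where to put it and how it gets invoked. Something like this sketch (the plugin location and invocation details are assumptions, check the health-checker package on your system):

```shell
#!/bin/sh
# Hypothetical health-checker plugin: fail the boot-time health check
# (which is what triggers transactional-update's automatic rollback)
# when the NVIDIA device node never appeared after boot.
DEVICE=/dev/nvidia0

if [ -e "$DEVICE" ]; then
    echo "OK: $DEVICE present"
    exit 0
fi

echo "FAIL: $DEVICE missing - NVIDIA driver stack did not come up" >&2
exit 1
```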

Downside: it will still try every night to do updates and then reboot, but the entire update transaction will fail on every boot where the driver and kernel mismatch, or where a broken version of the driver-stack meta package is installed.

@aaravchen then I would lock the packages… transactional-update run zypper al ....

These are podman containers or kubernetes?

Locking the packages is another option. But the lack of dependency restrictions between the GPU driver and the kernel version it’s built for is an issue. I’d have to lock my kernel version as well, which I’d really prefer not to do.
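If I did go the lock route, it’s just zypper’s addlock run inside the transaction (the lock specs below are illustrative):

```shell
# Pin the moving parts so an unattended dup can't pull a broken combo:
transactional-update run zypper addlock kernel-default
transactional-update run zypper addlock cuda-cloud-opengpu
transactional-update run zypper addlock 'nvidia-open-driver-G07-signed-kmp*'

# Inspect or undo later:
zypper locks
zypper removelock kernel-default
```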

The container engine doesn’t really matter. Docker, podman, k3s, and K8s all use the same underlying method for getting GPU access inside containers. NVIDIA originally came up with it for Docker (possibly to support K8s back when Docker was still a supported backend?). It basically has to make sure the drivers are loaded on the host, then mount in the device nodes and bind-mount in some specific libraries so that they exactly match the drivers.

The original solution from NVIDIA actually used a patched version of runc that read container-specific environment variables and custom CLI options (i.e. --gpus=all) to decide which GPUs to map into the container. Then runc added support for hooks, and NVIDIA turned their patched runc into a wrapper around runc instead, with some scripts and a runc hook. That solution was made generic by OCI, which gave us OCI hooks, but it still relied on the hooked/wrapped runtime being a “different engine”.

Later this got made even more generic with CDI (Container Device Interface). CDI is the modern standard, and was first adopted by K8s. It basically has you run a generation tool on your host for the specific resources (i.e. nvidia-ctk) to create a CDI JSON file; the engines themselves (podman, k3s, K8s, and very recently, finally, Docker) then perform the mappings automatically. For CLI engines this looks like --device nvidia.com/gpu=all, for example.
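For completeness, the CDI flow on the host is only two steps once the drivers and nvidia-container-toolkit are installed (the image name below is just an example):

```shell
# 1. Generate the CDI spec describing the host GPUs and the exact
#    driver libraries/device nodes to map into containers:
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# 2. Request the device by CDI name; the engine itself performs the
#    mapping, no wrapped runtime or hook needed:
podman run --rm --device nvidia.com/gpu=all \
    registry.opensuse.org/opensuse/tumbleweed:latest nvidia-smi
# nvidia-smi comes from the host via the CDI mapping, not the image.
```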

I run K3s on Leap 16.0 here; it gives me better control, and my needs are simple. I use an NVIDIA Tesla P4 with workloads and nvidia-container-toolkit.

Hasn’t it all been combined now (nvidia-container-toolkit, CUDA, etc.) with the new repository? I’m assuming to avoid these issues…

Sort of. NVIDIA used to have a single repo that had everything. But it turns out the nvidia-container-toolkit package itself is just a handful of scripts that are distro-agnostic and work across a large range of driver and kernel versions. They split just the container toolkit out into its own repo about 9 months ago, and keep a second copy of it in the main repo only for distro versions that started with it. The NVIDIA repo for Fedora 42, for example, which came out this past spring, no longer includes the container tools package.

Otherwise, yes, NVIDIA has everything in a single repo. Which makes it really weird that the openSUSE package that installs the NVIDIA repo points to something hosted by NVIDIA that does not have everything in it. I don’t know whether it points to a deprecated NVIDIA location or just a legacy one (it’s actually an openSUSE 15 repo from NVIDIA, so it’s been around for a very long time), but my first thought was to check that one for all the dependencies, and it clearly does not have all of them.
NVIDIA also makes these things harder to investigate because they don’t offer the standard web-based FTP-style directory listing that most RPM repos do. So it’s hard to see what does and doesn’t exist.

Look here? https://developer.download.nvidia.com/compute/cuda/repos/ or can always add and look with zypper or myrlyn.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.