Cannot get CUDA to work

Sorry for the swearing but HOLY ■■■■ I MADE IT WORK OMFGGGGGG (I’M SO HAPPY //>w<// )

Ok so… I went to YaST and removed EVERYTHING related to NVIDIA, every fricking thing. No need to remove the NVIDIA repository. Reboot.
Next, I installed the drivers with the recommended command from the Wiki, plus the driver and firmware packages.

sudo zypper install-new-recommends --repo NVIDIA:repo-non-free
sudo zypper in nvidia-drivers-G06 kernel-firmware-nvidia-gspx-G06

Reboot, enroll the keys.
Done. Just that. No .run file needed.
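
For anyone repeating these steps, here is a rough, read-only sanity check to run after the reboot. It is only a sketch: the module_state helper is a made-up name for illustration, and it assumes mokutil and nvidia-smi are installed (each is skipped if missing). It lists whether the NVIDIA kernel modules are loaded, shows the Secure Boot state (which is why the key enrollment matters), and asks nvidia-smi to talk to the GPU.

```shell
#!/bin/sh
# Post-install sanity check (sketch). Reads state only; changes nothing.

# module_state is a hypothetical helper: prints "loaded" if the module
# named in $1 appears in the first column of the listing on stdin
# (the format used by /proc/modules and lsmod), else "missing".
module_state() {
    if awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'; then
        echo loaded
    else
        echo missing
    fi
}

if [ -r /proc/modules ]; then
    for mod in nvidia nvidia_uvm nvidia_drm nouveau; do
        printf '%-12s %s\n' "$mod" "$(module_state "$mod" < /proc/modules)"
    done
fi

# With Secure Boot enabled, modules signed with a key that has not
# been enrolled are refused by the kernel.
if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state
fi

# nvidia-smi only succeeds when the driver can actually reach the GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi failed: driver not loaded or GPU unreachable"
fi
```

If nouveau shows as loaded or nvidia shows as missing, the install probably did not take effect for the running kernel.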

I don’t know how that worked. But for the record, before that I had tried to just import the Leap repository and install the drivers and CUDA libraries while ignoring file conflicts, but without success, so I later uninstalled everything and removed the repositories.

FINALLY AAAAAAAAA
OptiX is also detected (but Blender crashes, same as on Windows).
Now, AMD HIP is the next…

Awww c’mon, I had to reboot because of the suspend glitch the NVIDIA driver has (it freezes the system) and now it isn’t working anymore. WHY?

@JoseskVolpe check the output from journalctl -b. Are you running Xorg or Wayland? Did the likes of suse-prime get installed?

There’s nothing relevant to NVIDIA or CUDA.

Wayland, and when it was working I was using Wayland as well.

No.
But cmon, it was working TwT

@JoseskVolpe So is the nvidia driver loaded?
/sbin/lspci -nnk | grep -EA3 "VGA|Display|3D"

Has nouveau been blacklisted?
/sbin/modprobe -c | grep -E "blacklist nouveau"

Does the nvidia driver match the running kernel?
/sbin/modinfo nvidia | grep filename
uname -a

Check wayland is actually running
echo $XDG_SESSION_TYPE

Check suse-prime is not installed (It has a habit of that for dual graphics…)
zypper se suse-prime
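
The checks above can also be run in one go. This is only a sketch that strings the same commands together into a report; the helper names section and module_matches_kernel are made up for illustration, and it assumes modinfo -n (print only the filename) and zypper se -i (installed packages only) behave as they do on current Tumbleweed.

```shell
#!/bin/sh
# Collects the checks from this post into one report (sketch; read-only).
PATH="$PATH:/sbin:/usr/sbin"

# Hypothetical helper: succeeds when a module path (as printed by
# modinfo) lives under the module tree of the given kernel release.
module_matches_kernel() {
    # $1 = module filename, $2 = kernel release (uname -r)
    case "$1" in
        /lib/modules/"$2"/*|/usr/lib/modules/"$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

section() { printf '\n== %s ==\n' "$1"; }

section "Kernel driver in use"
command -v lspci >/dev/null 2>&1 && lspci -nnk | grep -EA3 "VGA|Display|3D"

section "nouveau blacklist"
command -v modprobe >/dev/null 2>&1 && modprobe -c | grep -E "blacklist nouveau"

section "nvidia module vs running kernel"
release=$(uname -r)
modfile=$(modinfo -n nvidia 2>/dev/null)
if module_matches_kernel "$modfile" "$release"; then
    echo "OK: $modfile matches $release"
else
    echo "MISMATCH or not found: '$modfile' (running $release)"
fi

section "Session type"
echo "${XDG_SESSION_TYPE:-unset}"

section "suse-prime"
command -v zypper >/dev/null 2>&1 && zypper --non-interactive se -i suse-prime
:
```

Run it as your normal user from inside the desktop session, otherwise XDG_SESSION_TYPE will not reflect what the desktop is using.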

So you went into Windows and tested Blender, it crashed, then you booted back into Tumbleweed and NVIDIA was not working?

Yes

Yes

Yes

Yes

Not installed

I got it working in Tumbleweed, switched to OptiX and it crashed, so I switched back to CUDA. I unplugged the power cord to move the laptop, but it froze during suspend after I closed the lid, so I had to reboot (the NVIDIA driver and its power management are crappy). I plugged it back in and CUDA wasn’t working anymore.

@JoseskVolpe Any response on your NVIDIA thread? I would set persistence mode or add the power options.

cat /etc/systemd/system/nvidia-persistence.service 
# /etc/systemd/system/nvidia-persistence.service
#
[Unit]
Description=Systemd service for enabling persistence mode on gpus

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user <your_username>
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
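
A small sketch for reading the persistence state back after starting or stopping the unit above. It assumes the unit file is saved under the name shown, and that nvidia-smi reports persistence_mode as Enabled or Disabled; persistence_status is just an illustrative helper name.

```shell
#!/bin/sh
# Sketch: report whether persistence mode is currently on.

# Hypothetical helper: map the value nvidia-smi reports to a short word.
persistence_status() {
    case "$1" in
        Enabled)  echo on ;;
        Disabled) echo off ;;
        *)        echo unknown ;;
    esac
}

if command -v nvidia-smi >/dev/null 2>&1; then
    mode=$(nvidia-smi --query-gpu=persistence_mode --format=csv,noheader 2>/dev/null | head -n 1)
    echo "persistence mode: $(persistence_status "$mode")"
else
    echo "nvidia-smi not found"
fi

# To start the unit for a test run and stop it again (run manually):
#   sudo systemctl daemon-reload
#   sudo systemctl start nvidia-persistence.service
#   sudo systemctl stop nvidia-persistence.service
# To keep it across reboots:
#   sudo systemctl enable nvidia-persistence.service
```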

Yes, I’ve been asked to send debug logs.

Isn’t persistence mode bad for power efficiency? Like, it’ll quickly drain my battery even when I’m not using the GPU. Is it mandatory for CUDA?

@JoseskVolpe No, just run it as a test and see if that helps… you can stop and start it as required…

Enabled and started it, and it didn’t work, so I rebooted and now it works. Something seems to be blocking CUDA from starting, and that’s why it won’t work.
I won’t consider this solved, as I’d have to keep persistence mode enabled and it has a power cost.

When the GPU is not being used for any purpose (i.e. it is idle, technically: no contexts of any kind are instantiated on the GPU) and persistence mode is not enabled, the GPU, in concert with the GPU driver, will automatically reduce its power state to a very low level, sometimes including a complete power-off scenario.
What does "persistence mode" actually do which reduces CUDA startup time? - Stack Overflow

@JoseskVolpe See the Known issues at the end of https://download.nvidia.com/XFree86/Linux-x86_64/555.42.02/README/powermanagement.html

I would also switch to Xorg and see how it goes…

Ok, that’s weird. I was testing in Blender but then I suddenly noticed it had stopped using the GPU and was stressing the CPU, so I restarted Blender and it wasn’t detecting CUDA anymore.
Seems like the CUDA server is also crashing.

@JoseskVolpe are you sure it’s not a real hardware issue? I’ve never had CUDA not working; that said, there were issues with Blender and my AMD/NVIDIA setup…

What if you install nvtop and monitor the GPUs in a separate terminal to see what is happening? It’s not overheating?
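
If a plain log is easier to paste into the thread than a screenshot, something like the sketch below could stand in for the interactive view. Note that it uses nvidia-smi rather than nvtop, because nvidia-smi output is easy to script. The helper names temp_flag and watch_gpu are made up, and the 87 degree default is an arbitrary example limit, not an NVIDIA specification.

```shell
#!/bin/sh
# Sketch: log GPU temperature, load and power state every few seconds.

# Hypothetical helper: classify a temperature in degrees C against a
# limit ($2, default 87, which is an arbitrary example value).
temp_flag() {
    if [ "$1" -ge "${2:-87}" ]; then echo HOT; else echo ok; fi
}

# Hypothetical helper: take $1 samples, $2 seconds apart.
watch_gpu() {
    i=0
    while [ "$i" -lt "$1" ]; do
        line=$(nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,pstate \
                          --format=csv,noheader,nounits) || return 1
        temp=${line%%,*}    # first CSV field is the temperature
        echo "$(date +%T) $line $(temp_flag "$temp")"
        i=$((i + 1))
        sleep "$2"
    done
}

if command -v nvidia-smi >/dev/null 2>&1; then
    watch_gpu 3 2 || echo "nvidia-smi could not query the GPU"
else
    echo "nvidia-smi not found"
fi
```

Leave it running in a second terminal while Blender renders; if the query itself starts failing, that points at the driver rather than at heat.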

Rebooted, CUDA isn’t working anymore even though persistence mode is enabled

It seems to be a software issue. The temperature right now is 42°C but CUDA still doesn’t work, and it never went over 86°C. Also, CUDA never stopped working while testing on Windows.

@JoseskVolpe If you run the test script to check all the cuda components, is that working?

What does inxi -GSaz show?

Yes

$ ./detect_cuda.sh 
        libcudart.so.12 -> libcudart.so.12.4.99
        libcuda.so.1 -> libcuda.so.550.90.07
        libcudadebugger.so.1 -> libcudadebugger.so.550.90.07
        libcuda.so.1 -> libcuda.so.550.90.07
libcuda is installed
        libcudart.so.12 -> libcudart.so.12.4.99
libcudart is installed
ERROR: libnccl is NOT installed
        libcudnn.so.9 -> libcudnn.so.9.2.0
        libcudnn_ops.so.9 -> libcudnn_ops.so.9.2.0
        libcudnn_heuristic.so.9 -> libcudnn_heuristic.so.9.2.0
        libcudnn_graph.so.9 -> libcudnn_graph.so.9.2.0
        libcudnn_engines_runtime_compiled.so.9 -> libcudnn_engines_runtime_compiled.so.9.2.0
        libcudnn_engines_precompiled.so.9 -> libcudnn_engines_precompiled.so.9.2.0
        libcudnn_cnn.so.9 -> libcudnn_cnn.so.9.2.0
        libcudnn_adv.so.9 -> libcudnn_adv.so.9.2.0
libcudnn is installed
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
$ inxi -GSaz
System:
  Kernel: 6.9.5-1-default arch: x86_64 bits: 64 compiler: gcc v: 13.3.0
    clocksource: tsc avail: hpet,acpi_pm
    parameters: BOOT_IMAGE=/vmlinuz-6.9.5-1-default
    root=/dev/mapper/OpenSUSE-SYSTEM splash=silent resume=/dev/OpenSUSE/SWAP
    quiet pcie_aspm=force acpi_backlight=native security=apparmor rd.shell=0
    mitigations=auto
  Desktop: KDE Plasma v: 6.0.5 tk: Qt v: N/A info: frameworks v: 6.2.0
    wm: kwin_wayland tools: avail: xscreensaver vt: 2 dm: SDDM Distro: openSUSE
    Tumbleweed-Slowroll 20240605
Graphics:
  Device-1: NVIDIA GA107M [GeForce RTX 3050 Mobile]
    vendor: Acer Incorporated ALI driver: nvidia v: 550.90.07
    alternate: nouveau,nvidia_drm non-free: 550.xx+ status: current (as of
    2024-04; EOL~2026-12-xx) arch: Ampere code: GAxxx process: TSMC n7 (7nm)
    built: 2020-2023 pcie: gen: 1 speed: 2.5 GT/s lanes: 8 link-max: gen: 4
    speed: 16 GT/s lanes: 16 ports: active: none empty: HDMI-A-1
    bus-ID: 01:00.0 chip-ID: 10de:25a2 class-ID: 0300
  Device-2: AMD Rembrandt [Radeon 680M] vendor: Acer Incorporated ALI
    driver: amdgpu v: kernel arch: RDNA-2 code: Navi-2x process: TSMC n7 (7nm)
    built: 2020-22 pcie: gen: 4 speed: 16 GT/s lanes: 16 ports: active: eDP-1
    empty: DP-1, DP-2, DP-3, DP-4, DP-5, DP-6, DP-7, DP-8, Writeback-1
    bus-ID: 75:00.0 chip-ID: 1002:1681 class-ID: 0300 temp: 43.0 C
  Device-3: Chicony ACER HD User Facing driver: uvcvideo type: USB rev: 2.0
    speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 5-1:2 chip-ID: 04f2:b76f
    class-ID: fe01 serial: <filter>
  Display: wayland server: X.org v: 1.21.1.12 with: Xwayland v: 24.1.0
    compositor: kwin_wayland driver: X: loaded: amdgpu,nvidia
    unloaded: fbdev,modesetting,vesa alternate: nouveau,nv dri: radeonsi
    gpu: nvidia,amdgpu display-ID: 0
  Monitor-1: eDP-1 res: 1536x864 size: N/A modes: N/A
  API: EGL v: 1.5 hw: drv: nvidia drv: amd radeonsi platforms: device: 0
    drv: nvidia device: 1 drv: radeonsi device: 3 drv: swrast surfaceless:
    drv: nvidia wayland: drv: radeonsi x11: drv: radeonsi
    inactive: gbm,device-2
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: amd mesa v: 24.0.8 glx-v: 1.4
    direct-render: yes renderer: AMD Radeon 660M (radeonsi rembrandt LLVM
    18.1.6 DRM 3.57 6.9.5-1-default) device-ID: 1002:1681 memory: 500 MiB
    unified: no display-ID: :0.0
  API: Vulkan v: 1.3.283 layers: 2 device: 0 type: integrated-gpu name: AMD
    Radeon 660M (RADV REMBRANDT) driver: N/A device-ID: 1002:1681
    surfaces: xcb,xlib,wayland device: 1 type: discrete-gpu name: NVIDIA
    GeForce RTX 3050 Laptop GPU driver: N/A device-ID: 10de:25a2
    surfaces: xcb,xlib,wayland

@JoseskVolpe and the reason for the kernel option pcie_aspm=force? On hardware that does not support ASPM, this can cause the system to stop responding…
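
To see what the running kernel was actually booted with, a sketch like this could help. The aspm_option helper is a made-up name; the sysfs path is the standard location of the pcie_aspm policy parameter, and it is skipped if it cannot be read.

```shell
#!/bin/sh
# Sketch: show whether pcie_aspm is set on the kernel command line.

# Hypothetical helper: print the value of pcie_aspm= from a
# command-line string passed as $1, or "unset" if it is not there.
aspm_option() {
    for word in $1; do
        case "$word" in
            pcie_aspm=*) echo "${word#pcie_aspm=}"; return 0 ;;
        esac
    done
    echo unset
}

if [ -r /proc/cmdline ]; then
    echo "pcie_aspm option: $(aspm_option "$(cat /proc/cmdline)")"
fi

# Active ASPM policy as seen by the kernel (the bracketed entry is in use).
policy=/sys/module/pcie_aspm/parameters/policy
[ -r "$policy" ] && cat "$policy"
:
```

Booting once without the option and repeating the CUDA test would show whether it plays any part in the problem.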

I had to use this option to make display brightness adjustment work

@JoseskVolpe acer_wmi should look after that or the acpi backlight… or the amd one is interfering…