Cannot get CUDA to work

@JoseskVolpe If you look in ls /sys/class/backlight/, what items are present? An amd one?
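A quick way to check (a minimal sketch; interface names such as amdgpu_bl1 vary by machine):

```shell
# List the backlight interfaces the kernel has registered; on a hybrid
# AMD/NVIDIA laptop you normally want a single amdgpu_bl* entry here.
for b in /sys/class/backlight/*; do
  [ -e "$b" ] || continue   # the glob matches nothing if no interface exists
  printf '%s: %s / %s\n' "${b##*/}" "$(cat "$b/brightness")" "$(cat "$b/max_brightness")"
done
```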

I remember using acer_wmi but it was not working; I’m not sure, though, so I’ll try that again.

amdgpu_bl1

@JoseskVolpe So that should be working… Can you look at the output from journalctl -b | grep backlight? Is there both an amd and an acpi one running? If so, disable the acpi service one, mask it, and remove those kernel options.

@JoseskVolpe Something like systemd-backlight@backlight:acpi_video1.service: Main process exited, code=exited, status=1/FAILURE. I see that on my AMD laptop…
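For reference, that failure signature can be grepped for directly (the unit name below is just the example from this thread; mask whichever unit your own journal reports):

```shell
# Sample journal line from this thread; the same pattern works against a
# live journal via: journalctl -b | grep -E 'systemd-backlight@.*FAILURE'
sample='systemd-backlight@backlight:acpi_video1.service: Main process exited, code=exited, status=1/FAILURE'
printf '%s\n' "$sample" | grep -E 'systemd-backlight@.*FAILURE'

# If a conflicting ACPI unit shows up, mask it (root required):
# sudo systemctl mask 'systemd-backlight@backlight:acpi_video1.service'
```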

I removed that ‘force’ option and display brightness adjustment is still working (weird, because it wasn’t working before I added that option). CUDA is still not working.

@JoseskVolpe Check the journal for backlight as you should not need the other entry either.

Maybe clean out any Blender cache in $HOME?

The journal shows amdgpu_bl1 being loaded. I removed the acpi_backlight=native entry, but that broke display brightness adjustment. The journal also shows an error with nvidia-wmi-ec-backlight:

jun 24 20:45:13 ProtoFOX kernel: amdgpu 0000:75:00.0: amdgpu: [drm] Skipping amdgpu DM backlight registration
jun 24 23:45:15 ProtoFOX kernel: nvidia-wmi-ec-backlight 603E9613-EF25-4338-A3D0-C46177516DB7: EC backlight control failed: AE_NOT_FOUND
jun 24 23:45:15 ProtoFOX kernel: nvidia-wmi-ec-backlight 603E9613-EF25-4338-A3D0-C46177516DB7: probe with driver nvidia-wmi-ec-backlight failed with error -5

I put that parameter back again for now.

@JoseskVolpe So do you have any /etc/modprobe.d options files for nvidia or amd present? The only other thing you could try is editing the grub options at boot and adding fbdev=1 nvidia_drm.modeset=1 to see if that helps. Press the e key, arrow down to the linux (or linuxefi) line, press End, add the options above, and press F10 to boot. If it doesn’t work (as in the screen doesn’t show up), just reboot; the options only apply to that one boot…
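To make those options permanent rather than a one-shot boot edit, they can go in /etc/default/grub (a sketch, assuming a standard GRUB2 setup; note that on the kernel command line the module parameter takes the nvidia_drm. prefix, so fbdev=1 is spelled nvidia_drm.fbdev=1 there):

```shell
# Fragment for /etc/default/grub; keep whatever options are already set
# and append the two module parameters (kernel command-line spelling).
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} nvidia_drm.modeset=1 nvidia_drm.fbdev=1"

# Then regenerate the grub config (output path varies by distro; root required):
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```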

Yes, only NVIDIA

$ cat /etc/modprobe.d/nvidia.conf 
# Automatically generated by EnvyControl

options nvidia-drm modeset=1
##Enable NVIDIA GSP Firmware
options nvidia NVreg_EnableGpuFirmware=1
##Power Management
options nvidia "NVreg_DynamicPowerManagement=0x02"
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_UsePageAttributeTable=1 NVreg_InitializeSystemMemoryAllocations=0
$ cat /etc/modprobe.d/50-nvidia-default.conf.rpmsave 
# NVreg_PreserveVideoMemoryAllocations needed for GNOME Wayland
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=485 NVreg_DeviceFileMode=0660 NVreg_PreserveVideoMemoryAllocations=1
options nvidia-drm modeset=1 fbdev=1
install nvidia PATH=$PATH:/bin:/usr/bin; if /sbin/modprobe --ignore-install nvidia; then   if /sbin/modprobe nvidia_uvm; then     if [ ! -c /dev/nvidia-uvm ]; then       mknod -m 660 /dev/nvidia-uvm c $(cat /proc/devices | while read major device; do if [ "$device" = "nvidia-uvm" ]; then echo $major; break; fi ; done) 0;        chown :video /dev/nvidia-uvm;     fi;     if [ ! -c /dev/nvidia-uvm-tools ]; then       mknod -m 660 /dev/nvidia-uvm-tools c $(cat /proc/devices | while read major device; do if [ "$device" = "nvidia-uvm" ]; then echo $major; break; fi ; done) 1;       chown :video /dev/nvidia-uvm-tools;     fi;   fi;   if [ ! -c /dev/nvidiactl ]; then     mknod -m 660 /dev/nvidiactl c 195 255;     chown :video /dev/nvidiactl;   fi;   devid=-1;   for dev in $(ls -d /sys/bus/pci/devices/*); do      vendorid=$(cat $dev/vendor);     if [ "$vendorid" = "0x10de" ]; then       class=$(cat $dev/class);       classid=${class%%00};       if [ "$classid" = "0x0300" -o "$classid" = "0x0302" ]; then          devid=$((devid+1));         if [ ! -c /dev/nvidia${devid} ]; then            mknod -m 660 /dev/nvidia${devid} c 195 ${devid};            chown :video /dev/nvidia${devid};         fi;       fi;     fi;   done;   /sbin/modprobe nvidia_drm;   if [ ! -c /dev/nvidia-modeset ]; then     mknod -m 660 /dev/nvidia-modeset c 195 254;     chown :video /dev/nvidia-modeset;   fi; fi

The system did boot, but there was no change; CUDA still didn’t work…

I was asked on the NVIDIA forum to disable the GSP firmware. CUDA did work, but it crashed after using Blender, and it also gave me scary side effects: the system taking longer to boot, WiFi not working, and then the mouse and WiFi not working after a reboot until I powered off. Have I bought haunted hardware? lol

@JoseskVolpe So what is EnvyControl, which generated the nvidia.conf?

So I would remove both for the moment and create your own 50-nvidia.conf with;

options nvidia-drm modeset=1
options nvidia "NVreg_DynamicPowerManagement=0x02"

To start with, after creating it, run dracut -f --regenerate-all, then reboot and test.
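The steps above can be sketched as a script; the file contents are exactly the two lines suggested, and the install and dracut steps need root, so they are shown commented:

```shell
# Write the suggested two-option config to a working file first.
cat > 50-nvidia.conf <<'EOF'
options nvidia-drm modeset=1
options nvidia "NVreg_DynamicPowerManagement=0x02"
EOF

# Install it and rebuild every initramfs (root required):
# sudo cp 50-nvidia.conf /etc/modprobe.d/50-nvidia.conf
# sudo dracut -f --regenerate-all
# sudo reboot
```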

For reference I have;

cat /etc/modprobe.d/50-nvidia.conf 

blacklist nouveau
options nouveau modeset=0
##Enable NVIDIA Open Kernel Driver
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
##Enable NVIDIA GSP Firmware
options nvidia NVreg_EnableGpuFirmware=1
##Power Management
options nvidia NVreg_DynamicPowerManagement=0x02
options nvidia NVreg_PreserveVideoMemoryAllocations=1
## Enable the PAT feature
options nvidia NVreg_UsePageAttributeTable=1

It works, using only the first two parameters given! CUDA doesn’t crash anymore.
Blender works fine (OptiX still crashes, just like on Windows), and TensorFlow now detects CUDA but gives a different error:

>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2024-06-26 16:40:52.222668: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-06-26 16:40:52.643438: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
False
>>> tf.config.list_physical_devices('GPU')
2024-06-26 16:41:12.613626: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-06-26 16:41:12.614151: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
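One way to narrow down the “Cannot dlopen some GPU libraries” warning is to ask the dynamic linker which CUDA libraries it can actually see (a sketch; the library names below are the usual TensorFlow CUDA dependencies as an assumption, since the exact names and versions depend on the TensorFlow release):

```shell
# Check the dynamic linker's cache for the usual TensorFlow CUDA
# dependencies; any MISSING entry is a dlopen candidate to fix.
for lib in libcudart libcublas libcudnn libcufft; do
  if ldconfig -p 2>/dev/null | grep -q "$lib"; then
    echo "$lib: found"
  else
    echo "$lib: MISSING"
  fi
done
```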

Ollama now finds my GPU and stores models partially in VRAM, but it still uses the CPU to compute and gives this warning:

WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="140546558334784" timestamp=1719435271

Suspend doesn’t seem to freeze anymore, but I’ll need some days to confirm.

I don’t know lol
I guess it was one of the other Optimus managers I tried before using Prime, like Bumblebee. It’s my first time with a dedicated GPU (tbh, my real first time was two decades back with a Windows XP machine, but I was a cub so it doesn’t count xP), so I still need to learn some things about it. Probably one of them generated it.

Edit: It’s this: GitHub - bayasdev/envycontrol: Easy GPU switching for Nvidia Optimus laptops under Linux

@JoseskVolpe How are you running/starting ollama? If from the command line, then I would recommend switcherooctl and the associated switcheroo-control.service to use PRIME Render Offload. It’s what I use here, though I also have a dedicated computer running k3s with an Nvidia Tesla P4 as a compute node for the likes of open-webui/ollama.

 switcherooctl list
Device: 0
  Name:        Intel Corporation DG2 [Arc A380]
  Default:     yes
  Environment: DRI_PRIME=pci-0000_04_00_0

Device: 1
  Name:        NVIDIA Corporation TU117GLM [Quadro T400 Mobile]
  Default:     no
  Environment: __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only

I’m using the terminal; using switcherooctl gives the same results. It first tries to use the AMD iGPU, skips it, then offloads to the dGPU, but still uses the CPU to compute (nvtop stays at 0% usage for the NVIDIA GPU).

time=2024-06-26T18:58:19.577-03:00 level=WARN source=amd_linux.go:48 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-06-26T18:58:19.577-03:00 level=INFO source=amd_linux.go:233 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"
time=2024-06-26T18:58:19.577-03:00 level=INFO source=amd_linux.go:311 msg="no compatible amdgpu devices detected"
time=2024-06-26T18:58:19.850-03:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=14 memory.available="3.7 GiB" memory.required.full="6.8 GiB" memory.required.partial="3.6 GiB" memory.required.kv="1.0 GiB" memory.weights.total="4.7 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
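Those lines can be cross-checked against the warning above: the memory planner did schedule layers for the GPU, which suggests the runner binary itself, not GPU detection, is the problem. A small sketch pulling the planner’s numbers out of the sample line:

```shell
# Extract the layer-offload counts from ollama's "offload to gpu" log line
# (sample line trimmed from this thread).
line='msg="offload to gpu" layers.requested=-1 layers.real=14 memory.available="3.7 GiB"'
printf '%s\n' "$line" | grep -oE 'layers\.(requested|real)=-?[0-9]+'
# prints:
# layers.requested=-1
# layers.real=14
```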

@JoseskVolpe can you show the output from;

switcherooctl list
switcherooctl glxinfo | grep "OpenGL renderer"
 (⌚qua jun-6 6:58:57)-(🦊joseskvolpe:~)-( 308K:63)
$ switcherooctl list
Device: 0
  Name:        Advanced Micro Devices, Inc. [AMD®/ATI] Rembrandt [Radeon 680M]
  Default:     yes
  Environment: DRI_PRIME=pci-0000_75_00_0

Device: 1
  Name:        NVIDIA Corporation GA107M [GeForce RTX 3050 Mobile]
  Default:     no
  Environment: __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only
 (⌚qua jun-6 7:16:56)-(🦊joseskvolpe:~)-( 308K:63)
$ switcherooctl glxinfo | grep "OpenGL renderer"
OpenGL renderer string: NVIDIA GeForce RTX 3050 Laptop GPU/PCIe/SSE2

@JoseskVolpe so if you run;
__GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only /usr/bin/ollama

Assuming that’s where ollama lives, does this help?

Nope, same results. I think it may be the same issue that Tensorflow is having.

@JoseskVolpe not really sure, just installed ollama as a test here;

ollama serve
....
time=2024-06-26T17:22:29.413-05:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-41dbedb2-a46d-4f41-4ac3-8a2dbdfea5b6 library=cuda compute=7.5 driver=12.5 name="NVIDIA T400" total="1.6 GiB" available="1.6 GiB"

I’ll try to fix TensorFlow; if I run into some issues, I’ll open a new thread. I think I can mark this one as solved now. Thanks!
