CUDA 10.1 Issues, and workaround

I have two workstations, both were running Leap 15.1. I in-place upgraded the first workstation to Leap 15.2, tested that CUDA was still functional (via Blender 2.83), and used this success as justification to perform a clean install of Leap 15.2 on my primary workstation. However, CUDA did not work after a clean install. Blender could not identify any CUDA capable devices, despite all other tests succeeding.

I use Ansible to configure my workstations, so I know for certain that the configuration was consistent.

NVIDIA-SMI output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0  On |                  N/A |
| 29%   29C    P8    23W / 225W |    774MiB /  7974MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

NVCC output:

~> /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

I confirmed the NVIDIA drivers were installed fine. I even played some Portal 2.

Running Blender with the --debug-cycles flag produced this output:

I0707 00:58:13.995280 11815 blender_python.cpp:191] Debug flags initialized to:
CPU flags:
  AVX2       : True
  AVX        : True
  SSE4.1     : True
  SSE3       : True
  SSE2       : True
  BVH layout : BVH8
  Split      : False
CUDA flags:
  Adaptive Compile : False
OptiX flags:
  CUDA streams : 1
OpenCL flags:
  Device type    : ALL
  Debug          : False
  Memory limit   : 0
...
I0707 00:59:02.191576 11815 device_cuda.cpp:41] CUEW initialization succeeded
I0707 00:59:02.397126 11815 device_cuda.cpp:43] Found precompiled kernels
CUDA cuInit: Unknown error
I0707 00:59:03.931155 11815 device_opencl.cpp:48] CLEW initialization succeeded.

I saw the same results in Blender 2.82 and the 2.9 nightly. Again, these same tests worked FINE on the workstation that was in-place upgraded, and it rendered fine. But not on the workstation that was clean installed.

Searching turned up this discussion. Following the advice of those posters, I compiled the sample code and got the same error as them when running the deviceQuery code sample:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL

But this is where it gets weird. If deviceQuery is run as an elevated user ONCE, CUDA starts working correctly for non-elevated users until the next reboot.

sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2070"
  CUDA Driver Version / Runtime Version          10.2 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7974 MBytes (8361672704 bytes)
  (36) Multiprocessors, ( 64) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            1815 MHz (1.81 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

I experimented with disabling AppArmor, ensured my user was a member of the video group (and rebooted), and upgraded to CUDA 10.2. But I can't explain why running this as root makes it work for all users, or why this occurs on a clean install but not on an upgraded install.
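One thing worth checking, given the run-once-as-root behaviour: whether the UVM device nodes only appear after an elevated CUDA run. This is my assumption, not something I have confirmed; `check_node` is just a hypothetical helper for comparing before and after:

```shell
# check_node: print "present" if the given device node exists,
# "missing" otherwise. /dev/nvidia-uvm is the standard UVM node path.
check_node() {
  if [ -e "$1" ]; then
    echo present
  else
    echo missing
  fi
}

# Compare before and after the elevated run:
check_node /dev/nvidia-uvm
check_node /dev/nvidia-uvm-tools
# sudo ./deviceQuery   # then re-run the two checks above
```

If both nodes flip from "missing" to "present" after the sudo run, that would explain why one elevated invocation unblocks every other user until reboot.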

Any thoughts on how I can debug this further? I’ve got the workaround, but would prefer a solution.

Hi and welcome to the Forum :slight_smile:
Upgrade to CUDA 11; it has later driver (450.x rather than the 440.x series), kernel, and gcc support…

CUDA 11 is an RC.

To OP: try reinstalling the NVIDIA drivers and then CUDA.
The upgrade process recommends uninstalling the proprietary drivers before performing the upgrade.

Hi
Are you using CUDA? The download for openSUSE is the 11 version…


nvidia-smi 

Tue Jul  7 09:22:40 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51       Driver Version: 450.51       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 00000000:00:03.0 N/A |                  N/A |
| 40%   44C    P8    N/A /  N/A |     98MiB /   978MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_May__6_19:09:25_PDT_2020
Cuda compilation tools, release 11.0, V11.0.167
Build cuda_11.0_bu.TC445_37.28358933_0

/data/applications/cuda/cuda-samples-master/Samples/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 710"
  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 979 MBytes (1026490368 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Max Clock rate:                            954 MHz (0.95 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

cat /etc/os-release 

NAME="openSUSE Leap"
VERSION="15.2"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.2"
PRETTY_NAME="openSUSE Leap 15.2"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.2"
BUG_REPORT_URL="https://bugs.opensuse.org"


Blender support for CUDA 11 is still quite preliminary; 10.1 and 10.2 are still the recommended versions.

This seems to be a modprobe issue with the nvidia-uvm and nvidia-uvm-tools modules. More information here
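If the missing nvidia-uvm module is indeed the cause, a persistent fix can be sketched along the lines of the boot-time script in NVIDIA's CUDA Installation Guide for Linux: load the module, then create the character device node that non-root CUDA programs need. Run as root; the `uvm_major` helper name is mine, and the guards make it a no-op on machines without the module:

```shell
# uvm_major: read /proc/devices-style lines on stdin and print the
# dynamic major number the kernel assigned to the named module.
uvm_major() {
  awk -v mod="$1" '$2 == mod {print $1}'
}

# Only act when running as root and the nvidia-uvm module is
# actually available (modprobe -n is a dry run); otherwise no-op.
if [ "$(id -u)" -eq 0 ] && modprobe -n nvidia-uvm 2>/dev/null; then
  modprobe nvidia-uvm
  major=$(uvm_major nvidia-uvm < /proc/devices)
  if [ -n "$major" ] && [ ! -e /dev/nvidia-uvm ]; then
    # Create the node with the dynamic major, minor 0, world rw.
    mknod -m 666 /dev/nvidia-uvm c "$major" 0
  fi
fi
```

Running something like this at boot (e.g. from a startup script) would avoid the once-per-reboot sudo deviceQuery workaround, since the node would already exist for non-elevated users.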

Hi
Looks like you and the bug report have it sorted…

My workflow for nvidia has always been the hard way; the updates for Leap and Tumbleweed (my primary desktop) lag behind. For SLES and SLED I use the repos, though, since I still need older card support (no CUDA needed). CUDA 11 has better gcc support as well…