I have two workstations, both running Leap 15.1. I in-place upgraded the first workstation to Leap 15.2, tested that CUDA was still functional (via Blender 2.83), and used that success as justification to perform a clean install of Leap 15.2 on my primary workstation. However, CUDA did not work after the clean install: Blender could not identify any CUDA-capable devices, despite all other tests succeeding.
I use Ansible to configure my workstations, so I know with certainty that the configuration was consistent between the two machines.
NVIDIA-SMI output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0  On |                  N/A |
| 29%   29C    P8    23W / 225W |    774MiB /  7974MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
NVCC output:
~> /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
I confirmed the NVIDIA drivers were installed and working; I even played some Portal 2.
Running Blender with the --debug-cycles flag produced this output:
I0707 00:58:13.995280 11815 blender_python.cpp:191] Debug flags initialized to:
CPU flags:
  AVX2       : True
  AVX        : True
  SSE4.1     : True
  SSE3       : True
  SSE2       : True
  BVH layout : BVH8
  Split      : False
CUDA flags:
  Adaptive Compile : False
OptiX flags:
  CUDA streams : 1
OpenCL flags:
  Device type  : ALL
  Debug        : False
  Memory limit : 0
...
I0707 00:59:02.191576 11815 device_cuda.cpp:41] CUEW initialization succeeded
I0707 00:59:02.397126 11815 device_cuda.cpp:43] Found precompiled kernels
CUDA cuInit: Unknown error
I0707 00:59:03.931155 11815 device_opencl.cpp:48] CLEW initialization succeeded.
I saw the same results in Blender 2.82 and the 2.90 nightly. Again, these same tests worked FINE on the in-place-upgraded workstation, and it rendered without issue, but not on the clean-installed one.
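To take Blender out of the picture entirely, the failure should be reproducible with a minimal driver-API probe that calls cuInit directly, since that is the call CUEW reports as failing above. A quick sketch (file name and formatting are mine; build with nvcc probe_cuinit.cu -o probe_cuinit -lcuda):

// probe_cuinit.cu: minimal CUDA driver-API probe (my own sketch)
#include <cstdio>
#include <cuda.h>

int main() {
    // cuInit is the first driver-API call Blender makes through CUEW,
    // so a failure here isolates the "Unknown error" from Blender itself.
    CUresult res = cuInit(0);
    if (res != CUDA_SUCCESS) {
        const char *name = nullptr;
        const char *desc = nullptr;
        cuGetErrorName(res, &name);
        cuGetErrorString(res, &desc);
        std::printf("cuInit failed: %s (%s)\n",
                    name ? name : "?", desc ? desc : "?");
        return 1;
    }
    int count = 0;
    cuDeviceGetCount(&count);
    std::printf("cuInit OK, %d CUDA device(s) visible\n", count);
    return 0;
}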
Searching turned up this discussion. Following the advice of the posters there, I compiled the CUDA sample code and hit the same error they did when running the deviceQuery sample:
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL
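The sample's harness aside, that failure boils down to a single runtime-API call. A stripped-down equivalent of deviceQuery's first check (my own sketch, not NVIDIA's sample code; build with nvcc count_devices.cu -o count_devices):

// count_devices.cu: minimal runtime-API device count check (sketch)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // Error 999 is cudaErrorUnknown, matching the output above.
        std::printf("cudaGetDeviceCount returned %d\n-> %s\n",
                    (int)err, cudaGetErrorString(err));
        return 1;
    }
    std::printf("Detected %d CUDA Capable device(s)\n", count);
    return 0;
}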
But this is where it gets weird: if deviceQuery is run as an elevated user ONCE, CUDA starts working correctly for non-elevated users until the next reboot.
sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2070"
  CUDA Driver Version / Runtime Version          10.2 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7974 MBytes (8361672704 bytes)
  (36) Multiprocessors, ( 64) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            1815 MHz (1.81 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
I experimented with disabling AppArmor, ensured my user was a member of the video group (and rebooted), and upgraded to CUDA 10.2. But I can't explain why running deviceQuery as root makes CUDA work for all users, or why this occurs on a clean install but not an upgraded one.
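My only theory so far (unverified) is that the root run has a one-time side effect, such as loading the nvidia-uvm kernel module or creating the /dev/nvidia* device nodes that the setuid nvidia-modprobe helper normally creates on demand. To test it, I intend to compare the device nodes before and after the sudo run; a small sketch (plain host code, builds with g++ or nvcc):

// check_nodes.cpp: report which NVIDIA device nodes exist (sketch)
#include <cstdio>
#include <sys/stat.h>

int main() {
    // These are the usual node paths; any that are missing before the
    // sudo run but present afterwards would explain the
    // "works after root touches it once" behaviour.
    const char *nodes[] = {
        "/dev/nvidiactl",
        "/dev/nvidia0",
        "/dev/nvidia-uvm",
        "/dev/nvidia-uvm-tools",
    };
    for (const char *path : nodes) {
        struct stat st;
        if (stat(path, &st) == 0)
            std::printf("%-22s present (mode %03o)\n", path, st.st_mode & 0777);
        else
            std::printf("%-22s missing\n", path);
    }
    return 0;
}

If the nodes only appear after the root run, the question becomes why the clean install isn't creating them at boot.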
Any thoughts on how I can debug this further? I've got a workaround, but I would prefer a proper solution.