NVIDIA GPU: stability issues

The GPU is unstable.
Immediately after turning on the machine, the GPU is functional, but it becomes unusable after a few minutes. Finally, after the machine goes into deep sleep, an error message appears.

I can no longer restart the machine, and it is impossible to shut it down via the graphical interface.

I ran the script: nvidia-bug-report.sh

Here are some excerpts:

Component                           | Details
================================================================================
Vulkan Info                         | None
--------------------------------------------------------------------------------
NVIDIA SMI                          | NVIDIA-SMI version  : 580.105.08
                                    | NVML version        : 580.105
                                    | DRIVER version      : 580.105.08
                                    | CUDA Version        : 13.0
--------------------------------------------------------------------------------
NVIDIA GPU Details                  | NVIDIA GeForce RTX 3080 Laptop GPU, 580.105.08, 8192 MiB, 94.04.43.00.9F, 00000000:01:00.0, [N/A]
--------------------------------------------------------------------------------
NVIDIA Settings                     | None
--------------------------------------------------------------------------------
NVIDIA Fabric Manager               | None
--------------------------------------------------------------------------------
NVIDIA Subnet Manager               | None
--------------------------------------------------------------------------------
Mellanox Link                       | None
--------------------------------------------------------------------------------
InfiniBand Status                   | None
--------------------------------------------------------------------------------
InfiniBand Network Discovery        | None
--------------------------------------------------------------------------------
NVIDIA MSE/NETIR Versions           | None
--------------------------------------------------------------------------------
NVIDIA Switch Details               | mst command not found
--------------------------------------------------------------------------------
NVIDIA NIC Details                  | None
--------------------------------------------------------------------------------
OS Details                          | None
--------------------------------------------------------------------------------

.........


*** /etc/os-release
*** ls: lrwxrwxrwx. 1 root root 21 2025-11-13 20:31:10.000000000 +0100 /etc/os-release -> ../usr/lib/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20251113"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20251113"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
# CPE 2.3 format, boo#1217921
CPE_NAME="cpe:2.3:o:opensuse:tumbleweed:20251113:*:*:*:*:*:*:*"
#CPE 2.2 format
#CPE_NAME="cpe:/o:opensuse:tumbleweed:20251113"
BUG_REPORT_URL="https://bugzilla.opensuse.org"
SUPPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"

........

● nvidia-powerd.service - nvidia-powerd service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-powerd.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-11-17 20:38:34 CET; 18min ago
 Invocation: 082b14f58834406890ac60c028e3acc5
   Main PID: 1030 (nvidia-powerd)
      Tasks: 5 (limit: 37402)
        CPU: 998ms
     CGroup: /system.slice/nvidia-powerd.service
             └─1030 /usr/bin/nvidia-powerd

nov. 17 20:38:35 dans nvidia-powerd[1030]: DBus Connection is established
nov. 17 20:38:35 dans nvidia-powerd[1030]: ERROR! DC power limits table is not supported
nov. 17 20:38:36 dans nvidia-powerd[1030]: ERROR! Failed to get SysPwrLimitGetInfo!!
nov. 17 20:38:36 dans nvidia-powerd[1030]: ERROR! Client (presumably SBIOS) has requested to disable Dynamic Boost DC controller
nov. 17 20:38:47 dans nvidia-powerd[1030]: ERROR! Exception NvPcfApi is not initialized
nov. 17 20:38:56 dans nvidia-powerd[1030]: ERROR! Exception NvPcfApi is not initialized
nov. 17 20:39:16 dans nvidia-powerd[1030]: ERROR! Exception NvPcfApi is not initialized
nov. 17 20:41:20 dans nvidia-powerd[1030]: ERROR! Exception NvPcfApi is not initialized
nov. 17 20:55:13 dans nvidia-powerd[1030]: ERROR! Exception NvPcfApi is not initialized
nov. 17 20:56:36 dans nvidia-powerd[1030]: ERROR! Exception NvPcfApi is not initialized

○ nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; disabled; preset: enabled)
     Active: inactive (dead)

.........

nov. 16 10:50:04 dans kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 511
nov. 16 10:50:04 dans kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  580.105.08  Wed Oct 29 23:15:11 UTC 2025
nov. 16 10:50:05 dans kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  580.105.08  Wed Oct 29 22:15:26 UTC 2025
nov. 16 10:50:05 dans kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
nov. 16 10:50:05 dans systemd[1]: Started nvidia-powerd service.
nov. 16 10:50:05 dans nvidia-powerd[1020]: nvidia-powerd version:2.0 (build 1)
nov. 16 10:50:07 dans nvidia-powerd[1020]: DBus Connection is established
nov. 16 10:50:07 dans nvidia-powerd[1020]: ERROR! DC power limits table is not supported
nov. 16 10:50:07 dans nvidia-powerd[1020]: ERROR! Failed to get SysPwrLimitGetInfo!!
nov. 16 10:50:07 dans nvidia-powerd[1020]: ERROR! Client (presumably SBIOS) has requested to disable Dynamic Boost DC controller
nov. 16 10:50:07 dans kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:01:00.0 on minor 0
nov. 16 10:50:13 dans gnome-shell[1763]: Added device '/dev/dri/card0' (nvidia-drm) using atomic mode setting.
nov. 16 10:51:32 dans gnome-shell[3592]: Added device '/dev/dri/card0' (nvidia-drm) using atomic mode setting.
nov. 16 10:57:46 dans kernel: NVRM: GPU at PCI:0000:01:00: GPU-622a4aae-0147-82a2-9605-6c976230a1ee

.........

nov. 16 10:57:52 dans kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8551, name=python, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 2943 (0x20800a81 0x4).
nov. 16 10:57:58 dans nvidia-powerd[1020]: ERROR! Failed to get AC Line status
nov. 16 10:57:58 dans kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1020, name=nvidia-powerd, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 2944 (0x2080205a 0x4).
nov. 16 10:57:58 dans kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
nov. 16 11:18:13 dans suspend[11143]: nvidia-suspend.service
nov. 16 11:18:13 dans logger[11143]: <13>Nov 16 11:18:13 suspend: nvidia-suspend.service
nov. 16 11:18:18 dans kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:2:0:4040


Any idea?

D.

That looks to me like some sort of thermal problem (stuck fan, improperly seated heat sink…).

That is a typical consequence of nvidia-persistenced being inactive. Maybe there is some boot parameter needed by your GPU, or there might be something interesting in the README file.
A normal output should be something like:

LT-B:~ # systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-11-17 10:06:21 CET; 12h ago
 Invocation: 1817a275a0514ad1b1361d033c93701b
    Process: 1642 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=0/SUCCESS)
   Main PID: 1663 (nvidia-persiste)
      Tasks: 1 (limit: 18878)
        CPU: 11ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─1663 /usr/bin/nvidia-persistenced --verbose

nov 17 10:06:21 LT-B systemd[1]: Starting NVIDIA Persistence Daemon...
nov 17 10:06:21 LT-B nvidia-persistenced[1663]: Verbose syslog connection opened
nov 17 10:06:21 LT-B nvidia-persistenced[1663]: Directory /var/run/nvidia-persistenced will not be removed on exit
nov 17 10:06:21 LT-B nvidia-persistenced[1663]: Started (1663)
nov 17 10:06:21 LT-B nvidia-persistenced[1663]: device 0000:01:00.0 - registered
nov 17 10:06:21 LT-B nvidia-persistenced[1663]: device 0000:01:00.0 - persistence mode enabled.
nov 17 10:06:21 LT-B nvidia-persistenced[1663]: device 0000:01:00.0 - NUMA memory onlined.
nov 17 10:06:21 LT-B nvidia-persistenced[1663]: Local RPC services initialized
nov 17 10:06:21 LT-B systemd[1]: Started NVIDIA Persistence Daemon.
LT-B:~ #

See also the last few lines of the power management chapter.
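For reference, a minimal sketch of what that chapter describes (the option and service names are from the NVIDIA README; the modprobe file name is only an example):

# /etc/modprobe.d/99-nvidia-pm.conf  (example file name)
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp

# rebuild the initrd so the options are picked up early, then enable
# the suspend/resume helper services shipped with the driver:
sudo dracut -f
sudo systemctl enable nvidia-suspend.service nvidia-resume.service nvidia-hibernate.service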

Thanks for the reply.
Unfortunately, I don’t have the level of knowledge to understand everything.

However, I did check a few things:

  • My machine has two fans, both of which are working.

  • I have now updated the OS, and the MOKs have been enrolled.

  • When I restart, nvidia-smi gives me the following response:

nvidia-smi
Fri Nov 21 14:52:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8              9W /   80W |      13MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3455      G   /usr/bin/gnome-shell                      3MiB |
+-----------------------------------------------------------------------------------------+
  • After attempting to run a Python script that is known to be functional, I see that the GPU is no longer working.

workon OCR_openCV_2
switcherooctl launch -g 1 python OCR_openCV_2_tst.py

nvidia-smi then gives me the following response:

nvidia-smi
Fri Nov 21 15:01:35 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 ...    Off |   00000000:01:00.0 N/A |                  N/A |
|ERR!  ERR! ERR!             N/A  /  N/A  |      13MiB /   8192MiB |     N/A      Default |
|                                         |                        |                 ERR! |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I tried the diagnostic command,

sudo sh nvidia-bug-report.sh

but the output is too large to fit in a post.

What bothers me is that the fan speed doesn’t increase when I try to use the GPU.

OK, so after a cold boot the Nvidia GPU works. You can run nvtop and look for anomalies (e.g. temperature, spikes in usage, abnormal processes showing up in the list…). You may need to install it with zypper in nvtop if you don’t find it.
Please use the system “normally” for some time, not with that Python script (about which we know nothing…).
If the system then works properly for, say, 30 minutes or so, maybe there is something odd in the mentioned script, which might overload or otherwise interfere with the system.
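Something like:

sudo zypper in nvtop   # install it if missing
nvtop                  # live view of GPU temperature, utilization and process list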

@DadoooR3 Have you checked the system power supply and the GPU power connection?

Thank you for your interest in my concerns.

@OrsoBruno

nvtop is now running
I’ll observe what happens when using the PC normally.

After 5 minutes, Device 0 (NVIDIA) is functional and Device 1 (AMD) is also running. The machine is currently running on battery power.

I’ll give you feedback in 1 or 2 days.

@MalcolmLewis

I can see the usefulness of the approach, but I don’t know how to do it.

Are there any diagnostic commands?

@DadoooR3 Unless your sensors can show the power information (sensors-detect --auto), perhaps the system BIOS can; otherwise you’ll need a power-supply tester or a multimeter.
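For example:

sudo sensors-detect --auto   # probe for monitoring chips, accepting the defaults
sensors                      # print the detected voltage/temperature/fan readings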

Ahh, Nvidia/AMD, so it could be one or the other fighting for control…

@malcolmlewis

Yes, the world of GPUs is a delicate one.
I am very grateful to you for looking into my problem.

I analyzed a few things:

When working on battery power, I can:

  • Shut down my machine or restart it from the graphical interface.

  • Launch applications with the dedicated graphics card (right-click, etc.).

  • Launch my Python script with switcherooctl without any problems or crashes.
(There was another small problem, due to my carelessness: to run OpenCV under Wayland, I had to add a few lines of code:

# at the beginning of the script:
# handle a Wayland <-> X11 display problem
import os
os.environ["XDG_SESSION_TYPE"] = "xcb"
# at the end of the script, switch back
os.environ["XDG_SESSION_TYPE"] = "wayland"

Otherwise, I wouldn’t be able to work on the image…
I had omitted the last line…)

In short, everything works here.
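A variant I may still try, assuming the problem is only which display backend OpenCV picks: set the variable for that single process from the shell instead of patching the script, so nothing has to be restored at the end:

XDG_SESSION_TYPE=xcb switcherooctl launch -g 1 python OCR_openCV_2_tst.py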

When I work with the charger plugged in:

  • Shutting down or restarting my machine does not work (black screen, no shutdown, no cursor).

  • Launching applications with the dedicated graphics card (right-click, etc.) works until I launch the Python script with switcherooctl. From that moment on, the indicators in nvtop no longer show anything, as if the connection between the graphics card and nvtop had been lost.
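When that happens, I suppose the kernel log would show the same Xid errors as in my first report; I can check with something like:

sudo journalctl -k -b | grep -iE 'xid|nvrm'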

Other observations:
In nvtop, no fan is reported for the Nvidia graphics card, but its temperature stays within 1°C of the AMD processor’s.

I am still looking into power supply issues… I need to find something to measure with…

@OrsoBruno

After reading the README file, I tried the command
nvidia-persistenced --verbose

The answer is

nvidia-persistenced failed to initialize. Check syslog for more details

I can’t find syslog.

Where can I find the information?

Disregard that info. On openSUSE nvidia-persistenced is run as a systemd service:

LT-B:~ # nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
LT-B:~ # systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-11-25 09:32:28 CET; 4min 49s ago
 Invocation: 3612879bcea640eba24f8d3ca6962ead
    Process: 1666 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=0/SUCCESS)
   Main PID: 1680 (nvidia-persiste)
      Tasks: 1 (limit: 18878)
        CPU: 14ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─1680 /usr/bin/nvidia-persistenced --verbose

nov 25 09:32:28 LT-B systemd[1]: Starting NVIDIA Persistence Daemon...
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Verbose syslog connection opened
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Directory /var/run/nvidia-persistenced will not be removed on exit
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Started (1680)
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: device 0000:01:00.0 - registered
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: device 0000:01:00.0 - persistence mode enabled.
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: device 0000:01:00.0 - NUMA memory onlined.
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Local RPC services initialized
nov 25 09:32:28 LT-B systemd[1]: Started NVIDIA Persistence Daemon.
LT-B:~ #

Anyway, the relevant system log can also be accessed with:

LT-B:~ # journalctl -b --unit nvidia-persistenced
nov 25 09:32:28 LT-B systemd[1]: Starting NVIDIA Persistence Daemon...
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Verbose syslog connection opened
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Directory /var/run/nvidia-persistenced will not be removed on exit
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Started (1680)
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: device 0000:01:00.0 - registered
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: device 0000:01:00.0 - persistence mode enabled.
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: device 0000:01:00.0 - NUMA memory onlined.
nov 25 09:32:28 LT-B nvidia-persistenced[1680]: Local RPC services initialized
nov 25 09:32:28 LT-B systemd[1]: Started NVIDIA Persistence Daemon.
LT-B:~ #

Remove the “-b” option to also see results from previous boots.
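For a specific earlier boot, something like:

LT-B:~ # journalctl --list-boots
LT-B:~ # journalctl -b -1 --unit nvidia-persistenced

the first command lists the available boots, the second shows the unit’s log for the previous boot.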

With the command

sudo journalctl -b --unit nvidia-persistenced
or
sudo journalctl --unit nvidia-persistenced
I get
-- No entries --

and in the service manager the service is shown as inactive (dead), to be started manually.

Is this normal?

No: as you can see, the nvidia-persistenced service is reported as inactive (dead) [Inactif (Mort)].
On a properly working system the service should be started at boot [Au démarrage du système] and not manually [Manuellement], so something appears to be wrong, unless you willingly configured the system that way.
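Assuming you did not configure it that way on purpose, re-enabling and starting it should be as simple as:

sudo systemctl enable --now nvidia-persistenced
systemctl status nvidia-persistenced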

The relevant Nvidia services are now enabled at startup.

The situation is improving.

When waking from sleep mode, the GPU is now available.

When the battery is charging, the problems with shutdown and restart remain. Hibernation causes a freeze with the failure message:

ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:2:0:4040

Now I have updated the system.

The MOK has been enrolled.

Now I can restart and shut down the machine whether it is plugged in or running on battery power.

Thank you for your perseverance.

Next, I will test the GPU functionality again.

Unfortunately, using the GPU is still difficult.
I am currently preparing a new installation of the machine.
It is preferable that contributors’ energy be invested in other issues.
Thank you for your help.