bbswitch cannot power down nvidia dGPU: device is in use by driver 'nvidia', refusing OFF

Hi

I am running an XMG Neo 15 (2019) = XNE15M19 = Tongfang GK5CP0Z with an “Intel Corporation UHD Graphics 630 (Mobile)” iGPU and an “NVIDIA Corporation TU106M [GeForce RTX 2060 Mobile] (rev a1)” dGPU.
Here’s lspci:

00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile)
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1d.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #14 (rev f0)
00:1e.0 Communication controller: Intel Corporation Device a328 (rev 10)
00:1f.0 ISA bridge: Intel Corporation Device a30d (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation TU106M [GeForce RTX 2060 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU106 High Definition Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU106 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU106 USB Type-C Port Policy Controller (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

I had been using prime-select to switch to the Intel card, but it turned out this only makes Intel do the rendering while the nvidia card stays powered on.
So I installed bbswitch.
The card state output is unfortunately always the same:

# cat /proc/acpi/bbswitch 
0000:01:00.0 ON

Writing a value has no effect:

# tee /proc/acpi/bbswitch <<<OFF && cat /proc/acpi/bbswitch
OFF
0000:01:00.0 ON

When writing to /proc/acpi/bbswitch, journalctl reports:

kernel: bbswitch: device 0000:01:00.0 is in use by driver 'nvidia', refusing OFF
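
(Side note: the binding that bbswitch complains about is also visible in sysfs - the “driver” symlink exists only while a driver is bound - so this is a quick way to check the card’s state without bbswitch:

# readlink /sys/bus/pci/devices/0000:01:00.0/driver

It prints a path ending in .../drivers/nvidia while the card is bound, and nothing once the driver lets go.)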

Indeed nvidia is loaded:

# lsmod | grep nvidia
i2c_nvidia_gpu         16384  0
nvidia              18825216  9
ipmi_msghandler        65536  2 ipmi_devintf,nvidia

(This is after attempting a prime-select intel, which successfully unloaded nvidia_drm and nvidia_uvm)

Apparently, the X server is holding nvidia open:

# lsof | grep /dev/nvidia
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
X         2183                      root   15u      CHR            195,255        0t0        151 /dev/nvidiactl
X         2183                      root   18u      CHR              195,0        0t0      18487 /dev/nvidia0
X         2183                      root   19u      CHR              195,0        0t0      18487 /dev/nvidia0
X         2183 2249 X:disk$0        root   15u      CHR            195,255        0t0        151 /dev/nvidiactl
X         2183 2249 X:disk$0        root   18u      CHR              195,0        0t0      18487 /dev/nvidia0
X         2183 2249 X:disk$0        root   19u      CHR              195,0        0t0      18487 /dev/nvidia0
X         2183 2334 X:disk$0        root   15u      CHR            195,255        0t0        151 /dev/nvidiactl
X         2183 2334 X:disk$0        root   18u      CHR              195,0        0t0      18487 /dev/nvidia0
X         2183 2334 X:disk$0        root   19u      CHR              195,0        0t0      18487 /dev/nvidia0
X         2183 2339 InputThre       root   15u      CHR            195,255        0t0        151 /dev/nvidiactl
X         2183 2339 InputThre       root   18u      CHR              195,0        0t0      18487 /dev/nvidia0
X         2183 2339 InputThre       root   19u      CHR              195,0        0t0      18487 /dev/nvidia0

But this does not seem to be the root cause, since bbswitch already fails to disable the dGPU during boot, long before Xorg is started.

Here is an excerpt from journalctl -b:

Jun 24 02:55:05 felicity kernel: nvidia: loading out-of-tree module taints kernel.
Jun 24 02:55:05 felicity kernel: nvidia: module license 'NVIDIA' taints kernel.
Jun 24 02:55:05 felicity kernel: Disabling lock debugging due to kernel taint
Jun 24 02:55:05 felicity kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jun 24 02:55:05 felicity kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Jun 24 02:55:05 felicity kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
Jun 24 02:55:05 felicity kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Jun 24 02:55:05 felicity systemd[1]: Reloading.
Jun 24 02:55:05 felicity kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  430.26  Tue Jun  4 17:40:52 CDT 2019
Jun 24 02:55:05 felicity kernel: nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241
Jun 24 02:55:05 felicity systemd[1]: Found device Samsung SSD 970 EVO 2TB 5.
Jun 24 02:55:05 felicity systemd[1]: Found device Samsung SSD 970 EVO 2TB BOOT.
Jun 24 02:55:05 felicity systemd[1]: Starting Cryptography Setup for cr_nvme-Samsung_SSD_970_EVO_2TB_S46ENB0M201744D-part5...
Jun 24 02:55:05 felicity kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  430.26  Tue Jun  4 17:45:09 CDT 2019
Jun 24 02:55:05 felicity systemd-cryptsetup[604]: Set cipher aes, mode xts-plain64, key size 512 bits for device /dev/disk/by-uuid/3482b2b4-9af9-4969-8c49-c07d1364e06f.
Jun 24 02:55:05 felicity kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Jun 24 02:55:05 felicity kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
...
Jun 24 02:55:10 felicity kernel: bbswitch: device 0000:01:00.0 is in use by driver 'nvidia', refusing OFF
Jun 24 02:55:10 felicity kernel: bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on
...
Jun 24 02:55:10 felicity kernel: nvidia-gpu 0000:01:00.3: enabling device (0000 -> 0002)

In /etc/modprobe.d, it confuses me that there is both a “50-”-prefixed and an unprefixed version of nvidia-default.conf:

root@felicity:/etc/modprobe.d # exa -l          
Permissions Size User Date Modified Name
.rw-r--r--  3,8k root 12 Jun 18:28  00-system.conf
.rw-r--r--  1,2k root 14 Mär 16:58  10-unsupported-modules.conf
.rw-r--r--    45 root 12 Jun 20:03  50-bbswitch.conf
.rw-r--r--  5,0k root 14 Mär 16:58  50-blacklist.conf
.rw-r--r--   128 root 12 Jun 18:40  50-bluetooth.conf
.rw-r--r--    33 root 12 Jun 18:46  50-ipw2200.conf
.rw-r--r--    34 root 12 Jun 18:46  50-iwl3945.conf
.rw-r--r--  1,2k root 19 Jun 20:20  50-nvidia-default.conf
.rw-r--r--    18 root 12 Jun 18:46  50-prism54.conf
.rw-r--r--   668 root 12 Jun 18:28  60-blacklist_fs-adfs.conf
...
.rw-r--r--   664 root 12 Jun 18:28  60-blacklist_fs-ufs.conf
.rw-r--r--    47 root 14 Mär 16:58  99-local.conf
.rw-r--r--   158 root 12 Jun 19:01  firewalld-sysctls.conf
.rw-r--r--    18 root 12 Jun 17:06  nvidia-default.conf
.rw-r--r--   674 root  5 Apr 10:49  tuned.conf

Here are the contents:

root@felicity:/etc/modprobe.d # cat 50-nvidia-default.conf 
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=480 NVreg_DeviceFileMode=0660
install nvidia PATH=$PATH:/bin:/usr/bin; if /sbin/modprobe --ignore-install nvidia; then   if /sbin/modprobe nvidia_uvm; then     if [ ! -c /dev/nvidia-uvm ]; then       mknod -m 660 /dev/nvidia-uvm c $(cat /proc/devices | while read major device; do if [ "$device" == "nvidia-uvm" ]; then echo $major; break; fi ; done) 0;        chown :video /dev/nvidia-uvm;     fi;   fi;   if [ ! -c /dev/nvidiactl ]; then     mknod -m 660 /dev/nvidiactl c 195 255;     chown :video /dev/nvidiactl;   fi;   devid=-1;   for dev in $(ls -d /sys/bus/pci/devices/*); do      vendorid=$(cat $dev/vendor);     if [ "$vendorid" == "0x10de" ]; then       class=$(cat $dev/class);       classid=${class%%00};       if [ "$classid" == "0x0300" -o "$classid" == "0x0302" ]; then          devid=$((devid+1));         if [ ! -c /dev/nvidia${devid} ]; then            mknod -m 660 /dev/nvidia${devid} c 195 ${devid};            chown :video /dev/nvidia${devid};         fi;       fi;     fi;   done;   /sbin/modprobe nvidia_drm;   if [ ! -c /dev/nvidia-modeset ]; then     mknod -m 660 /dev/nvidia-modeset c 195 254;     chown :video /dev/nvidia-modeset;   fi; fi
root@felicity:/etc/modprobe.d # cat nvidia-default.conf
blacklist nouveau
root@felicity:/etc/modprobe.d #

bbswitch is configured to power the dGPU off when its module loads (load_state=0) and back on when it unloads (unload_state=1); according to journalctl -b it does attempt the power-off during boot:

root@felicity:/etc/modprobe.d # cat 50-bbswitch.conf 
options bbswitch load_state=0 unload_state=1

I found various hints about ACPI kernel parameters, but none of the following were successful:

acpi_osi=! acpi_osi=Linux
acpi_osi=! acpi_osi="Windows 2009"
acpi_osi=! acpi_osi="Windows 2013"
acpi_osi=! acpi_osi="Windows 2015"
acpi_osi=! acpi_osi="Windows 2017"
acpi_osi=! acpi_osi="Windows 2018"

I also tried adding a “blacklist nvidia” to /etc/modprobe.d/50-nvidia-default.conf, but that caused the kernel to fail booting: it just hung endlessly very early in the boot process. Luckily I have a btrfs setup and could do a snapper rollback.
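
(I later read that a plain “blacklist” line only prevents loading by alias and does not stop an explicit modprobe call; the usual stronger variant is an install override along the lines of

install nvidia /bin/false

in a modprobe.d file. I have not re-tried this here given the hang, and I am not sure how it would interact with the distro’s own “install nvidia ...” rule shown above.)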

The BIOS/EC does not appear to have an option to disable the dGPU.

So this is where I’m at my wits’ end. Does anyone have an idea what else I could try?

Thank you so much in advance! This will finally give me a proper battery life… I hope :slight_smile:

PS (sorry if I was supposed to edit the OP - I did not find an edit button anywhere. Please let me know if there is one at all. :))

One last thing I tried in a separate tty (Ctrl+Alt+F1):

# systemctl stop display-manager
# rmmod nvidia
# lsmod | grep nvidia
i2c_nvidia_gpu      16304   0
# cat /proc/acpi/bbswitch
0000:01:00.0 ON
# tee /proc/acpi/bbswitch <<< OFF
OFF
# cat /proc/acpi/bbswitch
0000:01:00.0 OFF
# systemctl start display-manager.service

So as you can see, I could finally unload the nvidia kernel module after shutting down Xorg, and bbswitch was then able (according to its output) to power down the nvidia card.
BUT as soon as I started display-manager.service again, the system froze the same way it does at boot when I pass blacklist parameters like

modprobe.blacklist=nvidia,nvidia_drm,nvidia_uvm,nvidia_modeset

After a hardware power cycle (holding down the power button to force a cold restart), there is no journalctl log for that Xorg crash/hang/freeze, nor an entry in /var/log/Xorg.0.log(.old).

PPS:

If instead of starting display-manager.service I run:

# prime-select intel

then the system freezes just exactly the same way.

Hi, see if this can help you: https://forums.opensuse.org/showthread.php/536494-issue-with-OpenSuse-Prime-drivers?p=2906420#post2906420

Hi OrsoBruno,

Thank you for the hint - this is something I had not yet found.
However, unfortunately this does not seem to be it:

Unlike the reporter in https://github.com/Witko/nvidia-xrun/issues/32 , I can unload nvidia_drm just fine.

I can unload nvidia_drm, nvidia_uvm and nvidia_modeset without stopping X.
But I can only unload (rmmod) the main nvidia module after “systemctl stop display-manager.service” or e.g. “systemctl isolate multi-user.target”.

I seem to have three issues now, and I am not sure if they are related:

Issue 1) The nvidia kernel module is loaded before the bbswitch kernel module can disable the dGPU. See my OP under “Here is an excerpt from journalctl -b:”. I wonder if the kernel module load order is wrong and bbswitch should be loaded before nvidia. But another big question is why nvidia is loaded at all. (I lack experience and knowledge regarding the details of kernel module loading, so maybe for you this is no big question at all. I am always happy to learn! :))
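
Two things I still want to check for Issue 1 (both are guesses on my side, not verified yet): whether the nvidia module already sits inside the initrd, which would explain the early load, and whether a modprobe softdep can pull bbswitch in before nvidia:

# lsinitrd | grep -i nvidia

and, in a new file such as /etc/modprobe.d/51-bbswitch-first.conf (file name made up by me):

softdep nvidia pre: bbswitch

(A softdep only helps when nvidia is loaded via modprobe; it cannot reorder modules that the initrd insmods directly.)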

Issue 2) I cannot unload the nvidia driver without stopping X, even though prime-select has only configured the Intel UHD card in /etc/X11/xorg.conf.d/90-intel.conf (and 90-nvidia.conf is not present).
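
(A compact way to see who still holds the device nodes open, assuming the psmisc package is installed:

# fuser -v /dev/nvidia*

This shows the same information as the lsof output in my OP, one line per process and device node.)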

Issue 3) The system freezes on certain operations (I have been unable to isolate a cause or make sense of the freezes) after nvidia is unloaded (OR after successfully switching off the NVIDIA dGPU).

3a) If I disable the nvidia modules before boot via grub-edited kernel parameters (modprobe.blacklist=nvidia_drm,nvidia_uvm,nvidia_modeset,nvidia), then the system freezes during boot. It does not react to any keys (no tty changes, no NumLock LED etc.), and the only way out is holding down the power button until the hardware shuts off.

3b) If I unload the nvidia module in a running system and then try to switch off the card via “tee /proc/acpi/bbswitch <<< OFF”, then the system freezes as well.
However, the behavior here seems inconsistent: last night I was able to switch off the card and confirm it OFF via “cat /proc/acpi/bbswitch”, and the system only froze when I tried to start display-manager.service again. Maybe a process only fails a certain time after the dGPU was switched off?

3b I.) Here’s what I just tried, following the link you posted above (I’m typing this off a photo of my screen taken after the freeze):

- Fresh boot to sddm.
- No login to plasma, but instead Ctrl+Alt+F1 tty switch.
- Login as root on tty1

# lsof | grep /dev/nvidia
(12 lines showing process X (same pid for all 12 lines) holding open /dev/nvidiactl and /dev/nvidia0 even though nvidia is not configured in /etc/X11/xorg.conf.d/)
# systemctl isolate multi-user.target
(sddm and X are stopped successfully)
# systemctl stop systemd-logind
(stopping systemd-logind does not seem to make a difference here! nvidia_drm is not the problem here and can be unloaded at any time, even from Konsole in a plasma session.)
# lsof | grep /dev/nvidia
(no output now, since X is stopped)
# rmmod nvidia_drm nvidia_uvm nvidia_modeset nvidia
(no output =success - this always works as soon as X is stopped and does not depend on stopping systemd-logind)
# cat /proc/acpi/bbswitch
0000:01:00.0 ON
# tee /proc/acpi/bbswitch <<< OFF
OFF
(the cursor blinks about three times on a new line after printing "OFF",
 then everything freezes and I can only hard-reset by holding down the power button)

3b II.) Last night I was able to “cat /proc/acpi/bbswitch” and “systemctl start display-manager.service” before the system froze. See #2 ( https://forums.opensuse.org/showthread.php/536529-bbswitch-cannot-power-down-nvidia-dGPU-device-is-in-use-by-driver-nvidia-refusing-OFF?p=2906708#post2906708 )

I am unsure whether this is a matter of which command is run at the time of the freeze; it might rather be a timing issue where a certain process hangs up {n} seconds after the dGPU was switched off.
This theory is supported by my attempt to modprobe.blacklist nvidia at boot time with the “quiet” option removed: the last boot output before the freeze was something related to snd_intel_hda, which is most likely unrelated to the actual issue. I expected nvidia, bbswitch, X or sddm/display-manager to appear in the last line before the freeze.
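
If I get another opportunity, one way to test the timing theory might be to keep the kernel log streaming on the console while switching off, so the last messages at least stay on screen even if journald can no longer write them to disk:

# dmesg --follow &
# tee /proc/acpi/bbswitch <<< OFF

(So far nothing has survived my hard resets on disk, which fits a full hang rather than a crash.)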

A closing note:
This is a pretty freshly installed Tumbleweed from a USB image. It cannot have diverged much from Tumbleweed’s defaults. In any case I’ll be happy to share any configuration or log output.

PS:
Another behavior that surprised me last night was this:
If I open a tty, log in as root, stop the display manager and do “prime-select intel”, then the system freezes.

But if I open a tty, log in as root, stop the display manager and first do “prime-select nvidia” and then “prime-select intel”, then it does not freeze. (I will re-try this right now and come back to you in another PS.)

Nvidia itself also works well: I can stop sddm, “prime-select nvidia” and successfully run a plasma session based on X with nvidia.
Without bbswitch I can also run plasma sessions on X with intel (my default). Freezing only occurs as soon as switching off the dGPU is involved.

On Windows 10 (dual boot), Optimus does not cause freezes.

PPS:
I wrote above: “But if I open a tty, log in as root, stop the display manager and first do ‘prime-select nvidia’ and then ‘prime-select intel’, then it does not freeze. (I will re-try this right now and come back to you in another PS.)”
Now here’s the results:

  • logged off plasma
  • Ctrl+Alt+F1 to tty1
  • login as root

# systemctl isolate multi-user.target
# prime-select nvidia
Driver configured: nvidia
# prime-select intel
Driver configured: intel
# _

This supports the timing theory: prime-select runs through until the end and prints “Driver configured: intel”. At this point, the dGPU must be OFF, else prime-select would complain.
Before the system freezes, zsh shows the new “#” root prompt and the “_” cursor (which then stops blinking and randomly remains either shown or hidden until my hard reset).

Previously, when I had done a “tee /proc/acpi/bbswitch <<< OFF”, “OFF” was printed (tee echoes what it writes), but the system froze before zsh could render the new “#” prompt. This seems more like a random occurrence than a rule to me.

  1. I expected TLP to interfere with ACPI, so I disabled it (systemctl disable tlp; systemctl stop tlp) and re-tried. The system still froze. (I had a default /etc/default/tlp without any modifications to RUNTIME_PM_DRIVER_BLACKLIST (the config file comment says the nvidia driver is excluded for tlp by default) or RUNTIME_PM_BLACKLIST.)

  2. The freeze might depend on the card’s power state?

Here’s a photo of what I describe below:
https://i.ibb.co/8PC3cXt/IMG-20190624-191728.jpg

  • reboot after having disabled tlp. Confirmed that tlp is dead & disabled.
  • logged out of plasma
  • tty1
  • login as root
  • isolate multi-user.target
  • prime-select nvidia (successful)
  • start graphical.target
  • switch to tty7, log into plasma - works.
  • log out from plasma
  • switch back to tty1 (root still logged in)
  • isolate multi-user.target
  • trying to catch some log output before the freeze by running, on one line: prime-select intel; journalctl -f
  • to my surprise, journalctl output shows up and the system stays responsive.
  • On the photo you can see that the nvidia modules are unloaded and bbswitch disables the dGPU.
  • Then the kernel says “Refused to change power state, currently in D0”. (0000:01:00.0 is indeed my nvidia gpu pci path)
  • Hit Ctrl+C; the system responds by returning to the zsh root prompt
  • Hit the arrow-up key, then Enter. I just wanted to run journalctl -f again and forgot that this line also contained prime-select intel.
  • System freezes.

=> Switching to intel succeeds if nvidia was selected immediately before.
=> If switching to intel twice in a row, the system freezes.
=> If switching to nvidia, then intel, the system stays responsive until the display manager is started, which freezes it.

Following https://github.com/Bumblebee-Project/Bumblebee/issues/664 I checked powertop tunables:

>> Bad           Runtime PM for PCI Device NVIDIA Corporation TU106M [GeForce RTX 2060 Mobile]

Runtime PM seems off. Hitting Enter on this tunable twice shows:

>> echo 'on' > '/sys/bus/pci/devices/0000:01:00.0/power/control';

This seems to be the desired state for bbswitch to work properly. (The alternative option, causing powertop to say “Good”, would be ‘auto’.)
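
For completeness, the two states that powertop toggles here are plain sysfs writes, so they can also be set by hand:

# echo on > /sys/bus/pci/devices/0000:01:00.0/power/control
(keep the card in D0, i.e. runtime PM disabled; what powertop calls “Bad”)
# echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control
(allow kernel runtime PM to suspend the card; powertop’s “Good”)
# cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
(prints “active” or “suspended”)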

Here are all tunables:

PowerTOP v2.10    Overview   Idle stats   Frequency stats   Device stats   Tunables   WakeUp                            
>> echo 'on' > '/sys/bus/pci/devices/0000:01:00.0/power/control';

   Bad           Enable SATA link power management for host0
   Bad           Enable Audio codec power management
   Bad           Autosuspend for USB device USB Gaming Mouse [Logitech]
   Bad           Autosuspend for USB device ITE Device(8291) [ITE Tech. Inc.]
   Bad           Runtime PM for PCI Device Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH PCI Express Root Port #9
>> Bad           Runtime PM for PCI Device NVIDIA Corporation TU106M [GeForce RTX 2060 Mobile]                          
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH PCI Express Root Port #21
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH SPI Controller
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH Shared SRAM
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH cAVS
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH SMBus Controller
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0
   Bad           Runtime PM for PCI Device Intel Corporation Cannon Lake PCH Thermal Controller
   Bad           Runtime PM for PCI Device Intel Corporation Device a328
   Bad           Runtime PM for PCI Device Intel Corporation Device a30d
   Bad           Runtime PM for PCI Device NVIDIA Corporation TU106 USB Type-C Port Policy Controller
   Bad           Runtime PM for PCI Device Intel Corporation Wireless-AC 9560 [Jefferson Peak]
   Bad           Runtime PM for PCI Device NVIDIA Corporation TU106 USB 3.1 Host Controller
   Bad           Runtime PM for PCI Device NVIDIA Corporation TU106 High Definition Audio Controller
   Bad           Runtime PM for PCI Device Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
   Bad           Runtime PM for PCI Device Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
   Bad           Runtime PM for PCI Device Intel Corporation UHD Graphics 630 (Mobile)
   Good          Bluetooth device interface status
   Good          VM writeback timeout
   Good          NMI watchdog should be turned off
   Good          I2C Device i2c-UNIW0001:00 has no runtime power management
   Good          Autosuspend for USB device xHCI Host Controller [usb2]
   Good          Autosuspend for USB device HD Webcam [Chicony Electronics Co.,Ltd.]
   Good          Autosuspend for USB device USB3.0-CRW [Generic]
   Good          Autosuspend for USB device xHCI Host Controller [usb4]
   Good          Autosuspend for USB device xHCI Host Controller [usb1]
   Good          Autosuspend for USB device xHCI Host Controller [usb3]
   Good          Autosuspend for unknown USB device 1-14 (8087:0aaa)
   Good          Runtime PM for PCI Device Intel Corporation Cannon Lake PCH HECI Controller
   Good          Runtime PM for PCI Device Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16)
   Good          Runtime PM for PCI Device Intel Corporation Cannon Lake PCH PCI Express Root Port #14

My boot command line is:

Command line: BOOT_IMAGE=/boot/vmlinuz-5.1.7-1-default root=/dev/mapper/system-root quiet resume=/dev/system/swap acpi_osi=! "acpi_osi=Windows 2013" mitigations=auto

I have tried various acpi_osi values, and Windows 2013 appears to work best for me in terms of keyboard lighting and screen backlight control. None of the acpi_osi values I tried made any difference to the bbswitch problem.
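
(For anyone trying the same: in /etc/default/grub the inner quotes need escaping, i.e. something like

GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_osi=! acpi_osi=\"Windows 2013\" mitigations=auto"

followed by grub2-mkconfig -o /boot/grub2/grub.cfg on openSUSE.)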

According to https://github.com/Bumblebee-Project/Bumblebee/issues/978 bbswitch should not be needed any more to power down the dGPU. Unloading nvidia modules should be enough.
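
My reading of that issue (untested on my machine, so take it with a grain of salt): on recent kernels, once no driver is bound, runtime PM alone should be able to power the card down if it is enabled on the device:

# rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
# echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control
# cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status

which would make bbswitch redundant where that works.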

Am I doing this all wrong?

Sorry, it has been a long time since I last ran suse-prime on TW, and I cannot set up a test system at the moment. What I can do is point you to the SDB page https://en.opensuse.org/SDB:NVIDIA_SUSE_Prime , where I find no mention of bbswitch.
If strictly following the SDB doesn’t work, maybe somebody else with recent suse-prime experience can help.

Hey OrsoBruno,

Thanks for caring!

Indeed, https://en.opensuse.org/SDB:NVIDIA_SUSE_Prime does not mention bbswitch - but prime-select itself complains if bbswitch is not installed.

Maybe that’s a prime-select bug, if bbswitch is no longer needed or recommended for power management of dGPUs?

AFAIK bbswitch is still part of the bumblebee installation, so apparently it is still needed in some setups. The fact is that power management of dGPUs is tricky, and running into problems is not that uncommon.
Maybe your driver/GPU has problems recovering from some sleep state, and the system freezes waiting for a signal from the GPU that never comes (I’ve seen that with the nouveau driver; I haven’t used the latest proprietary drivers).
In general, fiddling with power management tools like TLP, powertop etc. can be problematic on Optimus laptops.
I notice that your GPU is a fairly new type, so maybe check for an updated driver; googling for your Turing GPU might reveal whether others are seeing related problems.
I’m sorry, but to fully debug this you need a better expert than me :frowning:

bbswitch is part of the bumblebee system, but prime-select / suse-prime is unrelated to bumblebee. At first I had only suse-prime installed and was surprised to find the script complaining that it could not shut down the card because bbswitch was not installed.

Indeed I might be suffering from a variation of this bug:

https://bugzilla.kernel.org/show_bug.cgi?id=156341

I had been running the nvidia proprietary drivers and found the system freezing during prime-select, which might be the described “few moments later an AML_INFINITE_LOOP is reported” from the kernel.org bugzilla. (However, my lspci seemed to return fine, since prime-select often managed to run through before the freeze occurred.)

In the meantime I uninstalled bbswitch, its kernel module and the entire proprietary nvidia shebang and basically gave up. My idealist heart cries, but I do still have a Windows 10 grub option to play games if I must. (How about video editing, though? :sarcastic:)

I still often (but not always) had the system freeze when hitting sddm after a suspend/resume, but today I have already had a bunch of suspend/resume cycles without freezes since booting with these added kernel boot parameters:

acpi_osi=! acpi_osi=Linux acpi_rev_override=1 nouveau.modeset=0 pcie_aspm=force drm.vblankoffdelay=1 scsi_mod.use_blk_mq=1 nouveau.runpm=0 mem_sleep_default=deep

(From comment #7 on Ubuntu bug #1803179 against the linux package.)
The battery drain is still far from satisfying (at most 1.5 h until the 62 Wh battery is empty). I have no idea whether nouveau.runpm=0 means ACPI can’t switch off the card either, or whether it just takes the card’s power management off nouveau’s task list.

Thanks so much for your help, though! :slight_smile:

Hi again. I see that 3 days ago there was a commit to a suse-prime package that uses bbswitch to power down the dGPU: https://software.opensuse.org/package/suse-prime-beta-bbswitch?search_term=suse-prime-beta-bbswitch
Don’t know if it might solve your problem, and as usual with “experimental” packages, use it at your own risk.
If you haven’t thrown in the towel yet, it might be worth a try.
Have fun!