Problems with PCIe Passthrough

So I’ve been trying to pass my second GPU through to a Windows VM today. Setting up the IOMMU and getting the correct kernel modules to load went without trouble: the GPU I want to pass through binds to the vfio-pci driver, and I have a working Windows 10 VM. But when I add the GPU to the VM as a PCI device, I get this strange allocation error. I’m really unsure what to make of it. Any help is appreciated, as I’ve been pulling my hair out all night going over my steps to see if I did something wrong, with no luck.

Unable to complete install: 'internal error: process exited while connecting to monitor: 2024-07-01T05:21:56.428786Z qemu-system-x86_64: vfio: hot reset info failed: No such file or directory
2024-07-01T05:21:56.430051Z qemu-system-x86_64: vfio: hot reset info failed: No space left on device
2024-07-01T05:21:56.433256Z qemu-system-x86_64: GLib: ../glib/gmem.c:177: failed to allocate 18446744058237017884 bytes'

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 71, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/createvm.py", line 2088, in _do_async_install
    installer.start_install(guest, meter=meter)
  File "/usr/share/virt-manager/virtinst/install/installer.py", line 737, in start_install
    domain = self._create_guest(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/share/virt-manager/virtinst/install/installer.py", line 679, in _create_guest
    domain = self.conn.createXML(initial_xml or final_xml, 0)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/site-packages/libvirt.py", line 4545, in createXML
    raise libvirtError('virDomainCreateXML() failed')
libvirt.libvirtError: internal error: process exited while connecting to monitor: 2024-07-01T05:21:56.428786Z qemu-system-x86_64: vfio: hot reset info failed: No such file or directory
2024-07-01T05:21:56.430051Z qemu-system-x86_64: vfio: hot reset info failed: No space left on device
2024-07-01T05:21:56.433256Z qemu-system-x86_64: GLib: ../glib/gmem.c:177: failed to allocate 18446744058237017884 bytes

Oh, I should’ve mentioned: I’m trying to do this with libvirt through virt-manager.

@jb309817 Hi and welcome to the Forum :smile:
So you have the IOMMU enabled, the KVM modules installed, and an aliases file set up for the device?
E.g.:

10-kvm.conf 
options kvm ignore_msrs=1
options kvm report_ignored_msrs=0

11-vfio.conf 
alias pci:v000010DEd00001FB2sv000010DEsd00001489bc03sc00i00 vfio-pci
alias pci:v000010DEd000010FAsv000010DEsd00001489bc04sc03i00 vfio-pci
options vfio-pci ids=10de:1fb2:10de:1489,10de:10fa:10de:1489
options vfio-pci disable_vga=1

But “No space left on device” would indicate a disk space issue. Are you sure you selected the correct device when adding the GPU?

Thank you for the warm welcome! Okay now that I’ve had some sleep let me go over a bit more what I did. Though I am passing through an AMD card I mainly followed this guide. I also referenced this guide a bit.

Right after I made this post I realized I had not written anything to kvm.conf and added options kvm ignore_msrs=1

I still got the same error, but after actually plugging a display cable into the card, to my surprise the VM booted and let me install drivers for the card. After a reboot, though, the exact same error came back. My computer has a fresh installation of Tumbleweed with plenty of disk space, and the VM has an 800 GB qcow virtual disk.

You mentioned some options in vfio.conf that I do not have, so I am going to try these now. In the meantime here is some of my configuration

/etc/modprobe.d/vfio.conf

options vfio-pci ids=1002:731f,1002:ab38

/etc/modprobe.d/kvm.conf

options kvm ignore_msrs=1

Occasionally it will give me this very similar error instead of the one in the original post

Error starting domain: internal error: QEMU unexpectedly closed the monitor (vm='win10'): 2024-07-01T15:12:51.479795Z qemu-system-x86_64: GLib: ../glib/gmem.c:177: failed to allocate 18446744073709551412 bytes

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 71, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 107, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 57, in newfn
    ret = fn(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1428, in startup
    self._backend.create()
  File "/usr/lib64/python3.11/site-packages/libvirt.py", line 1379, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: internal error: QEMU unexpectedly closed the monitor (vm='win10'): 2024-07-01T15:12:51.479795Z qemu-system-x86_64: GLib: ../glib/gmem.c:177: failed to allocate 18446744073709551412 bytes

I don’t understand what this allocation of so many bytes is even for. It’s not a configuration that I set up, and it persists across completely new installations as long as I use the GPU and audio PCI devices for my card.
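
A note on those byte counts: every one of them sits just below 2^64 (18446744073709551616), which is what a negative number looks like when it is reinterpreted as an unsigned 64-bit size. The sketch below is only an illustration and is not taken from QEMU or GLib; the helper name as_signed64 is made up for the example. It decodes the sizes quoted in this thread. The one from the second error, for instance, works out to -204. If that reading is right, the failure is more likely a size calculation that went negative than a genuine request for exabytes of memory, though nothing in the thread confirms where that happens.

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed two's-complement value."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value >= 2**63 else value


# Sizes copied from the GLib "failed to allocate" messages in this thread.
REPORTED_SIZES = [
    18446744058237017884,
    18446744073709551412,
    18446744058182308292,
    18446744065142707292,
]

if __name__ == "__main__":
    for size in REPORTED_SIZES:
        print(f"{size} -> {as_signed64(size)}")
```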

Here is dmesg from after trying to initialize the VM

[    T798] hid-generic 0003:1377:6004.0008: input,hiddev100,hidraw7: USB HID v1.11 Device [Sennheiser electronic GmbH & Co. KG MOMENTUM 3] on usb-0000:00:14.0-2.2/input2
[    T190] Bluetooth: hci0: Intel BT fw patch 0x43 completed & activated
[   T1147] Bluetooth: MGMT ver 1.22
[   T1417] NET: Registered PF_ALG protocol family
[   T1381] Generic FE-GE Realtek PHY r8169-0-600:00: attached PHY driver (mii_bus:phy_addr=r8169-0-600:00, irq=MAC)
[    T261] r8169 0000:06:00.0 enp6s0: Link is Down
[   T1381] iwlwifi 0000:07:00.0: Registered PHC clock: iwlwifi-PTP, with index: 0
[   T1518] NET: Registered PF_PACKET protocol family
[    T121] scsi 6:0:0:0: Direct-Access     Samsung  Flash Drive      1100 PQ: 0 ANSI: 6
[    T121] sd 6:0:0:0: Attached scsi generic sg2 type 0
[    T123] sd 6:0:0:0: [sdc] 125313283 512-byte logical blocks: (64.2 GB/59.8 GiB)
[    T123] sd 6:0:0:0: [sdc] Write Protect is off
[    T123] sd 6:0:0:0: [sdc] Mode Sense: 43 00 00 00
[    T123] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    T123]  sdc: sdc1 sdc2
[    T123] sd 6:0:0:0: [sdc] Attached SCSI removable disk
[    T763] ACPI: \_TZ_.TZ10: _PSL evaluation failure
[    T763] thermal LNXTHERM:00: registered as thermal_zone2
[    T763] ACPI: thermal: Thermal Zone [TZ10] (17 C)
[    T763] ACPI: \_TZ_.TZ20: _PSL evaluation failure
[    T763] thermal LNXTHERM:01: registered as thermal_zone3
[    T763] ACPI: thermal: Thermal Zone [TZ20] (17 C)
[    T763] thermal LNXTHERM:02: registered as thermal_zone4
[    T763] ACPI: thermal: Thermal Zone [TZ00] (28 C)
[   T1721] vfio-pci 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[   T1721] amdgpu 0000:08:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[   T1746] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   T1846] tun: Universal TUN/TAP device driver, 1.6
[   T1708] virbr0: port 1(vnet0) entered blocking state
[   T1708] virbr0: port 1(vnet0) entered disabled state
[   T1708] vnet0: entered allmulticast mode
[   T1708] vnet0: entered promiscuous mode
[   T1708] virbr0: port 1(vnet0) entered blocking state
[   T1708] virbr0: port 1(vnet0) entered listening state
[   T1853] vfio-pci 0000:03:00.0: enabling device (0002 -> 0003)
[   T1853] vfio-pci 0000:03:00.1: enabling device (0000 -> 0002)
[   T1853] show_signal: 149 callbacks suppressed
[   T1853] traps: qemu-system-x86[1853] trap int3 ip:7ff07346fb99 sp:7ffe847777a0 error:0 in libglib-2.0.so.0.8000.3[7ff07342a000+99000]
[   T1896] virbr0: port 1(vnet0) entered disabled state
[   T1896] vnet0 (unregistering): left allmulticast mode
[   T1896] vnet0 (unregistering): left promiscuous mode
[   T1896] virbr0: port 1(vnet0) entered disabled state
[    T704] systemd-journald[704]: File /var/log/journal/d7b7ea36e0124370ac74d61d7455946f/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[   T2446] Bluetooth: RFCOMM TTY layer initialized
[   T2446] Bluetooth: RFCOMM socket layer initialized
[   T2446] Bluetooth: RFCOMM ver 1.11
[    T152] usb 1-7: USB disconnect, device number 4
[    T152] usb 1-7.2: USB disconnect, device number 7
[    T152] usb 1-7: new high-speed USB device number 10 using xhci_hcd
[    T152] usb 1-7: New USB device found, idVendor=1a40, idProduct=0101, bcdDevice= 1.11
[    T152] usb 1-7: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[    T152] usb 1-7: Product: USB 2.0 Hub
[    T152] hub 1-7:1.0: USB hub found
[    T152] hub 1-7:1.0: 4 ports detected
[    T152] usb 1-7.2: new full-speed USB device number 11 using xhci_hcd
[    T152] usb 1-7.2: New USB device found, idVendor=258a, idProduct=003a, bcdDevice=10.02
[    T152] usb 1-7.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[    T152] usb 1-7.2: Product: Gaming KB 
[    T152] usb 1-7.2: Manufacturer: SINO WEALTH
[    T152] input: SINO WEALTH Gaming KB  as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7.2/1-7.2:1.0/0003:258A:003A.0009/input/input25
[    T152] hid-generic 0003:258A:003A.0009: input,hidraw5: USB HID v1.11 Keyboard [SINO WEALTH Gaming KB ] on usb-0000:00:14.0-7.2/input0
[    T152] input: SINO WEALTH Gaming KB  System Control as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7.2/1-7.2:1.1/0003:258A:003A.000A/input/input26
[    T152] input: SINO WEALTH Gaming KB  Consumer Control as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7.2/1-7.2:1.1/0003:258A:003A.000A/input/input27
[    T152] input: SINO WEALTH Gaming KB  Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7.2/1-7.2:1.1/0003:258A:003A.000A/input/input28
[    T152] hid-generic 0003:258A:003A.000A: input,hiddev99,hidraw6: USB HID v1.11 Keyboard [SINO WEALTH Gaming KB ] on usb-0000:00:14.0-7.2/input1
[    T152] r8169 0000:06:00.0 enp6s0: Link is Up - 100Mbps/Full - flow control rx/tx
[   T1696] virbr0: port 1(vnet1) entered blocking state
[   T1696] virbr0: port 1(vnet1) entered disabled state
[   T1696] vnet1: entered allmulticast mode
[   T1696] vnet1: entered promiscuous mode
[   T1696] virbr0: port 1(vnet1) entered blocking state
[   T1696] virbr0: port 1(vnet1) entered listening state
[   T4906] traps: qemu-system-x86[4906] trap int3 ip:7f923b9bab99 sp:7fff90929aa0 error:0 in libglib-2.0.so.0.8000.3[7f923b975000+99000]
[   T4936] virbr0: port 1(vnet1) entered disabled state
[   T4936] vnet1 (unregistering): left allmulticast mode
[   T4936] vnet1 (unregistering): left promiscuous mode
[   T4936] virbr0: port 1(vnet1) entered disabled state
[    T119] BTRFS info (device nvme1n1p3): qgroup scan completed (inconsistency flag cleared)

I should also note I’ve had very minimal but real hardware issues with this card before, which result in a driver crash under heavy load. I don’t think that is the cause, though, because as I said it did once boot up and let me install the Windows drivers; after a reboot of the VM it just went back to giving the same error.

@jb309817 you need both IDs from the output of /sbin/lspci -nnk | grep -E -A3 "VGA|Display|3D|Audio" for the GPU and sound device... Also add the alias line, using for example cat /sys/bus/pci/devices/0000:03:00.0/modalias, where in this case 03 is the bus ID.

Then run dracut -f --regenerate-all and reboot and test.
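
For anyone following along, here is a minimal sketch of turning the modalias string into the alias line shown in the 11-vfio.conf example earlier. The function names (alias_line, read_modalias) are hypothetical, and the regular expression only covers the pci:v…d…sv…sd…bc…sc…i… layout seen in this thread; treat it as a sanity check, not a complete parser.

```python
import re
from pathlib import Path

# Layout of the PCI modalias strings quoted in this thread.
MODALIAS_RE = re.compile(
    r"^pci:v[0-9A-F]{8}d[0-9A-F]{8}sv[0-9A-F]{8}sd[0-9A-F]{8}"
    r"bc[0-9A-F]{2}sc[0-9A-F]{2}i[0-9A-F]{2}$"
)


def alias_line(modalias: str, driver: str = "vfio-pci") -> str:
    """Build a modprobe.d alias line binding the given modalias to a driver."""
    modalias = modalias.strip()
    if MODALIAS_RE.match(modalias) is None:
        raise ValueError(f"unexpected modalias format: {modalias!r}")
    return f"alias {modalias} {driver}"


def read_modalias(address: str, sysfs_root: str = "/sys/bus/pci/devices") -> str:
    """Read the modalias of a PCI device such as 0000:03:00.0 from sysfs."""
    return (Path(sysfs_root) / address / "modalias").read_text().strip()


if __name__ == "__main__":
    # Addresses of the GPU and its audio function in this thread.
    for address in ("0000:03:00.0", "0000:03:00.1"):
        try:
            print(alias_line(read_modalias(address)))
        except (OSError, ValueError) as exc:
            print(f"{address}: skipped ({exc})")
```

The printed lines would go into a file under /etc/modprobe.d/ before regenerating the initrd.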

These are both the devices from lspci -nnk

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev c1)
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Reference RX 5700 XT [1002:0b36]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

I added the aliases as you described. It’s now doing this: I can get it to work if I remove the PCI devices from the VM, add them back, and uncheck ROM BAR for both devices. The VM will then boot correctly using the GPU, but after I reboot the VM I get this error.

Error starting domain: internal error: Unknown PCI header type '127' for device '0000:03:00.0'

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 71, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 107, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 57, in newfn
    ret = fn(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1428, in startup
    self._backend.create()
  File "/usr/lib64/python3.11/site-packages/libvirt.py", line 1379, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: internal error: Unknown PCI header type '127' for device '0000:03:00.0'

If I then restart libvirtd I get the original error again… I guess this is much better than it not working at all, though. EDIT: If I restart libvirtd I continue to get the Unknown PCI header type 127 error, but if I reboot the host I get the original error again. Removing and reattaching the PCI devices fixes it until another reboot.
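
One possible reading of the header type '127' message: the header type lives in byte 0x0E of PCI configuration space, and its low seven bits describe the header layout. A device that has stopped responding typically reads back as all ones (0xFF), and 0xFF masked to seven bits is 127. The sketch below checks for that pattern through the sysfs config file. The function names are hypothetical, and treating an all-ones read as "the card did not come back after the guest reset" is an assumption, not something this thread confirms.

```python
from pathlib import Path

HEADER_TYPE_OFFSET = 0x0E  # position of the header type byte in the standard header


def header_type(config: bytes) -> int:
    """Return the header layout field (low 7 bits of byte 0x0E)."""
    if len(config) <= HEADER_TYPE_OFFSET:
        raise ValueError("need at least 15 bytes of configuration space")
    return config[HEADER_TYPE_OFFSET] & 0x7F


def looks_unresponsive(config: bytes) -> bool:
    """True when the vendor and device ID both read back as all ones."""
    return config[:4] == b"\xff\xff\xff\xff"


def read_config(address: str, sysfs_root: str = "/sys/bus/pci/devices") -> bytes:
    """Read the first 64 bytes of configuration space for a PCI address."""
    return (Path(sysfs_root) / address / "config").read_bytes()[:64]


if __name__ == "__main__":
    address = "0000:03:00.0"  # the GPU address from this thread
    try:
        cfg = read_config(address)
    except OSError as exc:
        print(f"{address}: cannot read config space ({exc})")
    else:
        print(f"{address}: header type {header_type(cfg)}, "
              f"unresponsive={looks_unresponsive(cfg)}")
```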

Are you saying I also need to add the subsystem device IDs to vfio.conf as well?

If I just restart libvirtd the devices no longer show up as options for PCI devices to attach to the VM in virt-manager unless I restart my host as well. This is getting more and more strange.

@jb309817 yes it should be;

options vfio-pci ids=1002:731f:1002:0b36,1002:ab38:1002:ab38

The GPU IDs come first, then a comma, then the audio device IDs.
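
As a rough sketch of how that options line can be derived from the lspci -nnk output posted above: take the last [vendor:device] pair on each device line and append the pair from its Subsystem line. The helper name vfio_ids_option is made up for this example, and the parsing only assumes the output layout shown in this thread.

```python
import re

# Matches ID pairs such as [1002:731f]; class codes like [0300] have no colon.
ID_PAIR = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")

# lspci -nnk output as posted in this thread.
SAMPLE = """\
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev c1)
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Reference RX 5700 XT [1002:0b36]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
"""


def vfio_ids_option(lspci_text: str) -> str:
    """Build an 'options vfio-pci ids=...' line from lspci -nnk output."""
    entries = []
    for line in lspci_text.splitlines():
        pairs = ID_PAIR.findall(line)
        if not pairs:
            continue
        vendor, device = pairs[-1]
        if line[:1].isspace():
            # Indented detail line: only the Subsystem IDs are of interest.
            if line.strip().startswith("Subsystem:") and entries:
                entries[-1] += f":{vendor}:{device}"
        else:
            # Unindented line starts a new device.
            entries.append(f"{vendor}:{device}")
    return "options vfio-pci ids=" + ",".join(entries)


if __name__ == "__main__":
    print(vfio_ids_option(SAMPLE))
```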

So it seems to work more consistently now, as long as I remove and reattach the device after a host reboot. I’m still experiencing the issue of the device disappearing from the PCI device list in virt-manager after a reboot of the guest, but I am going to try installing Fedora’s guest tools inside the VM and see if that resolves it.

I guess the rebooting issue with the VM is a common issue with AMD cards

@jb309817 So in my case I have an Intel CPU and add intel_iommu=on to the kernel boot options (YaST Bootloader). Is this done on your side?

I connect locally or remotely with the libvirt-client, so no monitor attached, plus use Nvidia…
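
A quick way to confirm the boot option actually reached the running kernel is to look at /proc/cmdline. The sketch below is a minimal check with made-up helper names; it splits on whitespace only, so quoted parameters containing spaces are not handled, and it only looks for the intel_iommu option mentioned above.

```python
def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value-or-None} mapping."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def intel_iommu_on(cmdline: str) -> bool:
    """True when intel_iommu is present and its option list contains 'on'."""
    value = kernel_params(cmdline).get("intel_iommu")
    return value is not None and "on" in value.split(",")


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError as exc:
        print(f"cannot read /proc/cmdline ({exc})")
    else:
        print("intel_iommu=on present:", intel_iommu_on(current))
```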

Yes, I’ve done this. I just added those subsystem IDs as well. It seems to be an issue with either my motherboard BIOS or the vBIOS on the card not resetting the card correctly after the VM resets.

What do you think I should do about this?

Considering throwing in the towel and replacing the RX 5700 XT with my 3060 lol

@jb309817 so it didn’t get its own IOMMU group? If not, that’s not the card, it’s the motherboard…

I updated my bios to the latest version and now I cannot boot the VM at all without the original error

Error starting domain: internal error: QEMU unexpectedly closed the monitor (vm='win10'): 2024-07-01T17:44:53.493541Z qemu-system-x86_64: vfio: hot reset info failed: No such file or directory
2024-07-01T17:44:53.494340Z qemu-system-x86_64: GLib: ../glib/gmem.c:177: failed to allocate 18446744058182308292 bytes

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 71, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 107, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 57, in newfn
    ret = fn(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1428, in startup
    self._backend.create()
  File "/usr/lib64/python3.11/site-packages/libvirt.py", line 1379, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: internal error: QEMU unexpectedly closed the monitor (vm='win10'): 2024-07-01T17:44:53.493541Z qemu-system-x86_64: vfio: hot reset info failed: No such file or directory
2024-07-01T17:44:53.494340Z qemu-system-x86_64: GLib: ../glib/gmem.c:177: failed to allocate 18446744058182308292 bytes

I hate to say it, but I’m worried it’s some kind of hardware fault. I’m going to try Nvidia.

@jb309817 can you check the logs with journalctl -b | grep iommu? Are the devices being added to their own groups?
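
Besides the journal, the grouping can be read directly from /sys/kernel/iommu_groups, where each numbered directory has a devices subdirectory listing the PCI addresses in that group. A minimal sketch, with hypothetical function names:

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict:
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled or sysfs not available
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return dict(sorted(groups.items()))


def group_of(address: str, groups: dict):
    """Return the group number holding the given PCI address, or None."""
    for number, devices in groups.items():
        if address in devices:
            return number
    return None


if __name__ == "__main__":
    found = iommu_groups()
    for number, devices in found.items():
        print(f"group {number}: {' '.join(devices)}")
    # The GPU and its audio function from this thread.
    for address in ("0000:03:00.0", "0000:03:00.1"):
        print(address, "-> group", group_of(address, found))
```

If the GPU's group also contains unrelated devices, all of them generally have to be handed to vfio-pci together.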

Just to update this thread: I switched to my 3060, which tends to be fine passing through to the VM on first boot, but after a reboot of the VM I get a very similar error and the host must restart before I can start the VM again. Perhaps a QEMU bug?

2024-07-02T20:31:31.073373Z qemu-system-x86_64: vfio: Cannot reset device 0000:00:14.0, no available reset mechanism.
2024-07-02T20:31:31.073653Z qemu-system-x86_64: vfio: hot reset info failed: No space left on device
2024-07-02T20:31:31.075053Z qemu-system-x86_64: vfio: Cannot reset device 0000:00:14.0, no available reset mechanism.
2024-07-02T20:31:31.075335Z qemu-system-x86_64: GLib: ../glib/gmem.c:177: failed to allocate 18446744065142707292 bytes

So far it’s working well enough that I’m not going to put too much effort into resolving this, though I hope I can fix it later.

@jb309817 So if you run /sbin/lspci -nnk what is the device with the PCI ID of 0000:00:14.0?
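
Worth noting that the dmesg output earlier in the thread shows USB devices attached via usb-0000:00:14.0, so that address looks like a USB host controller rather than the GPU. To see whether the kernel knows any way to reset a device, newer kernels expose a reset_method attribute in sysfs next to the older reset file. The sketch below reads it; the function name is made up, and the assumption that a missing reset_method and a missing reset file together mean "no reset available" is mine, not something confirmed here.

```python
from pathlib import Path


def reset_methods(address: str, sysfs_root: str = "/sys/bus/pci/devices") -> list:
    """List the reset methods sysfs reports for a PCI device.

    Returns the entries of reset_method when that attribute exists,
    ["unknown"] when only the older 'reset' file is present, and an
    empty list when neither is there.
    """
    device = Path(sysfs_root) / address
    method_file = device / "reset_method"
    if method_file.is_file():
        return method_file.read_text().split()
    if (device / "reset").exists():
        return ["unknown"]
    return []


if __name__ == "__main__":
    address = "0000:00:14.0"  # the device named in the QEMU message
    methods = reset_methods(address)
    print(address, "reset methods:", ", ".join(methods) if methods else "none found")
```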

I seem to face a similar problem. GPU passthrough (Nvidia RTX 3080 Ti -> Win11) was working fine; today I ran a full distro update (and with that an update to QEMU 9) and could not start my VM (same error as jb309817, stating 'failed to allocate 18… bytes').
Side fact: the VM starts just fine with PCI passthrough when I remove my NVIDIA audio device, but I checked, and the two are (still) in the same IOMMU group.