KVM partially and irregularly freezes the Windows 11 guest and the host system

CPU: AMD Ryzen 7 3700X 8-Core Processor
Memory: 32 GB

GPU 1 (host system): AMD Radeon RX 480 Graphics, monitors 1 and 2
GPU 2 (dedicated to the Win11 guest): AMD Radeon RX 480 Graphics, monitors 3 and 4

The same problem occurs when the same Win11 guest runs in a window on the host, without the dedicated GPU 2.

A zypper dup update changed something in the KVM/QEMU system. I can’t say when this happened because I hadn’t used the Windows guest for a while.
What I do know is that I have a snapshot from September where everything works. With the updates of the last four weeks, the guest and the host keep hanging.

It’s not that the guest uses up all the resources. On the contrary: when the guest drops from about 200% CPU to 10% CPU (qemu-system-x86), the whole system starts to hang. Even the mouse and keystrokes hang.
I have switched back and forth between the snapshots several times. It is definitely not the guest but the host, i.e. the Tumbleweed update.
I now have both snapshots, but I can’t narrow down what the problem might be. I can’t report a bug either because I don’t know what could be wrong.

Can anyone help me narrow it down?
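
One way to narrow this down might be to list which kernel, QEMU and libvirt packages the zypper dup runs actually installed since the working September snapshot. The sketch below filters the libzypp history log. It assumes the usual pipe-separated layout (date|action|name|version|arch|...); `pkg_changes`, the start date and the package-name prefixes are illustrative choices, not anything taken from this thread.

```shell
#!/bin/sh
# Sketch: list kernel/QEMU/libvirt installs recorded by zypper.
# Assumption: the log is pipe-separated: date|action|name|version|arch|...
# pkg_changes is a hypothetical helper name, not part of zypper.

pkg_changes() {
    # $1 = earliest date to show (YYYY-MM-DD); the log is read from stdin
    awk -F'|' -v since="$1" '
        /^#/ { next }    # skip comment lines
        $2 == "install" && substr($1, 1, 10) >= since &&
        $3 ~ /^(kernel-default|qemu|libvirt)/ {
            print substr($1, 1, 10), $3, $4
        }'
}

# Typical use on the host; the file is normally readable by root only.
if [ -r /var/log/zypp/history ]; then
    pkg_changes 2024-09-01 < /var/log/zypp/history
fi
```

Each output line has the form "date package version", so the packages that changed between the working and the failing state can be compared at a glance.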

I cannot help, but I can report the same issue without the complicated setup. I’ve checked the journal logs but nothing seems to have broken.

So it looks like a BUG.
The second GPU is completely inactive if you start the Windows guest in window mode, i.e. without GPU passthrough.
So in that case we have the same setup and should consider where to report the BUG. Kernel or KVM/QEMU?

@etron770 I’m not seeing it here with Windows 11 Pro. I do have an Nvidia K620 passed through, but I only connect externally over spice or via console, no monitor connected. I do have qemu-ovmf-x86_64 202402-1.1 locked for a Rancher Desktop bug.

qemu-ovmf-x86_64 version 202408-1.1
is working with the September 2024 rollback.

@etron770 maybe AMD related then, I’m on Intel Xeon and Nvidia…

It looks like a memory problem.
I logged into the host over ssh and saw that the swap is full.
The host system has 32 GB of RAM and the Windows guest has 16 GB.
There is no shared memory with the host.
The entire system hangs because all available memory, including swap, is full.
And that happens only with the October version of Tumbleweed.
Even though 2 GB of swap is very small for 32 GB of RAM, the host should not need to swap at all.
Instead of making the required memory available to the host and the guest, the host keeps it in buff/cache.
As expected, giving the Windows guest only 10 GB does not help either.
Windows guest with 16 GB: [screenshot: windows16Gb]
Windows guest with 10 GB: [screenshot: Windwos10Gb]
Windows guest inactive: [screenshot: NoWindows]
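
To compare the three states with numbers rather than screenshots, something like the following could be run before and after the guest starts. It is only a sketch: `mem_summary` is a made-up helper name, and the buff_cache figure is computed as Buffers + Cached + SReclaimable, which should roughly match what free reports in its buff/cache column.

```shell
#!/bin/sh
# Sketch: one-line memory summary from /proc/meminfo-style input.
# mem_summary is a hypothetical helper name.

mem_summary() {
    awk '
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "Buffers:"      { buffers = $2 }
        $1 == "Cached:"       { cached = $2 }
        $1 == "SReclaimable:" { sreclaim = $2 }
        $1 == "SwapTotal:"    { stotal = $2 }
        $1 == "SwapFree:"     { sfree = $2 }
        END {
            # /proc/meminfo values are in kB; convert to MiB
            printf "available=%dMiB buff_cache=%dMiB swap_used=%dMiB\n",
                   avail / 1024,
                   (buffers + cached + sreclaim) / 1024,
                   (stotal - sfree) / 1024
        }'
}

# Print one timestamped sample from the running system.
if [ -r /proc/meminfo ]; then
    printf '%s ' "$(date +%T)"
    mem_summary < /proc/meminfo
fi
```

Running it every few seconds (for example from a `while sleep 10` loop in an ssh session) would show whether buff/cache grows while available memory shrinks.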

@etron770 I have 6 cores/12 threads and 32GB of ram allocated to the Windows vm.

free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       4.4Gi        69Gi       114Mi        53Gi       121Gi
Swap:             0B          0B          0B

virsh start Windows_11_Pro 
Domain 'Windows_11_Pro' started

free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        36Gi        36Gi       124Mi        53Gi        88Gi
Swap:             0B          0B          0B

But what is the difference between the September version of KVM/QEMU/kernel and the current version?
The wait states show that something is being written again and again, wherever that may be.
When I start Windows without swap, the complete system hangs:
[screenshot: win16swapsoff]

If I still manage to switch the swap on, the swap fills up and I can work (slowly) again.
[screenshot: win16swapswitchedon]
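
To see whether the host is actively swapping while it stalls, the kernel's swap counters can be sampled twice and compared. This is a rough sketch, assuming the `pswpin` and `pswpout` counters in /proc/vmstat (both counted in pages); `swap_counts`, `swap_delta` and the 2-second interval are illustrative choices only.

```shell
#!/bin/sh
# Sketch: pages swapped in/out between two samples of /proc/vmstat.
# swap_counts and swap_delta are hypothetical helper names.

swap_counts() {
    # prints "pswpin pswpout" from /proc/vmstat-style input on stdin
    awk '$1 == "pswpin"  { pin = $2 }
         $1 == "pswpout" { pout = $2 }
         END { print pin + 0, pout + 0 }'
}

swap_delta() {
    # $1 $2 = first sample (in, out), $3 $4 = second sample (in, out)
    echo "in=$(( $3 - $1 )) out=$(( $4 - $2 ))"
}

if [ -r /proc/vmstat ]; then
    set -- $(swap_counts < /proc/vmstat)                 # first sample
    sleep 2
    set -- "$1" "$2" $(swap_counts < /proc/vmstat)       # append second sample
    swap_delta "$@"
fi
```

Large, steadily growing "out" values while the guest is idle would support the idea that the host is pushing memory to swap rather than shrinking buff/cache.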

@etron770 sure it’s not something wrong with the Windows vm?

If the buff/cache is dropped before Windows is started, the problem is temporarily solved:

echo 3 > /proc/sys/vm/drop_caches

Then start the VM.
But …
The system starts very quickly, but after a while the swap starts to fill up again and the wait states increase.
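
For anyone repeating this workaround, a slightly more careful version might look like the sketch below. It runs sync first, because only clean pages can be dropped. `drop_page_cache` is a made-up helper name, and the optional target argument exists only so the function can be dry-run against a scratch file.

```shell
#!/bin/sh
# Sketch of the drop_caches workaround; the real call needs root.
# drop_page_cache is a hypothetical helper name.

drop_page_cache() {
    # $1 = file to write to (defaults to the kernel's drop_caches control)
    target=${1:-/proc/sys/vm/drop_caches}
    sync                   # flush dirty pages first; only clean pages can be dropped
    echo 3 > "$target"     # 3 = free page cache plus dentries and inodes
}

# As root, before starting the guest:
#   drop_page_cache
#   virsh start <domain>
```

Dropping caches only discards clean cached data, so it should be harmless, but it treats the symptom and does not explain why the cache is not reclaimed on its own.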

Do you think it’s a BUG, and if so, where should it be reported?
It is completely absurd that the buff/cache, which was dropped before Windows started, keeps growing afterwards while the system pushes everything else into swap.
19 GB buff/cache and 12 GB swap:
[screenshot: bigbuff]

@etron770 Probably against qemu…

free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       4.0Gi       121Gi       140Mi       1.4Gi       121Gi
Swap:             0B          0B          0B

virsh start Windows_11_Pro 
Domain 'Windows_11_Pro' started

free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        36Gi        89Gi       142Mi       1.6Gi        89Gi
Swap:             0B          0B          0B

virsh shutdown Windows_11_Pro 
Domain 'Windows_11_Pro' is being shutdown
virsh list
 Id   Name             State
--------------------------------
 1    Windows_11_Pro   running
virsh list
 Id   Name   State
--------------------

free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       4.1Gi        80Gi       128Mi        41Gi       121Gi
Swap:             0B          0B          0B

But in your setup the buff/cache is smaller than the free memory.
In my case, the buff/cache grows and the free memory decreases,
even when no VM has been started, i.e. without qemu.
               total        used        free      shared  buff/cache   available
Mem:            31Gi       8,6Gi       2,9Gi       351Mi        20Gi        22Gi
Swap:          147Gi       787Mi       147Gi
Something is wrong with my system …

@etron770 something is leaking somewhere then…

The installation where the Windows guest runs without problems and the swap does not fill up uses kernel 6.10.9-1-default.
The only difference from the failing system is a zypper dup to the latest distribution.

From this point on, the swap fills up
kernel 6.10.9-1-default memory:
free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        21Gi       341Mi       127Mi        10Gi        10Gi
Swap:          157Gi       6,8Mi       157Gi

Probably solved

For historical reasons, the swap was too small in relation to the RAM. I neglected to adjust this over the years.
The new release then changed something, so that the limit of the swap was apparently exceeded.
I have increased the swap and so far the system is running smoothly.

I don’t understand why the buff/cache was not reduced. But it looks as if this was already the case in September with the release at that time.
I looked everywhere except at the memory and the swap.

Partially solved

Partially: I have to become root (su) and run
echo 3 > /proc/sys/vm/drop_caches
before starting the Windows guest, but I’m not sure if it always works.

The previous solution was only sometimes successful.
I booted with the 6.10.9-1-default kernel.
As you can see from the screenshot, there are no more wait states, the swap remains empty and the buff/cache does not increase.
qemu-system-x86 gets the CPU time it needs.
Even if I start a second virtual machine that requires more RAM than is available in the buff/cache, the system is slower but usable.
If I start a second virtual machine with less RAM than the free available memory, there is no noticeable difference.

The system has now been running for an hour with no change, after multiple virtual machine starts and stops, so I am marking the thread as resolved.
The rest would have to be solved by the kernel developers, am I right?

cat /etc/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20241104"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20241104"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
# CPE 2.3 format, boo#1217921
CPE_NAME="cpe:2.3:o:opensuse:tumbleweed:20241104:*:*:*:*:*:*:*"
#CPE 2.2 format
#CPE_NAME="cpe:/o:opensuse:tumbleweed:20241104"
BUG_REPORT_URL="https://bugzilla.opensuse.org"
SUPPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"

[screenshot: oldkernel]