QEMU/KVM: virtiofs causing major freeze problem

Quite recently there was apparently a major update to virtiofs, because basic functionality, i.e. sharing a folder from the host into the guest, now leads to some kind of filesystem freeze. The UI itself seems to keep working, but as soon as any file or folder is accessed things freeze. The mouse moves and the UI gets updated (a clock, for example), but that's it. I cannot even force a shutdown from KVM; a force reset is currently the only way I've found to get the VM operational again. This led me to think the problem is related to the filesystem.

Why could host <> guest sharing be the issue? Well, transfers from guest to host stop, and another VM which does not have such a share keeps running fine. This doesn't happen immediately. It's difficult to say how long it actually takes, but it is several hours - most likely more than 10 h. As my server runs 24/7 I cannot watch it the whole time.

I've updated the packages to the latest available on both the host and guest side, but no change. I've also updated the fstab mount entry to the latest syntax I've found, without luck:

/data /data virtiofs defaults 0 2

It used to be:
/data /data virtiofs rw,noatime,_netdev 0 2
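
As a quick sanity check that rules out the fstab entry itself, the share can also be mounted by hand (a minimal sketch, assuming the share tag is /data as in the lines above):

# manual mount of the virtiofs share, bypassing fstab
mount -t virtiofs /data /data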

I've changed the cache mode of the VM's virtual disk from "Hypervisor default" to "writeback". No impact, as expected, because that is not the troublemaker.

And before anyone asks - no, I have not changed any settings before this started. It started after an update to both guest and host (I usually update guest and host at the same time), and this setup had been working for a long time. The same setup also worked earlier, before I had to recreate this VM (about a year ago).

I cannot find anything on the net related to this problem. Maybe someone here has a clue what is happening and why?

I am having similar issues with recent Tumbleweed updates. My host is running Tumbleweed release 20230701 and my guest is running openSUSE Leap 15.4. On the host I have tried configuring the virtiofs shares through virt-manager using the built-in virtiofsd, and I have also tried using an externally launched virtiofsd-rust (version 1.4.0).

I have not yet identified what causes the guest to hang; it seems somewhat random. Only one of the three virtiofs shares hangs up. It seems to happen after a few hours of heavy I/O, but I have not yet tested whether it will also crash while idle.

When I say “hang”, I do not mean the entire guest OS - the UI remains responsive. I can open terminal and run “ls /mnt/virtiofs_share_1” just fine, but when I try “ls /mnt/virtiofs_share_2” the command hangs indefinitely. Also, the application that is using virtiofs_share_2 is completely frozen and cannot be used at all. I cannot unmount virtiofs_share_2 unless I use “umount -lf” which does not actually unmount it and applications using the share remain frozen. I cannot shutdown the system normally because the shutdown sequence hangs when it gets to the unmounting phase.
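
When it is in that state, a rough way to see what is actually blocked is to look for processes stuck in uninterruptible sleep (a sketch only, using the same example mount point as above):

# processes in state D (uninterruptible sleep) are the ones stuck on the hung share
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# show which processes hold the mount open; this may itself block if the mount is fully wedged
fuser -vm /mnt/virtiofs_share_2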

I have a Windows guest that is also using the same three virtiofs shares (with a separate instance of externally-launched virtiofsd-rust), and the shares are working properly there with no issues. When I force reboot the opensuse guest, the shares are working again with no issues until it hangs up again.

A couple months ago, all of this was running stable. All of this headache started fairly recently, and I am working to narrow down the root cause and eliminate it. Again, I suspect it’s an issue with the guest virtiofs driver, not the host. The virtiofs filesystem driver seems to be hanging up for this one share, and there is no timeout.

I would appreciate some help in further troubleshooting. It would be very helpful just to have an effective way to “force unmount / force kill” the virtiofs driver for the share that is hung up, so I can close my applications and reboot normally. When I am in this state, the only thing I can do is “force reset” from the host virt-manager.

I can post my full libvirt configuration or any other necessary information upon request. Thanks.

Very similar symptoms in my case, although when the system hangs from the I/O point of view I'm still able to restart the VM. It takes ages because the share cannot be unmounted etc., but eventually, after a number of timeouts, it reboots and then it's fine for a few hours.

I have a major issue with a Win7 VM. It used to be troublesome, i.e. hanging itself and also the host, but I got that much improved: after changing SATA to virtiofs (with SATA it crashed pretty much every time) and reducing the number of CPUs it started to work OK-ish and I was able to use the Windows software I needed. Yesterday I tried to start the Win7 VM, with no luck at all. After many different changes I noticed that with the CD-ROMs disabled (the Win7 image and the virtiofs drivers) the VM no longer always hung the host, but simply hung itself and crashed after a while. I was able to test various changes on the CPU side and also on the network side, but none made any difference - or at least I could not tell. In the end I was not able to start the Win7 VM even once and gave up.

What always happens in this hang/freeze situation is the following:

  • host and guest systems are either completely frozen or almost frozen, i.e. the mouse pointer moves a bit every few seconds (5-10 s). No clicks get through, so it is not possible to close the VM window, for example.
  • before the freeze the HD LED blinks according to disk access; a few seconds before the freeze it lights up completely, i.e. stays statically ON, and there is no blinking afterwards, which indicates heavy I/O traffic. There are no means to study this further since I'm not even able to log in to the host via SSH. Alt+SysRq+e is the salvation (see the note on enabling SysRq right after this list).
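
As a side note, the SysRq keys only work if they are enabled on the host. A minimal sketch for making sure of that (the sysctl.d file name is just an example):

# enable all SysRq functions for the running kernel
echo 1 > /proc/sys/kernel/sysrq
# make the setting persistent across reboots
echo "kernel.sysrq = 1" > /etc/sysctl.d/90-sysrq.conf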

Based on all of the above I'm quite sure this is related to virtiofs one way or another. I am not sure whether it is the guest, the host, or both that is causing this. While previously working around the Win7 freeze issue I noticed that disabling the network (virtio as well) helped to get the Win7 VM started. That is a bit problematic, though, since the network is often needed, so I had to enable it again. In yesterday's tests disabling the NIC had no impact.

Regarding the original post about the problem: there have been zero changes despite updates. I also tried changing the window manager to Xfce to see whether lower resource usage compared to KDE would make any difference, but unfortunately it had no impact. Expected, but I still hoped for a change.

Based on instructions from the openSUSE documentation: Virtualization Guide | openSUSE Leap 15.5

A.3.6 Disable MSR for Microsoft Windows guests

options kvm ignore_msrs=1

This caused major issues with the Linux VM. I didn't even have time to test the Windows VM, so it's difficult to say whether this is usable…

@paju-21 Hmmm, not here with either Windows 10 or 11 Pro.

I have the following configured:

/etc/modprobe.d/10-kvm.conf

options kvm ignore_msrs=1
options kvm report_ignored_msrs=0
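
A quick way to confirm the options were actually picked up (after reloading the kvm module or rebooting) is to read them back from sysfs; just a sanity-check sketch:

cat /sys/module/kvm/parameters/ignore_msrs
cat /sys/module/kvm/parameters/report_ignored_msrs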

After some experimentation I may have found a solution. I read this post:

If virtiofsd is single threaded or does not have a thread pool just
to handle requests which can block for a long time, virtiofsd will
stop processing new requests and virtiofs will come to a halt.

This gave me an idea. I looked at the options for virtiofsd-rust and found this one:

--thread-pool-size <thread-pool-size>

Maximum thread pool size. A value of “0” disables the pool.

Default: 0.

I am using a script to launch virtiofsd when the guest OS is started.

Here is the XML configuration for the virtiofs share inside of virt-manager:

<filesystem type="mount" accessmode="passthrough">
  <driver type="virtiofs" queue="1024"/>
  <source socket="/tmp/virtiofsd-opensuse154-trafalgar"/>
  <target dir="trafalgar"/>
  <alias name="fs1"/>
  <address type="pci" domain="0x0000" bus="0x0e" slot="0x00" function="0x0"/>
</filesystem>

I added --thread-pool-size=64 to the script:

virtiofsd --socket-path /tmp/virtiofsd-opensuse154-trafalgar --shared-dir /mnt/trafalgar --log-level trace --inode-file-handles=mandatory --posix-acl --thread-pool-size=64

That was about 3 days ago. The guest has been running stable and none of the shares are hung up.

Maybe also relevant:
https://bugzilla.opensuse.org/show_bug.cgi?id=1212942

Oh, one other thing. You mention you have trouble with Windows guests using the virtiofs shares as well. Several months ago I was having a lot of issues with my Windows guest; the shares were disappearing for no apparent reason. I think it was running out of file descriptors when scanning through all the files on the large virtiofs shares. That was when I started using virtiofsd-rust on the host - it had some options that the base virtiofsd did not. This is the option that fixed my issue with the Windows guest:

--inode-file-handles=<inode-file-handles>

When to use file handles to reference inodes instead of O_PATH file descriptors (never, prefer, mandatory).

  • never: Never use file handles, always use O_PATH file descriptors.
  • prefer: Attempt to generate file handles, but fall back to O_PATH file descriptors where the underlying filesystem does not support file handles or CAP_DAC_READ_SEARCH is not available. Useful when there are various different filesystems under the shared directory and some of them do not support file handles.
  • mandatory: Always use file handles. It will fail if the underlying filesystem does not support file handles or CAP_DAC_READ_SEARCH is not available.

Using file handles reduces the number of file descriptors virtiofsd keeps open, which is not only helpful with resources, but may also be important in cases where virtiofsd should only have file descriptors open for files that are open in the guest, e.g. to get around bad interactions with NFS’s silly renaming (see NFS FAQ, Section D2: “What is a “silly rename”?”).

Default: never.
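
As a rough illustration of the file-descriptor point above, the number of descriptors a running virtiofsd instance holds can be counted like this (a sketch; the pgrep pattern is just an example, adjust it to the actual process):

# count open file descriptors of the first matching virtiofsd process
ls /proc/$(pgrep -f virtiofsd | head -n1)/fd | wc -l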

And my script:

virtiofsd --socket-path /tmp/virtiofsd-win10-T --shared-dir /mnt/trafalgar --cache never --inode-file-handles=mandatory

I added both lines and now it works. Without the 2nd line (the one that turns off reporting of the ignored MSRs) the Linux guest had issues.

Also, I was finally able to start the Win7 guest.

Thx for this.

I haven't had such issues because I've been using Samba for sharing with Windows and haven't noticed issues there. Although I use the network shares very, very little from the Windows guest.

This version of virtiofsd written in Rust seems to be the default one in the openSUSE repository - or am I mistaken?

This is interesting! I didn't have the queue="1024" parameter, so it's now added. But the startup parameters are not yet in use.

Can you share the script (and the order and location from which it is executed)?

Yes, it does look like virtiofsd-rust is the default now. I checked '/usr/libexec/virtiofsd --version' and it is 1.6.1, which is the latest available from the GitLab site. Six months ago this was not the case: I needed the option '--inode-file-handles' and it was not available in the default version, so I downloaded it manually from the GitLab site and used it in my scripts. It appears I will no longer need to do that. Thanks for pointing that out.
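
For reference, this is roughly how the shipped build can be checked (the rpm query is just an extra, optional check):

/usr/libexec/virtiofsd --version
rpm -qf /usr/libexec/virtiofsd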

As for how the scripts are executed, I create a script named ‘start_viofs_all.sh’ in the location ‘/etc/libvirt/hooks/qemu.d/<vm_guest_name>/prepare/begin’.

‘start_viofs_all.sh’ starts the virtiofsd processes for each of the shared folders for each guest vm. The processes each run inside of a screen session. Permissions are set on the socket and pid files so that qemu can access them.

#!/bin/bash
# launch one virtiofsd wrapper script per share, each in a detached screen session
screen -d -m /root/start_<vm_guest_name>viofs<share_name_1>.sh
screen -d -m /root/start_<vm_guest_name>viofs<share_name_2>.sh
screen -d -m /root/start_<vm_guest_name>viofs<share_name_3>.sh
# give the daemons a moment to create their sockets, then make them accessible to qemu
sleep 2
chown qemu:qemu /tmp/virtiofsd-<vm_guest_name>*
chmod 0600 /tmp/virtiofsd-<vm_guest_name>*

Here is one of the scripts that actually runs the virtiofsd process, with the desired options, for one of the shares:

#!/bin/bash
# pre-create the socket file with restrictive permissions, then run virtiofsd for this share
touch /tmp/virtiofsd-<vm_guest_name>-<share_name_1>
chmod 0600 /tmp/virtiofsd-<vm_guest_name>-<share_name_1>
/usr/libexec/virtiofsd --socket-path /tmp/virtiofsd-<vm_guest_name>-<share_name_1> --shared-dir /mnt/<share_name_1> --log-level trace --inode-file-handles=mandatory --posix-acl --thread-pool-size=64

Alright, I tried using the built-in virtiofsd at ‘/usr/libexec/virtiofsd’ which is version “virtiofsd backend 1.6.1” and I found it to be unstable. The host processes were crashing under heavy disk I/O from the guest. I have switched back to version “virtiofsd backend 1.4.0” which has been working for me for the last several months.

This problem still persists and I'm not very eager to go back to an old version of virtiofsd. Any idea if there is going to be an official solution?

I was reading the official bug report related to this but cannot find it anymore. The cause had been found (something related to file caching when the share is also accessed directly on the host while the VM uses it - I don't recall exactly).

Finally found the bug report: Freezing processes when accessing virtiofsd share (#101) · Issues · virtio-fs / virtiofsd · GitLab

Tumbleweed is using version 1.7.2-… and one comment says there are no problems anymore with version 1.8. The latest is now 1.10.1, and two other versions exist in between: 1.9 and 1.10.

It would be good to get this updated in Tumbleweed (from a 5-month-old release to the 1-week-old one)…

@paju-21 Hmmm, not here… I have 1.10.1-1.1. When did you last do a zypper -vvv dup?

1 or 2 days ago.

rpm -qa | grep -i virtiofs
virtiofsd-1.7.2-1.2.x86_64

Edit: interesting… YaST shows 1.7.2-1.2 x86_64 installed, but the OSS repo shows 1.10.1-1.1 x86_64.

bor@tw:~> zypper se -s virtiofsd
Loading repository data...
Reading installed packages...

S | Name      | Type    | Version    | Arch   | Repository
--+-----------+---------+------------+--------+-------------------------------
i | virtiofsd | package | 1.7.2-1.2  | x86_64 | (System Packages)
v | virtiofsd | package | 1.10.1-1.1 | x86_64 | openSUSE-20221216-0 (20240131)
bor@tw:~> 

The update was today, so zypper dup to today’s snapshot

The 1.10.1 version is now installed, along with today's other updates. Hopefully the problem is finally gone.

Why did zypper not update this to the more recent version when it was available in another enabled repository?

I checked with zypper list-updates --all and there are several more recent versions available - although mostly only the numbers after the dash differ (none with such a major version difference as in this case). 278 packages, to be exact.
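
If only this one package needed to be pulled ahead of a full dup, something like the following could work (the repository alias is an assumption; check the real one with zypper lr):

# see which repositories offer which versions
zypper se -s virtiofsd
# install the newer build from a specific repository
zypper in --from repo-oss virtiofsd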

Edit: And no… the problem still exists. The VM froze again, maybe even faster than before.