libvirt/libxl: Kernel bug in netback.c causes VM to lose network connectivity!

Greetings!

For quite some time now I have been encountering repeated kernel failures that cause my primary domU to lose its network connectivity.
The kernel being used in dom0 is a 3.16.7-127.g5f448f7-xen from the kernel repo running on Xen 4.4.4_02-6.4, and the domU is running a 3.16.7-35-desktop kernel.

At first glance this appears to be related to virtual networking: the trace mentions the receiving side of the virtual network. When the bug hits, the CPU core currently handling virtual networking locks up, which breaks virtual network connectivity. On quite a few occasions a second kernel also locks up with a kernel bug when netback fails, but I have not been able to reproduce that. The kernel in the VM continues to operate normally; only the virtual network is lost, and ifconfig inside the VM still shows the network adapters. It behaves as if a virtual cable had been disconnected.

Now, is there a way to grab the output on tty10 and save it to a file so I can cut and paste the trace in question, and how do I figure out what’s going on when the kernel actually throws the bug? I’m going to install the debug symbols for said kernel, but I have no idea how to run a debugger on a misbehaving kernel…

One more thing: the VM is accessed either via XDMCP (with dom0 acting only as an X terminal) or via a virtual ttyS0 mapped to one of the ttys on the screen. XDMCP in particular produces a large volume of network traffic, which seems to trigger the bug. Things may run smoothly for several days before the bug is thrown, but it may also appear right after booting the VM.

It might be difficult to transfer a file to another machine without networking. I can think of two possible approaches…

If you have control of the server, you can create a shared folder on the system so that another machine (a host or another guest) can access the same folder. This uses the Plan 9 protocol but no networking.

Instructions for setup
https://en.opensuse.org/User:Tsu2/virtfs#Overview

Alternate way…
Create a new virtual disk, “attach” it (using the Guest properties), mount it in the Guest, and copy any files to this virtual disk.
Then, either simultaneously or only after disconnecting the disk, mount it in either a different Guest or in the Host. You should be able to mount it on a standard loop device:
https://en.opensuse.org/User:Tsu2/loop_devices
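
As a rough sketch of the virtual disk approach, assuming the disk is a plain image file and the Guest is managed through virsh: the helper name make_transfer_image, the image path, the Guest name "myguest" and the target xvdb are all placeholders, not anything taken from your setup. Detaching the disk before mounting it elsewhere avoids two systems writing to the same filesystem at once.

```shell
# Hypothetical helper: create a sparse image file of the given size.
make_transfer_image() {
    img=$1
    size=${2:-64M}
    truncate -s "$size" "$img"
}

# On the Host, as root (path and Guest name are placeholders):
#   make_transfer_image /var/lib/xen/images/transfer.img 64M
#   mkfs.ext4 -F /var/lib/xen/images/transfer.img
#   virsh attach-disk myguest /var/lib/xen/images/transfer.img xvdb
#
# In the Guest: mount /dev/xvdb /mnt, copy the files, then umount /mnt.
#
# Back on the Host, after detaching the disk:
#   virsh detach-disk myguest xvdb
#   mount -o loop /var/lib/xen/images/transfer.img /mnt
```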

Good luck, and post if you have any follow-up questions.
TSU

Seems like I’m slowly getting down to the core of the problem.
Some things in or around libvirt seem to have been fixed, because when this problem occurs (something with the virtual network’s ring buffer) the error is now caught and the virtual interface is disabled by libvirt.
Normally the network runs just fine indefinitely, but when I start Firefox or Konqueror, things tend to go haywire. I don’t know why, but these two tend to send bogus data across the virtual network, causing an error that eventually leads to the shutdown of the interface. When using Chromium (as I’m doing now), however, the virtual interface is not killed.

From what you describe, I doubt that the problem is within the Guest, but if you still want to filter and inspect system log entries, you can display the current journal with

journalctl

Or that of any previous boot; the following is an example for the immediately previous boot

journalctl -b -1

Similarly, you can inspect any entries that might exist in your Host OS.

So, you may not need to pass output to a tty to inspect connectivity issues.
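
To save a kernel trace to a file so it can be pasted into a post, something like the following might work. It is only a sketch: filter_netback is a made-up helper name, the search pattern is a guess at what the relevant lines contain, and messages from a previous boot are only available if the journal is stored persistently.

```shell
# Hypothetical helper: keep only log lines that mention netback or a
# vif interface (names such as vif1.0). Reads log text on stdin.
filter_netback() {
    grep -E -i 'netback|vif[0-9]+\.[0-9]+'
}

# Usage on the Host (dom0); the output paths are placeholders:
#   journalctl -k -b -1 --no-pager > /root/kernel-prev-boot.txt
#   filter_netback < /root/kernel-prev-boot.txt > /root/netback-lines.txt
```

The unfiltered file is usually the more useful one to post, since a kernel trace spans many lines that will not all match the pattern.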

I’d also recommend you investigate whether your BIOS/UEFI is fully updated, and that your CPU virtualization extensions are enabled.
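
On a Xen Host the CPU flags in /proc/cpuinfo are not always a reliable indicator, so one possible check is the virt_caps line printed by xl info. The helper below is a sketch under that assumption; has_hvm is a made-up name and it only looks at text passed in on stdin.

```shell
# Hypothetical helper: reads `xl info` output on stdin and succeeds
# if the virt_caps line lists "hvm" as a separate word.
has_hvm() {
    grep -E '^virt_caps[[:space:]]*:' | grep -q -w 'hvm'
}

# Usage on the Host (dom0), as root:
#   xl info | has_hvm && echo "HVM available" || echo "HVM not reported"
```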

TSU

I have been doing some research on the 'Net, though, and it seems that certain bogus data triggers this problem. Initially this caused the kernel in dom0 to throw a bug and so kill the virtual network, but recently this has been trapped, and when the situation occurs, libvirt takes down the affected interface.
From what I remember, the problem seems to be linked to the window manager and XDMCP: whenever it showed up, I had been using Firefox one way or another, either by working with it directly, by manipulating Firefox’s window, or even when the screen saver returned the screen to normal. For some so far unknown reason, data was produced that the virtual network couldn’t handle, and that produced the error. Since the problem doesn’t show up constantly, it must happen only under certain circumstances, but so far I haven’t found a way to reproduce it.

The other strange thing is that only Firefox and the Flash player seem to produce such corrupt data and send it via XDMCP (rather than Konqueror itself, it was the Flash player that caused the error while I was watching a video). With other programs I haven’t seen this problem occur at all.

Based on the little description you’ve provided,
I’d very highly suspect the culprit is Flash. I admit I’m heavily prejudiced whenever I hear of this dreaded technology, which is known to monopolize system resources and crush all other functionality, is at times riddled with vulnerabilities, and has its own “cookies” which, unlike ordinary cookies, can be and have been exploited by the unscrupulous.

Because of the many things that Flash can do, you’d have to look deeper into what is happening in your scenario… if it’s worth your time.
Probably the better solution, if possible, is to remove all Flash capabilities from your machine so that you simply avoid the issues Flash causes.

TSU