Does Xen host (VMM+Dom0) work on Leap 42.1 at all?

SamsonovAnton · March 15, 2016, 6:19pm

When I upgraded to openSUSE 42.1 on one machine with classic BIOS boot and GRUB2, it quickly turned out that Xen Dom0 does not boot with the newer 4.1 Linux kernel. However, trying it with an older 3.16 kernel left over from openSUSE 13.2 did the trick, and that was acceptable for me (although I now have to take care not to purge that kernel accidentally).

Today I upgraded another machine, installing from scratch in UEFI mode with GRUB2-EFI, which chainloads xen.efi, and it appears to crash right after the Dom0 kernel receives control: at first, Xen prints some messages in a 40x25-era font, then the screen goes blank, and after a few seconds the computer reboots. Manually adding the 3.16 kernel to the list of Xen options makes no difference.

So the question is: is it just me, or is Xen Dom0 globally broken in openSUSE Leap 42.1 as of now? Xen is not a common topic here, I see, which makes me wonder whether such setups are tested at all ([513957], [#936418](https://bugzilla.opensuse.org/show_bug.cgi?id=936418), [#967862](https://bugzilla.opensuse.org/show_bug.cgi?id=967862), [#912566](https://bugzilla.opensuse.org/show_bug.cgi?id=912566)). If anyone has been lucky enough to use it successfully, please let me know.

tsu2 · March 16, 2016, 3:09pm

Although I rarely touch Xen nowadays, probably the big question is how you upgraded your machines to Leap (online using “zypper dup” or offline using a DVD?).

If you did an offline upgrade, it might have been critically important to update your system immediately before a reboot using the following

zypper up

An online upgrade probably wouldn’t need an update if you included the oss update repo during the upgrade… else you’d again have to update your system manually as a separate step.

TSU

SamsonovAnton · March 16, 2016, 4:53pm

As for the current issue on a UEFI machine (with Leap 42.1 installed from scratch and Xen added later online by YaST), it appears that there is an EFI-related bug in either Xen itself or the Linux kernel, because the very same images can be successfully started from classic CSM mode. I filed a report: [#971386](https://bugzilla.opensuse.org/show_bug.cgi?id=971386).

Having seen that a serial console is indeed an invaluable logging tool for such cases, I should no doubt apply that technique to the previous (legacy BIOS) install as well, to see why it cannot boot with 4.x kernels. Perhaps the log will give some clues as to where the problem actually stems from. I will update as soon as I have new info.
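
For anyone wanting to try the same technique on a GRUB2 (legacy BIOS) install, a minimal sketch of the relevant lines in /etc/default/grub might look like the following. It assumes the port is visible to Xen as com1 at 115200 baud with 8N1 framing; those values are illustrative and not taken from the setup described in this thread, and an add-on card would need additional com1 parameters that are not shown here.

```shell
# /etc/default/grub (fragment): send Xen and Dom0 console output to a serial port.
# Assumption: the port is Xen's com1, 115200 baud, 8 data bits, no parity, 1 stop bit.

# Options for the Xen hypervisor itself (log to both serial and VGA).
GRUB_CMDLINE_XEN_DEFAULT="com1=115200,8n1 console=com1,vga loglvl=all guest_loglvl=all"

# Options for the Dom0 kernel when booted under Xen; hvc0 is the Xen virtual console.
# Note: this replaces the usual default kernel options for the Xen menu entries.
GRUB_CMDLINE_LINUX_XEN_REPLACE_DEFAULT="console=hvc0"

# Afterwards regenerate the boot menu (as root):
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

The receiving machine then only needs a terminal program set to the same speed and framing, with its output saved to a file.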

But I would still like to hear about the out-of-the-box experience of other Xen users on Leap 42.1. Does it work for you? Which adjustments did you need to make, if any?

eblock · April 7, 2016, 2:53pm

Hi,

I was able to successfully deploy an OpenStack Liberty environment consisting of three nodes (1 controller, 2 compute nodes as Xen hosts), all of them running on Leap 42.1. But I didn’t upgrade or anything; I installed Leap from scratch on the compute nodes. So I can’t tell what you did “wrong”, but I can confirm that Leap works as a Xen host.

Good luck!

tsu2 · April 13, 2016, 5:12pm

An FYI:
As of about 6 months ago, when I was exploring using openSUSE as a host for a Liberty OpenStack, one of the many things I ran into was that I could use Leap on Compute nodes (confirming eblock’s post) but could not for network and controller nodes; I had to build them on 13.2… due to missing and unavailable dependencies. I don’t know if that has changed, but when I reported these missing dependencies, first attempts to build them failed.

TSU

eblock · April 18, 2016, 3:01pm

Hi,

> I could use LEAP on Compute nodes (confirming eblock’s post) but could not for network and controller nodes, I had to build them on 13.2… due to missing and unavailable dependencies. I don’t know if that has changed

I can confirm that it has changed. Until last week I had a 4 node environment (1 controller, 2 computes, 1 storage node) based on Liberty, then I upgraded to Mitaka, and all my nodes run on Leap.

Regards,
Eugen

SamsonovAnton · May 28, 2016, 5:28pm

Status update.

Regarding the UEFI machine, as can be seen in the mentioned ticket, SUSE developers thought that it was a UEFI firmware bug (rather than a Xen / Linux issue) associated with EFI runtime services, because the backtrace clearly showed that the penultimate function was efi_rs_enter(), while the topmost frame was pointing at address 0x0…8, as if someone in the UEFI code wanted to call an internal function located at somePtr + sizeof(uintptr_t) but used NULL for somePtr. This was supported by the fact that starting Xen with “efi=no-rs” allowed bypassing the issue, although at the price of not seeing some early messages. However, after updating the kernel from 4.1.15 to 4.1.20 that trick stopped working, even with 4.1.15, and the recent update to 4.1.21 did not improve things. As for reporting the bug to the hardware manufacturer, I could not find a way to contact Dell about such issues; a message posted on the community forums went unnoticed.

Regarding the BIOS-only machine, it was a mistake to think that booting Xen 4.5 from openSUSE 42.1 with a 3.x kernel from openSUSE 13.2 helped at all, because such a combination does not actually allow starting VMs due to a mismatch between the hypervisor and its management tools in Dom0. Inspired by how fruitful the serial console logging technique had been, I expected a quick result this time as well, but in vain. Unfortunately, that machine did not have an onboard RS-232 port (not even an internal header), so I had to use PCI add-on cards; luckily, Xen supports those, unlike USB-to-RS232 adapters.

But each time I tried, the log file captured by the remote machine was incomplete: it was always cut off some 10 to 15 kilobytes after startup, although no communication errors ever happened during test transmissions from the fully booted system. At first I thought that the data rate simply exceeded the line bandwidth and tried to play with the baud rate, but it turned out that logging is actually synchronous: when the speed was lowered from 115200 to 9600 or even 2400 bits per second, the whole boot process just got slower, but no data was dropped until the very same limit of a dozen kilobytes. That also showed that the high speed was not the cause of the interruption, although, considering how unreliable RS-232 generally is, I would not be surprised if line noise were to blame. I tried 2 different PCI cards with distinct RS-232 chipsets, as well as 2 different remote machines with different southbridges (NM10 and H81), with no luck. I would suspect Xen itself of stopping the serial logging on the transmitter side, but the earlier experience shows that such logging does continue to the end.
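
To put the baud-rate experiment into numbers, here is a rough sketch of how long about 15 kilobytes take to send at each of the tested speeds. It assumes 8N1 framing (10 bits on the wire per byte) and takes 15 KB as 15360 bytes; both are assumptions, and serial_seconds is a made-up helper name.

```shell
#!/bin/sh
# Whole seconds needed to send BYTES at BAUD bits per second.
# Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop = 10 bits per byte.
serial_seconds() {
    bytes=$1
    baud=$2
    echo $(( bytes * 10 / baud ))   # integer division, fractions are dropped
}

# The three line speeds mentioned in the post.
for rate in 115200 9600 2400; do
    echo "$rate bit/s: about $(serial_seconds 15360 "$rate") s for 15360 bytes"
done
```

If the line were simply too slow, the amount of captured data should change with the rate; a cutoff at the same byte count at every speed points to something other than bandwidth.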

Today I had a sudden insight about the former machine. While investigating text-mode YaST crashes ([#972783](https://bugzilla.opensuse.org/show_bug.cgi?id=972783)), I found out that some packages were newer than their mainline versions offered in the Leap 42.1 Update repository. That was because I tried to “preview” Tumbleweed several months ago: although I did not actually perform the switch and just reverted the repository settings after realizing that such an “upgrade” would actually downgrade most software titles, perhaps something went wrong and several packages got switched anyway (some months later I repeated the procedure on another machine without any side effects). One of those mixed-up packages was Xen: while xen-tools stayed at version 4.5.2, the xen hypervisor itself got upgraded to 4.6.0. As switching the packages back to their mainline versions did solve the problem with YaST, I hoped it would also fix Xen, but it did not. Even after forcibly reinstalling all Xen packages, the kernel hangs, or at least stops printing messages, very early: 1.5 seconds after it starts. I still cannot fully capture even that short period over the Xen virtual console redirected to the serial port.

tsu2 · May 29, 2016, 3:35am

When you switched from TW to Leap, you should have run a “zypper dup” after removing all your TW repos, which would have ensured package version consistency. A “zypper up” would have been insufficient, particularly because you needed packages to be downgraded.

In fact, if you suspect you might still have version inconsistencies anywhere in your system, you should run a “zypper dup” immediately for extra assurance that those types of problems are addressed.
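
A quick way to look for the kind of mismatch reported earlier in this thread (xen at 4.6.0 while xen-tools stayed at 4.5.2) is to compare the major.minor series of all installed Xen packages. The sketch below is only an illustration: xen_versions_consistent is a made-up helper, not a zypper or rpm feature, and it assumes that matching major.minor numbers are a good enough proxy for consistency.

```shell
#!/bin/sh
# Reads "name version" lines on stdin and succeeds only if every package
# belongs to the same major.minor series. Empty input counts as inconsistent.
# Hypothetical helper for illustration only.
xen_versions_consistent() {
    awk '{ split($2, v, "."); seen[v[1] "." v[2]] = 1 }
         END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Sample data: the versions reported earlier in this thread.
if printf 'xen 4.6.0\nxen-tools 4.5.2\n' | xen_versions_consistent; then
    echo "Xen packages look consistent"
else
    echo "Xen packages come from different series"
fi

# On a real system the input would come from rpm:
#   rpm -qa --qf '%{NAME} %{VERSION}\n' 'xen*' | xen_versions_consistent
# and the repositories currently in use can be listed with:
#   zypper lr -u
```

A check like this only detects the symptom; bringing the packages back in line is still the job of “zypper dup” against the right repositories.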

I’d be surprised if anyone wrote their logging to be synchronous; it’s unusual enough to wonder whether there is a special reason for it or whether it is just an oversight. You may want to ask some Xen experts whether you have set up logging properly, or whether the logging you’re doing is really meant to work that way.

TSU

SamsonovAnton · March 5, 2017, 2:25pm

Status report.

After updating the BIOS machine to openSUSE 42.2 right upon its release, the new Xen version 4.7 started working just fine without any manual intervention. (Unfortunately, it also started exhibiting SATA connectivity issues not present in bare-metal sessions, randomly forcing the entire filesystem to go read-only. But that is another story.)
After updating the UEFI machine to openSUSE 42.2 recently, the new Xen at least regained its ability to function without EFI runtime services (“efi=no-rs”).

So I consider the mission accomplished: it looks like openSUSE 42.1 was simply “not very lucky” for Xen.

However, although the Linux kernel 4.4 shipped with openSUSE 42.2 claims to support PVH mode (CONFIG_XEN_PVH=y), booting Dom0 with “pvh=1 dom0pvh=1” results in a hang. Perhaps PVHv2 (“dom0=pvh”) will make a difference, but we are unlikely to see it until openSUSE 42.3.