Ceph rbd disk as backend for Xen instances

Hi all,

is it possible to use a Ceph cluster as storage backend for Xen instances (OpenStack Liberty)? I’m not sure where the problem lies yet, but I can’t launch instances with their root device in the rbd pool. I also tried to create a VM manually with xl create and this config:

disk = [ 'format=raw, vdev=xvda1, access=rw, backendtype=qdisk, target=rbd:images/0cd7201b-ea9e-459f-b603-a4c4ae93c601:id=openstack:conf=/etc/ceph/ceph.conf' ]

With this line I get the following error in /var/log/xen/qemu-dm-test-rbd.log:

xen be: qdisk-51713: error: Unknown protocol
xen be: qdisk-51713: initialise() failed

I also tried to attach the disk as xvdb via

xl block-attach my-instance format=raw, vdev=xvdb, access=ro, backendtype=qdisk, target=rbd:images/my-image:ceph-user

The result is the same for all approaches, attaching fails:

compute1:/tmp # xl block-list my-instance
Vdev  BE  handle state evt-ch ring-ref BE-path
51712 0   16     4     12     8        /local/domain/0/backend/vbd/16/51712
51728 0   16     3     14     1010     /local/domain/0/backend/qdisk/16/51728

So it seems to be some problem with backendtype=qdisk, I guess. Has anyone faced something similar?
By the way, using Ceph as backend for cinder and glance works fine; I only need nova to work. :-)

Thanks!

The first question I’d ask you is: why the choice of Ceph?
AFAIK, its main advantage is configuring a storage cluster across widely dispersed geographical sites, without needing to install an OS. Is that really your need and objective, or are you operating entirely within a single geographical site? IMO Ceph would hardly be the most appropriate choice for VM storage and would likely require massive tuning (fencing) to ensure that nodes holding the required data are geographically close.

As for your actual situation, how did you define your quorum disk?
Here is a RHEL doc which should apply…
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Configuration_Example_-_Oracle_HA_on_Cluster_Suite/qdisk_configuration.html

Also, here is a related RHEL document describing “considerations” for selecting a quorum disk; note the reference to the result as a “shared block device”. So, although I haven’t set up exactly what you’re describing, it suggests that steps similar to setting up something like iSCSI would be required… and that the result would be seen by OpenStack in that way.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-qdisk-considerations-CA.html

TSU

Hi TSU,

from what I can tell, Ceph is very much intended to run as a local cluster of inexpensive storage nodes. To me, Ceph looks more like a sophisticated “network RAID” storage solution.

The OP is looking into using rbd to provide block devices to virtual machines - I do not see any quorum disk requirement in that context. This works well for KVM instances, so the question is rather whether the patches provided by Jim Fehlig (see e.g. http://www.redhat.com/archives/libvir-list/2016-February/msg00903.html) are “complete” in the sense that this should already be working.
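
As a quick sanity check independent of Xen, the rbd layer itself can be exercised directly on the compute node; the pool, image and user names below are just the ones from the OP’s first post and may need adjusting:

rbd ls images --id openstack
qemu-img info rbd:images/0cd7201b-ea9e-459f-b603-a4c4ae93c601:id=openstack:conf=/etc/ceph/ceph.conf

If rbd ls works but qemu-img fails, that would point at the qemu build rather than at Ceph.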

This makes me believe that you were seeing some totally different implementation. The OP’s scenario is starting a Xen VM and specifying that the DomU’s disk is provided via the “rbd” protocol. This is an alternative to the iSCSI approach…

Regards,
Jens

Hello Jens,
My observation and evaluation of Ceph as a backend is based almost entirely on the lectures I heard from Sage Weil himself, the creator of Ceph. I’m not saying that Ceph can’t be implemented; I’m just pointing out what Ceph’s main unique capabilities are and asking whether it’s overkill for use purely as local-site network storage. That said, there can be good reasons to implement Ceph as local-site network storage, for instance if you envision eventually expanding beyond the local site or if you have on-site admins who specialize in Ceph.

The OP’s post described how he first tried to configure qdisk, which is “quorum disk.”

Yes, they are not exactly the same, but I am pointing out that the OpenStack configuration can likely support any method of attaching block storage. So, if one method isn’t working, consider another. Also, OpenStack doesn’t provide storage in ways specific to each type of virtualization technology; it implements a Cinder service that exposes storage in a standardized, agnostic way, so I doubt (although it may be possible) that the problem is rbd if the OP is using standard OpenStack methods. Of course, it’s also possible that the OP is deploying some custom version of OpenStack without Cinder, but that is another scenario unless the OP reveals that is what he did.

TSU

Hi TSU,

Thank you for sharing that information. The usual “marketing talk” typically goes along the lines of what the Ceph Advisory Board published (http://www.storagereview.com/ceph_community_forms_advisory_board):

(Highlighting by me) That’s why Eugen was asked to test Ceph as an alternative to a typical iSCSI solution for an Openstack cloud that he already had up & running.

I’m working with Eugen (the OP), who is prototyping an OpenStack cloud environment and trying to implement different scenarios. First of all, when using KVM on the compute nodes, things work as expected (Cinder is able to define the rbd resource, the Glance image is copied over to the rbd “disk”, and nova instantiates the KVM guest). That makes me conclude that it’s not rbd that’s throwing a fit.

When using Xen instead of KVM, the VM isn’t created because the virtual disk configuration is not acceptable to Xen (“unknown protocol”) - another pointer at a problem above the actual rbd layer. Looking around, he found the patches by Jim for libvirt and the libxl connector, which were published and included only two months ago. So Eugen is currently working his way through individual package updates, to see if a bleeding-edge libvirt version will help. But it’d be good to know if anyone has been able to get this combination (Xen accessing an rbd-based virtual disk) working at all: it may well be that this code path is still under development.
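
For reference, the exact versions in play can be checked on the compute node with something like the following (the package names are my guess at openSUSE’s libvirt packaging and may differ):

rpm -q libvirt libvirt-daemon-driver-libxl xen qemu
virsh version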

So it’s not about seeking alternatives like iSCSI, but about the current state of affairs when Xen is the requested hypervisor and Ceph is the desired back-end. This is not the first time that Eugen has run into implementation limits when it comes to using Xen - but it’s on his list, so he has to try to make it work.

Hope this sheds some light on the background of this thread.

Regards,
Jens

Thx for describing your project; it does provide some context for what you are trying to configure.

If it still falls within your project parameters, I’d suggest that you consider setting up your storage access as objects instead of as raw block devices. It seems to me that if you’re successful getting this to work with one virtualization technology, it’d be a lot more portable for use with other virt technologies.

Also,
If you have the time and resources, it might be nice if you could post a “cookbook” guide to setting this up on openSUSE (assuming you’re doing so, since you’re posting in this forum). I’ve found that following the standard documentation still does not ensure a successful setup (although it’s much better than a couple of years ago).

Because of current problems setting up “full” OpenStack on any version of openSUSE, I’ve posted a solution that is almost 100% guaranteed to work: install a default Devstack, which, once installed, can be extended, re-distributed to multiple machines and, if desired, switched to Neutron networking (from Nova networking). In other words, the admin gets an assured working basic setup which can be modified, step by step, to eventually resemble a “full” OpenStack.

Installing Devstack on any version of openSUSE takes minimal effort and can be fully working within a few hours using the scripts I created:
https://en.opensuse.org/User:Tsu2/openstack-install

TSU

Thinking about your situation further,
I suspect that the problem isn’t in ancillary services like libvirt but in fundamental support for rbd in the kernel you’re using.

I then started looking around for a command which would display kernel support for rbd and surprisingly found nothing. Unless you’re able to locate something, I have not found any evidence that rbd support is reported anywhere in the way that, for example, /proc/cpuinfo reports CPU features. If it’s built in rather than shipped as a loadable module, you also won’t find any “.ko” module files.
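
That said, a couple of quick checks might at least show whether the kernel itself knows about rbd - the paths below assume a standard openSUSE kernel layout and are guesses on my part:

modinfo rbd
grep CONFIG_BLK_DEV_RBD /boot/config-$(uname -r)

If modinfo finds nothing and the config option is neither y nor m, kernel-level rbd support really is absent.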

Beyond that, I think the most practical way to determine support is what Eugen did: use xl create to try to start up an instance using rbd. Eugen wasn’t clear in his original post, but his test should point to rbd storage with the least complication, i.e. stored on the same machine. The ceph configuration should also be as simple as possible and need not even point to a full system image, but it should be at least a bootable partition.

If Eugen really did point to as simple a configuration as possible, then I’d guess that the “unknown protocol” error really does mean that rbd support has not been built into the kernel, and anything else like libvirt is irrelevant.
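
One more thing worth ruling out: the qdisk backend is served by qemu, so “unknown protocol” could just as well mean that the qemu build lacks rbd support. Something along these lines should show it - the binary path is a guess and may differ on openSUSE:

qemu-img --help | grep -i 'supported formats'
ldd /usr/lib/xen/bin/qemu-system-i386 2>/dev/null | grep librbd

If rbd is missing from the formats list and librbd isn’t linked in, the problem sits in the qemu package rather than in the kernel.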

Assuming this, then:

  1. You should submit a feature request for adding rbd support to the xen kernel to https://bugzilla.opensuse.org

  2. You might try building your own Xen kernel, if you’re up to it, following this documentation:
    http://www.mad-hacking.net/documentation/linux/ha-cluster/storage-area-network/ceph-xen-domu.xml

  3. As I suggested in my previous post, some type of alternate connection to storage might be possible… at least in the short run until kernel rbd support is added. Object storage is the most portable and versatile, though not as efficient, and might still be an option on Ceph. I haven’t looked closely at using more common methods like NFS; I’m pretty sure it should be possible even on Ceph, but it likely means that you can connect only to Ceph storage on-site (not remote), which I assume is your scenario anyway (see the sketch below for one concrete variant).
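
One concrete variant of (3), assuming the kernel rbd client is available at all: map the image to a local block device with the kernel client and hand that device to Xen via the plain phy backend, bypassing qdisk entirely. Roughly (pool, image and user names taken from the earlier examples; rbd showmapped lists the resulting device):

rbd map images/0cd7201b-ea9e-459f-b603-a4c4ae93c601 --id openstack
xl block-attach my-instance format=raw, vdev=xvdb, access=ro, backendtype=phy, target=/dev/rbd0

I haven’t tested this combination myself, so treat it as a sketch rather than a recipe.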

This is an interesting topic to some degree… Personally, I strongly believe in direct-attached storage or high-throughput connections like to a SAN, and have paid less attention to network-attached storage, but with dedicated high-capacity gigabit network connections it can be made to work…

Good Luck,
TSU

Hi TSU,

thank you very much for digging into this topic, I really appreciate it!

The ceph configuration should also be as simple as possible and need not even point to a full system image, but it should be at least a bootable partition.

The Ceph setup I’m using is really simple; I’m new to this, so I just wanted to get it working somehow. It’s a single Ceph node with two OSDs and one monitor on a virtual machine - basically, that’s it.
First I tried to create a new VM booting from rbd; the xl config was adapted from the XML config nova shows in its nova-compute logs. Then I simplified the rbd test: I just tried to attach an image from rbd to a running instance via xl block-attach, but the result was the same.
I hope this clears up a little what I have tried so far.
Thanks for your suggestions, I will discuss them with Jens.

Regards,
Eugen

Update:
I was able to create an instance from an existing rbd image via xl create; the disk parameter was:

disk = [ 'backendtype=qdisk,vdev=xvda,target=rbd:vms/7b1faf41-997b-4678-b406-2a6d689da782_disk:id=openstack:conf=/etc/ceph/ceph.conf' ]
Attaching an rbd volume to a running instance via xl block-attach also worked:

xl block-attach 4 format=raw, vdev=xvdb, access=rw, backendtype=qdisk, target=rbd:images/0cd7201b-ea9e-459f-b603-a4c4ae93c601:id=openstack

Somehow it didn’t work with virsh; I used the same disk for the instance, but the VM gets stuck in the grub menu.

What I did to accomplish that was to upgrade libvirt from version 1.2.18 to 1.3.3, including Jim Fehlig’s patches for libxl. Once libvirt was running, my colleague noticed that the qemu-block-rbd package was missing rbd support. He rebuilt the packages with rbd enabled, and that was it!
Unfortunately, nova still can’t deal with it; I see some strange behaviour: when I launch an instance that is supposed to have its disk in ceph, nova-compute reports this error:

libvirtError: internal error: libxenlight failed to create new domain 'instance-00000133'

and in /var/log/libvirt/libxl/libxl-driver.log:

2016-04-13 14:32:02 CEST libxl: error: libxl_device.c:284:libxl__device_disk_set_backend: Disk vdev=xvda failed to stat: rbd:images/5dd118fc-e037-4e07-bfb6-fef44796a050_disk:id=openstack:key=AQC4zANXFpHKCxAARNEQF9RnffcRlHkRbTTR0Q==:auth_supported=cephx\;none:mon_host=192.168.124.132\:6789: No such file or directory
2016-04-13 14:32:02 CEST libxl: error: libxl_create.c:913:initiate_domain_create: Unable to set disk defaults for disk 0

Now the strange part is that during the launch process I see the respective disk in the ceph cluster, but it only exists for about two seconds and then disappears. This would explain why libxl can’t find it, but I don’t know why the disk disappears. I tried to increase the debug log level for ceph (for the mon), but I couldn’t find helpful information. Do you have any idea where to look or what could cause this behaviour?

Regards,
Eugen

Update 2:

Now I’m able to launch an instance with nova boot. :-) It’s not exactly how I expected it to work, but at least it does. I had to boot it from a volume using cinder, which I had configured to use ceph, too. And I had to edit nova’s driver.py to use a different bootloader for Xen guests:

compute1:~ # diff -u /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py.dist /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py
--- /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py.dist    2016-04-08 09:48:26.000000000 +0200
+++ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py        2016-04-13 16:11:56.026470037 +0200
@@ -4395,6 +4395,10 @@
             if guest.os_type == vm_mode.EXE:
                 guest.os_init_path = "/sbin/init"

+        if CONF.libvirt.virt_type == 'xen':
+            guest.os_kernel = '/usr/lib/grub2/x86_64-xen/grub.xen'
+
     def _conf_non_lxc_uml(self, virt_type, guest, root_device_name, rescue,
                     instance, inst_path, image_meta, disk_info):
         if rescue:

Is this somehow related to rbd layering and copy-on-write cloning? I assume that cinder and nova use different approaches to accessing rbd images; I’ll have to dig a little more.
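
One thing I’ll probably try next is to compare the disk definition that nova/libvirt generates with the one that worked manually, along these lines (the instance name is the one from the log above; nova-compute also logs the generated XML, which should help if the domain disappears too quickly for virsh):

virsh dumpxml instance-00000133 | grep -A 8 '<disk'

and then check whether the rbd disk ends up as type='network' with protocol='rbd' or as a plain file path, which would match the stat() error in libxl-driver.log.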

If you’re “getting stuck” displaying the grub menu, then you’re not reading the boot directory or partition.
If you’re getting the grub menu but are unable to proceed further, then you’re having problems reading the root and/or /home partitions.

The first error is a fundamental error reading the disk; you need to consider MBR/GPT and similar issues.
The second error is likely most easily addressed by creating an image that doesn’t have a separate /home partition (by default the boot files are in a directory, not a separate partition, anyway), which is something you should be doing in any case if your virtual disk is small (most likely when you’re deploying single-use instances).
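
If you want to see what is actually on that virtual disk without booting it, one option (assuming the kernel rbd client works on your compute node) would be to map the image and look at its partition table, e.g.:

rbd map vms/7b1faf41-997b-4678-b406-2a6d689da782_disk --id openstack
fdisk -l /dev/rbd0
rbd unmap /dev/rbd0

rbd map prints the actual device name, which may not be /dev/rbd0. That should quickly tell you whether you’re dealing with an MBR/GPT problem or a missing boot partition.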

At the moment I’m not sure why virsh (which means you are using libvirt to deploy, manage and destroy instances) should cause an rbd-specific error, or make virtual disks “disappear”…

HTH,
TSU