HA cluster from 2 servers with Xen virtual machines - what would you recommend?

I am trying to set up a high-availability cluster using 2 servers with the openSUSE Xen solution. Is this an effective approach?

Would you recommend this direction, and is there a resource with deeper details?

Or would you prefer to use XCP, the cloud solution from xen.org? XCP seems to be quite advanced, but when you read the manuals, they insist on using only “homogeneous” servers - with processors of the same model and parameters - something a bit rare in real life.

Thank you in advance for any recommendations, advice and experience,

Hynek

I do not know how to answer your question, as it sounds beyond what we are trying to do here and more like something you could do with SUSE Linux Enterprise Server (SLES). I would start looking here, I think: Linux Server | SUSE Linux Enterprise Server

Also, I did find an interesting read, a SUSE and HP document you can find here: http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-5475ENW.pdf
Thank You,

James,
thank you very much for your answer and the link to the HP document. Yes, SLES might be more relevant for HA solutions, and we used it for several years as well. There are two reasons why we are trying to develop a solution based on openSUSE: 1) our server supports academic nonprofit users at a university, with a group of young, dynamic people with strong sympathies for free-software initiatives; 2) openSUSE appears to be more lively than SLES, and its repositories contain a group of very new and promising software packages (like OpenNebula, etc.). Interestingly, no cloud HA solution based on openSUSE has been described yet.
Thank you,
Hynek

Recently I had to accomplish a similar task — to create a cluster for a database-backed web application (quite a common scenario). This setup can be split into 4 layers:

  1. multiple instances of infrastructure services (DNS, mail) with failover support based on multiple DNS entries, although this can also be load-balanced (see the next layer);
  2. a redundant load-balancer (the entry point for incoming requests);
  3. multiple instances of a web server;
  4. a clustered database.

This scheme can be implemented on any number of nodes, starting with 2; you may configure everything on a single machine, and then just duplicate all VMs.

Assume we have the following address plan:
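
(The hostnames and the 10.0.0.20/24 virtual IP below are the ones used in the configuration examples that follow; substitute your own names and addresses.)

  • balancer.example.com — the shared virtual IP (10.0.0.20, netmask 255.255.255.0), held via CARP by the two load-balancer DomUs balancer1 and balancer2;
  • app.example.com — the public web entry point, resolving to the balancer address, behind which the web-server DomUs app1.example.com and app2.example.com sit;
  • data.example.com — the database entry point, again resolving to the balancer address and forwarded to the MySQL nodes data1.example.com and data2.example.com;
  • mgm1/mgm2.example.com and ndb1/ndb2.example.com — only needed if the MySQL Cluster management and data-storage nodes run in separate DomUs.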

Layer 2. Setting up NetBSD paravirtual DomU for high availability and load-balancing

With Xen virtualization, you can employ almost any guest system for any of those layers, including openSUSE 11.x and 12.x, but in my opinion NetBSD is the easiest and most lightweight solution for the load-balancer part, as it has built-in kernel-level support for the CARP protocol (a user-space daemon, ucarp, is also available), which provides failover for the load-balancers themselves and a common virtual IP address. NetBSD occupies less disk space (no more than 1 GB), consumes less memory (64 MB will be more than enough), and boots in a matter of seconds. The other component of this solution is HAProxy, which will periodically check the HTTP and MySQL nodes for availability (any generic TCP service can be checked, too) and distribute requests between live nodes.

First, you need to install NetBSD in paravirtualized mode as usual, using the netbsd-INSTALL_XEN3_DOMU kernel, then boot normally with the netbsd-XEN3_DOMU kernel. Keep in mind that paravirtualized NetBSD supports the Xen console only, not a graphical framebuffer.
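
For reference, the Dom0-side definition of such a DomU might look roughly like this (a sketch only — the file name, image paths, memory size and bridge name are assumptions; adjust them to your Dom0 layout):

# /etc/xen/balancer1 — paravirtual NetBSD DomU for the first load-balancer (sketch)
name    = "balancer1"
kernel  = "/var/lib/xen/images/netbsd-XEN3_DOMU"   # use netbsd-INSTALL_XEN3_DOMU for the installation run
memory  = 64
vcpus   = 1
disk    = [ "file:/var/lib/xen/images/balancer1.img,xvda,w" ]
vif     = [ "bridge=br0" ]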

Because CARP support is not compiled into the kernel by default, and because HAProxy comes as a source port rather than a binary package, you need to create a separate NetBSD VM with more disk space (4 GB) and more memory (256 MB, for example) to be able to compile from sources without cluttering your production VMs. Inside that auxiliary VM, install the kernel sources and pkgsrc (refer to the NetBSD documentation).

To create a CARP-enabled kernel, chdir to /usr/src/sys/arch/amd64/conf (replace amd64 with your architecture), where the XEN3_DOMU configuration file is located, and create a new file XEN3_DOMU_CARP with the following contents:

include "arch/amd64/conf/XEN3_DOMU"
pseudo-device carp
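
If you follow the traditional config(1) route, the build itself might look roughly like this (a sketch; it assumes the stock kernel sources under /usr/src):

# In the auxiliary VM, still inside /usr/src/sys/arch/amd64/conf:
config XEN3_DOMU_CARP
cd ../compile/XEN3_DOMU_CARP
make depend && make
# The resulting kernel is the file named "netbsd" in this compile directory.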

Compile as instructed; this will take a while. You do not need to install the built kernel; instead, upload it to Dom0 as netbsd-XEN3_DOMU_CARP (using FTP, for example). Then add a line to /etc/sysctl.conf in the production DomU:

net.inet.carp.allow=1

And create /etc/ifconfig.carp0 as well:

create
vhid 10
advbase 15
advskew 0
pass MySuperPassword
carpdev xennet0
inet 10.0.0.20
netmask 255.255.255.0

The next time you boot this DomU, it will create the carp0 network interface and will be available at the 10.0.0.20 virtual IP address.
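
The second balancer gets an analogous /etc/ifconfig.carp0 — same vhid, password and virtual IP, but a higher advskew so that it only takes over when the first node disappears (the value 100 below is just a reasonable example):

create
vhid 10
advbase 15
advskew 100
pass MySuperPassword
carpdev xennet0
inet 10.0.0.20
netmask 255.255.255.0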

Then build HAProxy, install it on the same DomU, and enable it by adding a line to /etc/rc.conf:

haproxy=YES
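
The build itself happens in the auxiliary VM via pkgsrc; it might look roughly like this (a sketch — the pkgsrc category/path and the package file name are assumptions, so check your pkgsrc tree):

# In the auxiliary build VM:
cd /usr/pkgsrc/net/haproxy && make package
# Copy the resulting binary package from /usr/pkgsrc/packages/All/ to the balancer DomU, then there:
pkg_add haproxy-*.tgz
# If the package does not install an rc.d script automatically, copy the example one into place:
cp /usr/pkg/share/examples/rc.d/haproxy /etc/rc.d/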

Then write your configuration file to /usr/pkg/etc/haproxy.cfg:

global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        maxconn 128
        chroot /var/chroot/haproxy
        user nobody
        group nobody
        daemon
        node balancer1

defaults
        log     global
        mode    tcp
        retries 3
        option  redispatch
        maxconn 64
        #timeout connect 5000
        #timeout client  50000
        #timeout server  360000
            # You may need to allow large timeouts if your application makes long-running requests,
            # especially at the initial setup stage, when the database is created and populated.

listen  Web-HTTP
        bind *:80
        bind *:443
        balance source
        #mode http
            # Enabling HTTP mode will inject cookies into client sessions, which is not always needed.
        option  httpchk /lifecheck
            # You must create an empty file "lifecheck" in web-server's document root.
        server  web1    app1.example.com       check port 80 inter 15000
        server  web2    app2.example.com       check port 80 inter 15000
            # Adjust checking intervals to your needs.

listen  DB-MySQL
        bind *:3306
        balance source
        option  mysql-check user lifecheck
            # You must create a user named "lifecheck" (at "%" host) with empty password,
            # otherwise MySQL will block this machine after several connection attempts.
        server  mysql1    data1.example.com       check inter 15000
        server  mysql2    data2.example.com       check inter 15000
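
Before relying on it, you can let HAProxy validate the file (the -c flag only parses and checks the configuration without starting the proxy):

haproxy -c -f /usr/pkg/etc/haproxy.cfg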

The next time you boot this DomU, the proxy should be operational. When a client types http://app.example.com in the address bar, the browser actually connects to balancer.example.com (to either balancer1 or balancer2, depending on which one currently owns their common virtual IP), and the request is then proxied to either app1.example.com or app2.example.com. In the same way, the web application will connect to data.example.com and be forwarded to either data1.example.com or data2.example.com.

Layers 3 and 1. Clusterization specifics for applications

Most [web] applications will probably work in a clustered environment out of the box. The only obvious obstacle is file-backed (or otherwise node-local) storage for session data: when a node goes down and clients are diverted to another node, they will need to re-authenticate, which sometimes results in their previous activity being lost. You should configure the application (or its framework, like PHP) to use database-backed sessions.

If the application collects e-mail from a POP3/IMAP server, it should check several mailboxes, one per mail-server node. An alternative solution is to use a mail server which stores all data in a database rather than in the local filesystem.

Unfortunately, HAProxy doesn’t support UDP balancing, so it can’t help with DNS failover. This is not a disaster, as long as you configure every machine to use all DNS servers — the resolver’s standard round-robin timeouts will do the trick.
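
In practice this just means listing every DNS server in each machine’s /etc/resolv.conf, along these lines (the addresses are placeholders, and the options line is optional, glibc-style tuning):

nameserver 10.0.0.11
nameserver 10.0.0.12
options timeout:2 attempts:2 rotate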

Layer 4. Clustered database

You need a special cluster DBMS (synchronous multi-master) for true high availability. The simple asynchronous master-slave replication offered by most non-commercial DBMSes won’t do, as it introduces the possibility of either data loss or mutually conflicting updates, and it requires manual switching of the master node (or writing your own automation tools), as well as modifying your application to explicitly support a “single dynamic master for writes, multiple slaves for reads” architecture. It seems that the only freely available DBMS with native clusterization is MySQL (see the mysql-cluster-* packages for openSUSE). There are some third-party solutions for PostgreSQL, but they appear to be SQL proxies which emulate synchronous updates on standalone slaves, with no integrity guarantee. Evaluation versions of well-known commercial DBMSes come without clusterization capability and implicitly prohibit any HA usage.

MySQL Cluster relies on the NDBCluster storage engine, while the mysql system database with authentication data is still stored locally on each SQL-processing node. The NDBCluster engine has additional limitations on functionality, beyond those imposed by conventional MySQL engines: for example, an application that I tried needs FOREIGN KEY constraints, which InnoDB supports but NDBCluster does not. You may need to convert TEXT blob fields to VARCHAR(n) if you need them indexed, and limit the total index size to 3072 bytes. You will surely need much more RAM, as all tables are processed in memory. Yet the performance is likely to be very low, because each operation must be securely confirmed amongst all nodes. You should probably dedicate an isolated network segment (a VLAN, using a separate physical network interface of the highest speed) for communication between MySQL Cluster components.

Setting up MySQL Cluster for the first time is cumbersome, so I will try to give an overview of it later, when I have time.

As promised, the rest of the story — about MySQL Cluster setup.

First of all, MySQL Cluster is a “whole new thing” compared to a traditional standalone MySQL server. But once you get the basics, it won’t look too different to you.

Components of MySQL Cluster

In contrast with the universal clusterization technique described above (when you take a bunch of independent nodes running a “dumb” application and wrap them with HAProxy), all database server nodes must be highly aware of each other and act in a well-coordinated manner. MySQL Cluster accomplishes this by splitting its functionality into several components:

  1. management node (“MGM”) — a server that provides overall control of the other nodes: informs them of each other, supplies them with configuration data, sends them shutdown signals and so on;
  2. data-storage node (“NDB”) — a server that actually stores a replica of the data (or a partition of a replica, if data is partitioned) in its own complex format, both on disk and in memory;
  3. SQL-processing node (“SQL” or “API”) — a frontend server, based on traditional MySQL, which processes SQL queries from your applications; it stores metadata only, as all database contents are maintained on NDB nodes; additionally, SQL nodes are capable of storing non-clustered databases locally, as usual — one example of which is the “mysql” system database that is always local — consider this when setting up authentication.

Each node can run on a dedicated [virtual] machine, or share the same system with nodes of other types (but there is no point in putting more than one node of the same type on the same physical host). This scheme allows for scaling to any number of nodes of each type: for example, you can have 2 management nodes, 2 or 3 data-storage nodes, and 10 request-processing nodes — if you find that processing power is the bottleneck in your scenario. Note, however, that multiple NDB nodes provide redundancy only, with no performance gain, unless you partition your data so that each group of NDB nodes handles its own independent part of the complete dataset (no special advice on that).

MGM nodes use very few system resources and can run on a very modest machine. NDB nodes require a lot of RAM — several times more than one could expect from looking at the configuration values: for example, if I remember correctly, for a data size of 80 MB and an index size of 32 MB, something like 1 GB of RAM is required on each startup (or even more, depending on other tunable attributes), although it then shrinks to 500–700 MB for normal work. Disk usage, in contrast, is negligible given today’s volume sizes. A more important thing to worry about is network performance between NDB nodes, as they have to act in sync and mutually confirm each step they take. Obviously, SQL nodes are all about CPU power, but a proper amount of RAM should also be present, as well as a good network connection — because the data is no longer “local” to the SQL node.

In a basic setup with 2 physical hosts, you can either create 3 virtual machines per host, each one holding its own component, or just a single VM per host that will run all components at once. Each way has its advantages and drawbacks:

  • Having every component in a separate VM:
      - facilitates proper startup and shutdown, as the components must be initialized and stopped in a given order (MGM, then NDB, and SQL last);
      - allows for independent management of the components and their OSes.
  • Combining all components in a single VM:
      - saves greatly on RAM — not only because of the removed OS overhead, but also because the free memory pool is available to all components;
      - facilitates software updates.

Aside from that, there is no difference in which path you choose. The only exception is that in the second case you will have to adjust the init-scripts to reflect service dependencies, while in the first case you will have to manage the startup order and delay of the virtual machines.

Installation

openSUSE provides pre-built packages named mysql-cluster-*, so this part is quite straightforward. The not-so-straightforward point is that an init-script is provided only for mysqld — as far as I understand, no “official” scripts exist: not from Oracle, not from SUSE, not from any other *nix distribution; it’s assumed you write them yourself. I’m not a developer; I just took the skeleton template from openSUSE 12.1:

#!/bin/sh
#
#     MySQL Cluster :: management node (cluster supervisor)
#
#     based on
#     Template SUSE system startup script for example service/daemon FOO
#     Copyright (C) 1995--2005  Kurt Garloff, SUSE / Novell Inc.
#
# /etc/init.d/ndb_mgmd
#   and its symbolic link
# /(usr/)sbin/rcndb_mgmd
#
### BEGIN INIT INFO
# Provides:          ndb_mgm
# Required-Start:    $local_fs $syslog $network $remote_fs
# Should-Start:      $time
# Required-Stop:     $local_fs $syslog $network $remote_fs
# Should-Stop:
# Default-Start:     3 5
# Default-Stop:      0 1 2 6
# Short-Description: MySQL cluster manager
# Description:       MySQL Cluster management-node service
### END INIT INFO


# Check for missing binaries (stale symlinks should not happen)
# Note: Special treatment of stop for LSB conformance
NDBMGMD_BIN=/usr/sbin/ndb_mgmd
test -x $NDBMGMD_BIN || { echo "$NDBMGMD_BIN not installed"; 
        if  "$1" = "stop" ]; then exit 0;
        else exit 5; fi; }

NDBMGMD_DIR=/var/lib/mysql-cluster
NDBMGMD_CFG=${NDBMGMD_DIR}/config.ini


# Source LSB init functions
# providing start_daemon, killproc, pidofproc, 
# log_success_msg, log_failure_msg and log_warning_msg.
. /etc/rc.status

# Reset status of this service
rc_reset


case "$1" in
    start)
        echo -n "Starting MySQL-MGM "
        /sbin/startproc $NDBMGMD_BIN -f $NDBMGMD_CFG --config-dir=$NDBMGMD_DIR
        rc_status -v
        ;;
    stop)
        echo -n "Shutting down MySQL-MGM "
        /sbin/killproc $NDBMGMD_BIN
        rc_status -v
        ;;
    try-restart|condrestart)
        if test "$1" = "condrestart"; then
                echo "${attn} Use try-restart ${done}(LSB)${attn} rather than condrestart ${warn}(RH)${norm}"
        fi
        $0 status
        if test $? = 0; then
                $0 restart
        else
                rc_reset        # Not running is not a failure.
        fi
        rc_status
        ;;
    restart)
        $0 stop
        $0 start
        rc_status
        ;;
    force-reload)
        echo -n "Reload service MySQL-MGM "
        /sbin/killproc -HUP $NDBMGMD_BIN
        rc_status -v
        ;;
    reload)
        echo -n "Reload service MySQL-MGM "
        /sbin/killproc -HUP $NDBMGMD_BIN
        rc_status -v
        ;;
    status)
        echo -n "Checking for service MySQL-MGM "
        /sbin/checkproc $NDBMGMD_BIN
        rc_status -v
        ;;
    *)
        echo "Usage: $0 {start|stop|status|try-restart|restart|force-reload|reload}"
        exit 1
        ;;
esac
rc_exit

And here is the analogous script for the data-storage node:

#!/bin/sh
#
#     MySQL Cluster :: data-storage node
#
#     based on
#     Template SUSE system startup script for example service/daemon FOO
#     Copyright (C) 1995--2005  Kurt Garloff, SUSE / Novell Inc.
#
# /etc/init.d/ndbd
#   and its symbolic link
# /(usr/)sbin/rcndbd
#
### BEGIN INIT INFO
# Provides:          ndb
# Required-Start:    $local_fs $syslog $network $remote_fs
# Should-Start:      $time ndb_mgm
# Required-Stop:     $local_fs $syslog $network $remote_fs
# Should-Stop:       ndb_mgm
# Default-Start:     3 5
# Default-Stop:      0 1 2 6
# Short-Description: MySQL cluster storage engine
# Description:       MySQL Cluster storage-node service
### END INIT INFO


# Check for missing binaries (stale symlinks should not happen)
# Note: Special treatment of stop for LSB conformance
NDBD_BIN=/usr/sbin/ndbd
test -x $NDBD_BIN || { echo "$NDBD_BIN not installed"; 
        if  "$1" = "stop" ]; then exit 0;
        else exit 5; fi; }


# Source LSB init functions
# providing start_daemon, killproc, pidofproc, 
# log_success_msg, log_failure_msg and log_warning_msg.
. /etc/rc.status

# Reset status of this service
rc_reset


case "$1" in
    start)
        echo -n "Starting MySQL-NDB "
        /sbin/startproc $NDBD_BIN
        rc_status -v
        ;;
    stop)
        echo -n "Shutting down MySQL-NDB "
        /sbin/killproc $NDBD_BIN
        rc_status -v
        ;;
    try-restart|condrestart)
        if test "$1" = "condrestart"; then
                echo "${attn} Use try-restart ${done}(LSB)${attn} rather than condrestart ${warn}(RH)${norm}"
        fi
        $0 status
        if test $? = 0; then
                $0 restart
        else
                rc_reset        # Not running is not a failure.
        fi
        rc_status
        ;;
    restart)
        $0 stop
        $0 start
        rc_status
        ;;
    force-reload)
        echo -n "Reload service MySQL-NDB "
        /sbin/killproc -HUP $NDBD_BIN
        rc_status -v
        ;;
    reload)
        echo -n "Reload service MySQL-NDB "
        /sbin/killproc -HUP $NDBD_BIN
        rc_status -v
        ;;
    status)
        echo -n "Checking for service MySQL-NDB "
        /sbin/checkproc $NDBD_BIN
        rc_status -v
        ;;
    *)
        echo "Usage: $0 {start|stop|status|try-restart|restart|force-reload|reload}"
        exit 1
        ;;
esac
rc_exit

Put them in /etc/init.d (you can name them ndb_mgm and ndb, for example), then chmod a+x them and go to the YaST service manager to enable them, or use chkconfig for this. Creating the /usr/sbin/rc* symlinks is optional, as far as I understand.
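
From the command line, the whole step might look like this (a sketch; it assumes the two files are named ndb_mgm and ndb as suggested above and sit in the current directory):

install -m 755 ndb_mgm ndb /etc/init.d/
chkconfig ndb_mgm on
chkconfig ndb on
# Optional convenience symlinks:
ln -s /etc/init.d/ndb_mgm /usr/sbin/rcndb_mgm
ln -s /etc/init.d/ndb /usr/sbin/rcndb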

You will also have to add these lines to the /etc/init.d/mysql header (inside its ### BEGIN INIT INFO block), if all components are combined together:

# Should-Start: ndb_mgm ndb
# Should-Stop: ndb_mgm ndb

Then you should rebuild the service dependency tree.
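
On openSUSE this boils down to re-running insserv (or chkconfig, which wraps it); a sketch:

insserv mysql    # recompute the boot ordering after editing the LSB header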

Configuration

MySQL Cluster components are configured in a special way, with a mixture of node-local configuration files and cluster-supplied data.

First, you need to configure the MGM node by creating /var/lib/mysql-cluster/config.ini (the path is specified by the NDBMGMD_DIR variable in the ndb_mgm init-script):

[ndb_mgmd default]
DataDir=/var/lib/mysql-cluster   # May be the same as config-dir, unless you have security considerations.


[ndbd default]
NoOfReplicas=2   # For a dual-NDB setup.
DataDir=/var/lib/mysql-ndb

DataMemory=80M
IndexMemory=16M
# MaxNoOf......=...   # Lots of limits may be increased here.


[mysqld default]


[tcp default]



[ndb_mgmd]
NodeID=1
HostName=data1.example.com   # For combined nodes.
#HostName=mgm1.example.com   # For separate nodes.

[ndb_mgmd]
NodeID=2
HostName=data2.example.com
#HostName=mgm2.example.com


[ndbd]
HostName=data1.example.com
#HostName=ndb1.example.com

[ndbd]
HostName=data2.example.com
#HostName=ndb2.example.com


[mysqld]
HostName=data1.example.com

[mysqld]
HostName=data2.example.com

[mysqld]
# Spare room for additional temporary SQL nodes in test environment.

All you have to do on an NDB node is add these lines to /etc/my.cnf (the file can be shared with an SQL node):

[mysql_cluster]

# on first node
ndb-connectstring=mgm1.example.com;mgm2.example.com
# on second node
#ndb-connectstring=mgm2.example.com;mgm1.example.com

NDB will load its main configuration from the MGM node, as specified in the “ndbd default” section of config.ini. (Don’t forget to erase the /var/lib/mysql-cluster/ndb_1_config.bin.* binary cache after changing the textual config.ini.)
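
As far as I understand, restarting the management server with --initial achieves the same thing as deleting the cache by hand; roughly (same -f and --config-dir values as in the init-script):

/etc/init.d/ndb_mgm stop
ndb_mgmd -f /var/lib/mysql-cluster/config.ini --config-dir=/var/lib/mysql-cluster --initial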

In contrast, SQL nodes are mainly configured by the usual /etc/my.cnf:

[mysqld]

ndbcluster

# on first node
ndb-connectstring=mgm1.example.com;mgm2.example.com
# on second node
#ndb-connectstring=mgm2.example.com;mgm1.example.com

ndb-wait-setup=20

default-storage-engine=NDBCLUSTER

Note that the ndb-wait-setup option instructs an SQL node to wait longer before complaining about NDB nodes not being ready — SQL nodes are eager to start serving requests as soon as possible (remember, they can still handle standalone databases), while NDB nodes become available only after a thorough convergence process, which can take a significant amount of time to complete. However, raising this timer above 30 seconds will cause mysqld’s own watchdog in the init-script to fire, and the SQL node service will be considered not running. Luckily, SQL nodes appear to retry the attempt within a matter of minutes; the only downside is that a nasty warning will be logged.

Startup and shutdown

First, start the ndb_mgm service; it should not take much time. Then launch ndb (a message like “Angel #3 connected” will be printed to the console) and wait for the nodes to synchronize — you will probably see intense HDD LED blinking during this. Finally, the mysql service can be started. To verify the current state of your cluster, log in to an MGM node and, as an unprivileged user, run ndb_mgm -e show (this time ndb_mgm is not the init-script, but a /usr/bin utility) — you should see that all MGM, NDB and API nodes are connected, and that each NDB nodegroup has its master (there will be only one nodegroup if data isn’t partitioned).
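
Condensed into commands (using the init-script names chosen above), the startup and the check look like this:

/etc/init.d/ndb_mgm start     # management node first
/etc/init.d/ndb start         # then the data-storage node(s); wait for them to synchronize
/etc/init.d/mysql start       # SQL node last
ndb_mgm -e show               # run on an MGM host to verify that all nodes are connected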

Services are best shut down in the reverse order, or with the command ndb_mgm -e shutdown. That said, I see no problems with stopping services in arbitrary order, as all components always try to inform their neighbors about their disassociation.

(to be continued in the next message due to size limitations on this forum)

(continuing)

Converting your applications for NDB

As I already said, the NDBCluster engine limits MySQL functionality even further compared to traditional MySQL engines like MyISAM and InnoDB.

  • Some of those limitations can make NDB really unsuitable for your application, such as the lack of FOREIGN KEY support in my example.
  • Other limitations relate to field type and size, as well as overall record size and index size. You can try to convert TEXT to VARCHAR(n), decrease the size of a VARCHAR(n) field, convert a “prefixed” index to a “full-value” one, or maybe drop an index altogether (this will hurt performance); yet again, don’t forget that UTF-8 counts each character as 3 bytes. And don’t be surprised if you encounter table schemas that require a record size of much more than 14,000 bytes (simply by having an incredible number of relatively small fields), etc.

Aside from that, you will probably have to remove all explicit MyISAM and InnoDB references from your schemas (see the sketch below). Moreover, some programmers put those references inside their PHP/Perl/etc. code, and such a program can even refuse to start if InnoDB support is disabled on the SQL node; you will have to remove those stupid checks.
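
For a schema that ships as an SQL dump, the clean-up can be as simple as stripping the engine clauses before import, so that default-storage-engine=NDBCLUSTER takes effect; for an already-imported database, tables can be converted one by one (the file and table names below are placeholders):

# Strip explicit engine clauses from a dump before importing it:
sed -i -e 's/ENGINE=MyISAM//g' -e 's/ENGINE=InnoDB//g' schema.sql
# Or convert an existing table in place:
mysql -h data.example.com -e "ALTER TABLE myclustereddb.table1 ENGINE=NDBCLUSTER;"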

Be sure to allow large timeouts for HTTP connections when creating and populating your database from a web-based configuration wizard, as those operations will take a very long time in clustered mode — otherwise your browser, HAProxy, or httpd will break the connection.

Check that your database is really stored in the cluster:

$ mysql -h data.example.com

> USE myclustereddb;

> SHOW TABLE STATUS;

Name   | Engine ...
----------------------
table1 | ndbcluster ...

If you see “MyISAM” or “InnoDB” instead of “ndbcluster”, then you have missed some explicit engine specification in your application, or haven’t redefined the default storage engine in the SQL node configuration.

Again, let me remind you that the “mysql” system database is still local to each SQL node. For example, if a web-based (or script-based) setup wizard created a database user for you, these credentials are not replicated automatically to other SQL nodes.
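
So repeat the same grant on every other SQL node yourself; a sketch (the user name and password are placeholders):

mysql -h data2.example.com -e "GRANT ALL ON myclustereddb.* TO 'appuser'@'%' IDENTIFIED BY 'secret';"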

Performance

To emphasize it again: the performance will probably be very low, or just low — much lower than what you are enjoying with locally-stored databases. But the enhanced reliability is worth it, I think.

Awesome write-up, SamsonovAnton. You are to be commended for sharing this information with us. Your write-up is bookmarked for use by anyone else who asks this question.

Thank You,