HA (cluster) nodes keep getting fenced

Hi all,

I need some advice on how to proceed. I’ve got a 3-node setup based on Tumbleweed.
The nodes are set up to use SBD (storage-based fencing) and the no-quorum-policy is set to “ignore”.
Everything appears to be cool, but the nodes (all of them) keep rebooting. After a reboot they stay up for a few minutes (sometimes an hour or so) and then they reboot, one at a time.
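
From memory (this is not a paste of my exact configuration), the relevant bits in “crm configure show” look roughly like this; the stonith-sbd resource is whatever ha-cluster-init created:

```
primitive stonith-sbd stonith:external/sbd
property cib-bootstrap-options: \
    stonith-enabled=true \
    no-quorum-policy=ignore
```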

I’m trying to figure out what is happening, but I cannot find anything. Do any of you know where the log files are stored and what to look for?

My suspicion is that the fencing works too well and that the nodes are fencing each other.

Anyone got advice?

By the way: the cluster is running in ESX (Virtual Machines).

Over the years, some have posted about creating HA clusters, but not often.
On top of that, the basic architecture changed significantly about two years ago (IIRC).

If you’d like, post the steps you took setting up your cluster, including any references you might have followed.
That way, any missteps might be identified, and others who are just curious might try to replicate what you’ve set up.

HTH,
TSU

Thanks for your reply.

Basically I followed the installation details from the SUSE Linux Enterprise documentation, but did it with openSUSE Tumbleweed instead of SLES + HAE.

So the steps I took in order to get the cluster running:

  1. Set up shared storage (did this with Openfiler in the lab and with a Dell EqualLogic in production).
  2. Set up iSCSI on all nodes so they all connect to the same shared storage (the same target).
  3. Set up name resolution (DNS/hosts file) and time synchronization.
  4. Set up passwordless SSH login (ssh-keygen & ssh-copy-id).
  5. Installed pacemaker, corosync and ha-cluster-bootstrap (the last one makes setting up a cluster fairly simple).
  6. Created a partition on the shared storage to be used for storage-based fencing, as per the SLE documentation.
  7. Then I ran the ha-cluster-init script to initialize the first node (a rough command sketch of steps 6–12 follows this list).
  8. After the first node is up, all seems OK: node1 is online and the SBD resource is running. Then I ran ha-cluster-join on the other node(s).
  9. After a few minutes the second node comes online and resources can be moved.
  10. Tested SBD: killing the network or killing the pacemaker process triggers a reboot (the node gets fenced).
  11. At this point all seemed really good, so time to configure the MySQL cluster resource. I put the MySQL database files on a new partition of the shared storage.
  12. Configured a resource with a secondary IP.
  13. Tested again: the service can be moved to the other node (back and forth). If a node gets fenced, the resources move to the other node, and the fenced node reboots and rejoins the cluster.
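
The rough command sketch promised above; the device paths, IP address and resource names are placeholders written from memory rather than my exact values, and I have left out the Filesystem resource for the shared MySQL partition:

```
# steps 6/7: initialize the first node, pointing ha-cluster-init at the SBD partition
ha-cluster-init -s /dev/disk/by-id/<shared-disk>-part1

# step 8: on each additional node, join the existing cluster
ha-cluster-join -c node1

# quick SBD sanity check: dump the on-disk header and timeouts
sbd -d /dev/disk/by-id/<shared-disk>-part1 dump

# steps 11/12: MySQL resource plus a floating IP, grouped so they move together
crm configure primitive p-ip ocf:heartbeat:IPaddr2 params ip=192.168.1.50 cidr_netmask=24
crm configure primitive p-mysql ocf:heartbeat:mysql params datadir=/mnt/shared/mysql
crm configure group g-mysql p-ip p-mysql
```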

So in my setup and testing everything went great, and we started using the database for PowerDNS. Still no problems. But when I look at HAWK (the web management interface for Pacemaker), I notice that on a regular basis one of the nodes is offline, and after half a minute it is back online and has rejoined the cluster. Apparently this triggers the other node to reboot as well. This process keeps repeating a couple of times (no pattern detected so far) and then clears up.
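
To see the same events outside HAWK I have been looking at the logs on the node that stayed up, roughly like this (the time window is just an example):

```
# corosync membership changes and fencing/STONITH events
journalctl -u corosync -u pacemaker --since "2 hours ago" | grep -iE "member|fence|stonith"

# SBD messages also end up in the system log, if syslog is writing there
grep -i sbd /var/log/messages | tail -n 50
```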

So I was thinking that with a 2-node cluster you never get a majority where nodes can vote on which one is the bad node, so I added an extra node to the cluster. Now the same behaviour occurs with three nodes: there are still 2 nodes online every time, but one keeps getting fenced. After a couple of hours it looks stable, but sometimes, out of the blue, the nodes start bouncing again.
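
For what it is worth, this is how I have been checking the vote and quorum state since adding the third node:

```
# current membership, expected votes and quorum state
corosync-quorumtool -s

# one-shot overview of node and resource status
crm_mon -1
```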

I hope this provides some useful details for writing a complete How-To-Install-Cluster story.

Anyway, thanks, and any help will be appreciated.

Cool.
Hope someone might respond.
I myself set up application-layer clustering, so I haven’t personally looked at this in a long time, but to my eye the documentation you’re following appears up to date and incorporates the major changes (particularly corosync) introduced a couple of years ago.

I myself may take a look at this but will have to wait until at least next week…

BTW - YMMV for your purposes, but you may want to install the upgraded VMware Tools. If you hadn’t heard, VMware has deprecated distributing Tools updates, and all future upgrades have been turned over to an open-source community project (yes, that means no more proprietary releases and maybe some community distribution). Starting a couple of months ago (approx. June 2015), all Tools should be upgraded using this project. Be aware though that **TW currently has issues**… I’ve identified that gcc has to be downgraded to gcc46 and the legacy network tools package has to be installed.
https://github.com/rasa/vmware-tools-patches

TSU

crm_report collects comprehensive diagnostic information from all cluster nodes. If you make the result available, someone may take a look; run it after a node has been fenced and come back online, and state the exact date/time when it happened.
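
Something along these lines; the time window below is only an example, adjust it so it brackets the fencing event:

```
# collect logs and configuration from all nodes for the given window
crm_report -f "2015-09-20 10:00" -t "2015-09-20 14:00" /tmp/fence-report
```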

Sorry guys for my delayed reaction.

When I run crm_report, the cluster is triggered into executing the SBD STONITH action, and then one by one the nodes reboot and come back online.

In the meantime I’ve been updating to the latest version of Tumbleweed, and somehow the nodes are surprisingly stable (at least for now).

I don’t know if the software updates are responsible for this, but for now it’ll do.

Thank you all for your help and advice!

Cool, but seriously reconsider using TW for this. I can see the pros of having a rolling distro, but …

BTW, the cluster has 2 nodes? IIRC, Richard Brown mentioned in his presentation on HA clusters at osC2015 that three is the best option. The video is on openSUSE’s YouTube channel.

If you are using fencing, it does not really matter; and you should be using fencing if there is even a remote chance of data being modified concurrently.
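
And if the cluster is ever run with only two nodes again, corosync’s votequorum needs to know about it; a sketch of the relevant corosync.conf section (values are illustrative, not taken from your setup):

```
# /etc/corosync/corosync.conf (fragment)
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
```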

Is there any indication in any of the logs (/var/log/corosync-all.log on our CentOS-based Lustre setup) as to what is causing the fencing?

Given that you’re using virtual machines with an iSCSI-backed SBD device, I can imagine there is latency in that stack that runs into the SBD timeout.

Also, can the fencing events be correlated to load either on the physical machines or the storage device?
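
On the timeout point: you can read the timeouts stored in the SBD header and, if the iSCSI path is slow, recreate the device with larger values. The numbers below are only an illustration, and recreating the device wipes its slots, so do it with the cluster stopped:

```
# show the watchdog / msgwait timeouts written to the SBD partition
sbd -d /dev/disk/by-id/<sbd-partition> dump

# example: recreate with a 30 s watchdog timeout (-1) and 60 s msgwait (-4);
# pacemaker's stonith-timeout should then be larger than msgwait
sbd -d /dev/disk/by-id/<sbd-partition> -1 30 -4 60 create
```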