Centos – could heartbeat v3 monitor service/resource status without pacemaker

centosheartbeatheartbeat3high-availabilitylinux-ha

I am trying to setup a HA system based on CentOS, I want to use heartbeat v3 for the "heartbeat" mechanism, resource start/stop, looks heartbeat can get it done.
But about the resource status, heartbeat cannot monitor it, for example httpd, we manually stop it, but it cannot be restarted or switch to another node.

Could heartbeat v3 achieve it?

Best Answer

You need to use a proper Cluster Resource Manager like Pacemaker, in conjunction with a messaging layer like Heartbeat or Corosync. So no, Heartbeat v3 is not going to cut it, because it only does the messaging/heartbeat part.

If you search this site for other Heartbeat and Pacemaker-related questions you'll see that the best supported and most stable and feature-rich HA stack is based on Corosync and Pacemaker. It is not wise to use any other combination nowadays, unless you have a very specific reason and know exactly what you are doing.

Here is some material on Corosync and Pacemaker to get you started: http://www.linuxjournal.com/content/ahead-pack-pacemaker-high-availability-stack?page=0,0 and http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/.

Related Solutions

Linux – Unexpected start of already-primary server processes when heartbeat on secondary is stopped

For what it is worth, I feel your pain. It seems that heartbeat considers the loss of the passive node the same as a takeover of the passive node, so it starts its services. When the start scripts failed, and there was no other node to which to failover, heartbeat stayed primary and shutdown all the services. The only way to get back up again is to re-start heartbeat when this happens.

We dealt with this problem by making a single script that starts all of the cluster services (IP, FS mount, ipvsadm, Apache, etc) only if they are not already running. We make sure the "all in one" init script only returns non-zero for actual startup failures (and not for warnings like "already running") to avoid problems like this.

Monitoring Varnish with Heartbeat and Pacemaker

Your cluster architecture confuses me, as it seems you are running services that should be cluster-managed (like Varnish) standalone on two nodes at the same time and let the cluster resource manager (CRM) just juggle IP addresses around.

What is it you want to achieve with your cluster setup? Fault tolerance? Load balancing? Both? Mind you, I am talking about the cluster resources (Varnish, IP addresses, etc), not the backend servers to which Varnish distributes the load.

To me it sounds like you want an active-passive two-node cluster, which provides fault tolerance. One node is active and runs Varnish, the virtual IP addresses and possibly other resources, and the other node is passive and does nothing until the cluster resource manager moves resources over to the passive node, at which point it becomes active. This is a tried-and-true architecture that is as old as time itself. But for it to work you need to give the CRM full control over the resources. I recommend following Clusters from Scratch and modelling your cluster after that.

Edit after your updated question: your CIB looks good, and once you patched the Varnish init script so that repeated calls to "start" return 0 you should be able to add the following primitive (adjust the timeouts and intervals to your liking):

primitive p_varnish lsb:varnish \
    op monitor interval="10s" timeout="15s" \
    op start interval="0" timeout="10s" \
    op stop interval="0" timeout="10s"

Don't forget to add it to the balancer group (the last element in the list):

group balancer eth0_gateway eth1_iceman_slider eth1_iceman_slider_ts \
    eth1_iceman_slider_pm eth1_iceman_slider_jy eth1_iceman eth1_slider \
    eth1_viper eth1_jester p_varnish

Edit 2: To decrease the migration threshold add a resource defaults section at the end of your CIB and set the migration-threshold property to a low number. Setting it to 1 means the resource will be migrated after a single failure. It is also a good idea to set resource stickiness so that a resource that has been migrated because of node failure (reboot or shutdown) does not automatically get migrated back later when the node is available again.

rsc_defaults $id="rsc-options" \
    resource-stickiness="100" \
    migration-threshold="1"

Best Answer

Related Solutions

Linux – Unexpected start of already-primary server processes when heartbeat on secondary is stopped

Monitoring Varnish with Heartbeat and Pacemaker

Related Topic