Multi-state MySQL master/slave pacemaker resource fails to launch on cluster nodes

Tags: corosync, high-availability, mysql-replication, pacemaker

Setup

I'm setting up an HA cluster for a web application using two physical servers in a Corosync/Pacemaker managed cluster.

After finding out I was heading the wrong way, I decided to use Heartbeat's bundled MySQL resource agent to manage my MySQL instances across the cluster.

Currently, there is a working master/slave configuration from node1 (current master) to node2 (current slave).
Now I would like Pacemaker to manage my MySQL instances so it can promote/demote master or slave.

According to this (old) wiki page, I should be able to achieve this setup with the following configuration:

primitive p_mysql ocf:heartbeat:mysql \
  params binary="/usr/sbin/mysqld" \
  op start timeout="120" \
  op stop timeout="120" \
  op promote timeout="120" \
  op demote timeout="120" \
  op monitor role="Master" timeout="30" interval="10" \
  op monitor role="Slave" timeout="30" interval="20"

ms ms_mysql p_mysql \
  meta clone-max=3

As you can see, I did, however, slightly change the interval of the second monitor op. Pacemaker identifies an action by resource name (here, p_mysql), action name, and interval, so a different interval was the only way to distinguish the monitor action for the Slave role from the monitor action for the Master role.

Problem

After committing the changes to the CIB and opening an interactive crm_mon, I could see that Pacemaker failed to start the resource on every node. See the attached screenshots:

Sorry, I cannot upload more than two links because I do not have enough reputation yet… The screenshots are in the comments.

It loops over and over, trying to set the current master to a slave, the current slave to a slave, then to a master… It is clearly cycling and never manages to bring the MySQL instances up properly.

For reference, here is the output of crm configure show:

node 1: primary
node 2: secondary
primitive Failover ocf:onlinenet:failover \
    params api_token=108efe5ee771368557869c7a837361a7c786f210 failover_ip=212.129.48.135
primitive WebServer apache \
    params configfile="/etc/apache2/apache2.conf" statusurl="http://127.0.0.1/server-status" \
    op monitor interval=40s \
    op start timeout=40s interval=0 \
    op stop timeout=60s interval=0
primitive p_mysql mysql \
    params binary="/usr/sbin/mysqld" \
    op start timeout=120 interval=0 \
    op stop timeout=120 interval=0 \
    op promote timeout=120 interval=0 \
    op demote timeout=120 interval=0 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=20
ms ms_mysql p_mysql \
    meta clone-max=3
clone WebServer-clone WebServer
colocation Failover-WebServer inf: Failover WebServer-clone
property cib-bootstrap-options: \
    dc-version=1.1.12-561c4cf \
    cluster-infrastructure=corosync \
    cluster-name=ascluster \
    stonith-enabled=false \
    no-quorum-policy=ignore

Best Answer

Solution

Thanks to the folks who investigated with me, I was able to find the solution to my problem, and I now have a working setup. If you feel brave enough, you can read the comments on the original question, but here is a summary of the steps that helped me solve the issue.

Read the source

The first thing to do when setting up HA resources will sound obvious, but: RTFM. Seriously, learn how the software you are planning to use actually works. In this particular case, my first mistake was not reading and understanding carefully enough how the resource agent (RA) works. Since I was using the mysql RA shipped with Heartbeat, the RA source script is available in ClusterLabs' resource-agents GitHub repo.

Do not forget to read the source of included files!
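For the mysql RA, a quick way to do this directly on a node is to look at the meta-data and then at the script itself. The paths below are the usual ones on Debian 8 and may differ on your system; recent resource-agents versions move most default values into a shared mysql-common.sh helper, which is exactly the kind of included file I mean:

# Show the RA meta-data (parameters, defaults, actions)
crm ra info ocf:heartbeat:mysql

# Read the script itself and the helper it sources
less /usr/lib/ocf/resource.d/heartbeat/mysql
less /usr/lib/ocf/lib/heartbeat/mysql-common.sh

# Locate the hidden default values
grep -n '_default=' /usr/lib/ocf/lib/heartbeat/mysql-common.sh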

Make sure your software is up-to-date

This was not clearly identified as an issue in my particular case, but as @gf_ and @remote mind suggested, it is always a good idea to run a version of your RA that matches your software version.
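On a Debian-based node, a quick sanity check of what is actually installed might look like this (package names are the Debian ones; adjust for your distribution), and you can then compare the resource-agents version against the upstream ClusterLabs repo if something looks off:

dpkg -l pacemaker corosync resource-agents crmsh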

Fill-in the damn params

Number one rule in HA: do not rely on default values.

That is not entirely true, sometimes you can, but honestly, if I had provided every optional parameter I could to the RA, I would have fixed my issue much faster.

This is actually where the Read the source part matters, since it allows you to truly understand why the parameters are needed. However, since they are often only briefly described in the meta-data, you may need to go further and find where the parameters are actually used. In my case, the setup did not work for several reasons:

  • I did not provide the socket path, and the default one in the script did not match the default one on my system (Debian 8).
  • I did not provide test_user and test_passwd: these were present in the meta-data, but I thought I did not need them. After looking at what they were used for, I found out that these parameters are used to perform a simple select count(*) on the database. Since the defaults use the root user with no password, this did not work in my case (on my databases, root needs a password to connect). This particular step prevented the RA from checking whether the current node was a slave or not.
  • Some other params were also missing, and I only knew I needed them once I had discovered where the damn default values were hidden (see the sketch after this list).
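To make this concrete, here is a minimal sketch of the primitive with the parameters discussed above filled in. Treat it as an illustration rather than a copy-paste recipe: the socket/pid paths and the credentials are Debian 8 style placeholders, not the values from my actual setup, and they must match your own MySQL configuration (crm ra info ocf:heartbeat:mysql lists the full set of parameters):

primitive p_mysql ocf:heartbeat:mysql \
    params binary="/usr/sbin/mysqld" \
        config="/etc/mysql/my.cnf" \
        datadir="/var/lib/mysql" \
        pid="/var/run/mysqld/mysqld.pid" \
        socket="/var/run/mysqld/mysqld.sock" \
        replication_user="repl" replication_passwd="repl_password" \
        test_user="test_user" test_passwd="test_password" \
    op start timeout=120 interval=0 \
    op stop timeout=120 interval=0 \
    op promote timeout=120 interval=0 \
    op demote timeout=120 interval=0 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=20

ms ms_mysql p_mysql \
    meta clone-max=3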

Final word

Again, thanks a lot to @gf_ for taking the time to investigate with me and provide leads in order to debug my setup.

Good HA setups are not that easy to achieve (especially when starting from scratch), but, if well configured, they can be really powerful and provide peace of mind.

Note: peace of mind not guaranteed ;)