Setup
I'm setting up an HA cluster for a web application using two physical servers in a Corosync/Pacemaker managed cluster.
After finding out I was heading the wrong way, I decided to use heartbeat's bundled MySQL resource agent to manage my MySQL instances across the cluster.
Currently, there is a working master/slave configuration from node1 (current master) to node2 (current slave).
Now I would like Pacemaker to manage my MySQL instances so it can promote/demote master or slave.
According to this (old) wiki page, I should be able to achieve the setup by doing so:
primitive p_mysql ocf:heartbeat:mysql \
    params binary="/usr/sbin/mysqld" \
    op start timeout="120" \
    op stop timeout="120" \
    op promote timeout="120" \
    op demote timeout="120" \
    op monitor role="Master" timeout="30" interval="10" \
    op monitor role="Slave" timeout="30" interval="20"
ms ms_mysql p_mysql \
    meta clone-max=3
As you can see, I did slightly change the interval of the second op monitor operation, since I know Pacemaker identifies actions by resource name (here, p_mysql), action name, and interval. Using a different interval was the only way to differentiate the monitor action on a slave node from the monitor action on a master node.
Problem
After committing the changes to the CIB and opening an interactive crm_mon, I could see that Pacemaker failed to start the resource on every node. See attached screenshots:
Sorry cannot upload more than 2 links because I do not have enough reputation yet… Screenshots in comments
And it loops over and over, trying to set the current master to a slave, the current slave to a slave, then to a master… It is clearly looping and fails to properly start the MySQL instances.
For reference, my crm configure show:
node 1: primary
node 2: secondary
primitive Failover ocf:onlinenet:failover \
    params api_token=108efe5ee771368557869c7a837361a7c786f210 failover_ip=212.129.48.135
primitive WebServer apache \
    params configfile="/etc/apache2/apache2.conf" statusurl="http://127.0.0.1/server-status" \
    op monitor interval=40s \
    op start timeout=40s interval=0 \
    op stop timeout=60s interval=0
primitive p_mysql mysql \
    params binary="/usr/sbin/mysqld" \
    op start timeout=120 interval=0 \
    op stop timeout=120 interval=0 \
    op promote timeout=120 interval=0 \
    op demote timeout=120 interval=0 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=20
ms ms_mysql p_mysql \
    meta clone-max=3
clone WebServer-clone WebServer
colocation Failover-WebServer inf: Failover WebServer-clone
property cib-bootstrap-options: \
    dc-version=1.1.12-561c4cf \
    cluster-infrastructure=corosync \
    cluster-name=ascluster \
    stonith-enabled=false \
    no-quorum-policy=ignore
Best Answer
Solution
Thanks to the folks that investigated with me, I was able to find the solution to my problem and I do now have a working setup. If you feel brave enough, you can read the comments on the original question, but here is a summary of the steps that helped me solve my issue.
Read the source
The first thing to do when setting up HA resources will sound obvious, but RTFM. No, seriously: learn how the software you are planning to use works. In this particular case, my first mistake was not reading carefully enough to understand how the resource agent (RA) works. Since I was using the mysql RA provided by Heartbeat, the RA source script was available in ClusterLabs' resource-agents GitHub repo.
Make sure your software is up-to-date
This was not clearly identified as an issue in my particular case, but as @gf_ and @remote mind suggested, it is always a good thing to have a version of your RA that works with your software version.
Fill-in the damn params
Strictly speaking, that is not always necessary: sometimes you can leave optional parameters out. But honestly, if I had provided every optional parameter I could to the RA, I would have fixed my issue much more quickly.
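As a rough illustration of the point above, here is a sketch of what the p_mysql primitive could look like with the monitoring credentials filled in. The test_user and test_passwd parameters are the ones from the RA meta-data that this answer goes on to discuss. The account name ra_monitor and the password shown are placeholders (assumptions, not my real values), and that account has to be allowed to run the RA's count query.

```
# Sketch only: ra_monitor / CHANGE_ME are hypothetical placeholder values.
primitive p_mysql ocf:heartbeat:mysql \
    params binary="/usr/sbin/mysqld" \
        test_user="ra_monitor" test_passwd="CHANGE_ME" \
    op start timeout=120 interval=0 \
    op stop timeout=120 interval=0 \
    op promote timeout=120 interval=0 \
    op demote timeout=120 interval=0 \
    op monitor role=Master timeout=30 interval=10 \
    op monitor role=Slave timeout=30 interval=20
```

The full list of parameters the agent accepts, with their defaults, should be visible with crm ra info ocf:heartbeat:mysql, so any other optional parameter can be filled in the same way once you know what the RA does with it.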
This is actually where the Read the source part is important, since it lets you truly understand why the parameters are needed. However, since they are often only briefly described, you may need to go further than the meta-data and find where the parameters are used. In my case, the thing did not work for several reasons:
test_user, test_passwd: these were present in the meta-data, but I thought that I did not need them. After I decided to look at what they were used for, I found out that these parameters are used to perform a simple select count(*) on the database. Since the defaults are to use user root without a password, it did not work in my case (because on my databases, root needs a password to connect to the database). This prevented the RA from checking whether the current node was a slave or not.
Final word
Again, thanks a lot to @gf_ for taking the time to investigate with me and provide leads in order to debug my setup.
Good HA setups are not that easy to achieve (especially when starting from scratch), but if well configured can be really powerful and provide peace of mind.