I have a pacemaker/corosync/drbd setup on two physically identical Ubuntu Server 16.04 LTS machines and I am trying to achieve high availability for MySQL 5.7 and Apache 2.4.
Both servers were set up the exact same way and have the exact same packages installed. The only differences are the hostnames, the IP addresses, and the master/slave configuration in pacemaker/corosync/drbd.
My problem is that Pacemaker is able to start the MySQL server and every other service on node 1, but when I simulate a crash of node 1, it is not able to start the MySQL service on node 2.
This is the output of crm_mon (both nodes online):
Last updated: Wed Jan 10 18:57:02 2018
Last change: Wed Jan 10 18:00:19 2018 by root via crm_attribute on Server1
Stack: corosync
Current DC: Server1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 7 resources configured

Online: [ Server1 Server2 ]

 Master/Slave Set: ms_r0 [r0]
     Masters: [ Server1 ]
     Slaves: [ Server2 ]
 Resource Group: WebServer
     ClusterIP  (ocf::heartbeat:IPaddr2):    Started Server1
     WebFS      (ocf::heartbeat:Filesystem): Started Server1
     Links      (ocf::heartbeat:drbdlinks):  Started Server1
     DBase      (ocf::heartbeat:mysql):      Started Server1
     WebSite    (ocf::heartbeat:apache):     Started Server1
But when I simulate the crash of node 1, I get:
Last updated: Wed Jan 10 19:05:25 2018
Last change: Wed Jan 10 19:05:17 2018 by root via crm_attribute on Server1
Stack: corosync
Current DC: Server1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 7 resources configured

Node Server1: standby
Online: [ Server2 ]

 Master/Slave Set: ms_r0 [r0]
     Masters: [ Server2 ]
 Resource Group: WebServer
     ClusterIP  (ocf::heartbeat:IPaddr2):    Started Server2
     WebFS      (ocf::heartbeat:Filesystem): Started Server2
     Links      (ocf::heartbeat:drbdlinks):  Started Server2
     DBase      (ocf::heartbeat:mysql):      Stopped
     WebSite    (ocf::heartbeat:apache):     Stopped
Failed Actions:
* DBase_start_0 on Server2 'unknown error' (1): call=45, status=complete,
    exitreason='MySQL server failed to start (pid=3346) (rc=1), please check your installation',
    last-rc-change='Wed Jan 10 17:58:15 2018', queued=0ms, exec=2202ms
This was my initial Pacemaker configuration: https://pastebin.com/kEYjjgKw
After I noticed the problem with starting MySQL on node 2, I did some research and read that one should pass some additional parameters to MySQL in the Pacemaker configuration.
That's why I changed the Pacemaker configuration to this: https://pastebin.com/J7Zk1kBA
Unfortunately this did not solve the problem.
From my understanding, Pacemaker uses the same command on both machines to start the MySQL daemon. That's why I find it puzzling that it cannot start MySQL on node 2, which was configured the exact same way.
drbd0 is mounted by Pacemaker, and drbdlinks creates the symbolic links for /var/www and /var/lib/mysql.
I tested this functionality and it seems to work: when node 1 is offline, drbd0 is mounted on node 2 and the symbolic links are created. /var/lib/mysql points to drbd0 and all the files are in the directory.
If you have any ideas or advice on how to narrow down the cause of this problem, I would be really thankful if you could post them here.
If there is more information needed I am happy to provide it.
Thanks in advance!
Regards,
PAlbrecht
Best Answer
When I have had to work with Pacemaker in the past, there are a few procedures I use when troubleshooting this sort of thing. The general idea is to verify each dependency "layer" of the Pacemaker config, where the dependency graph is:
mysql -> mounting of filesystem -> DRBD master
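In crm shell syntax, that chain is typically enforced by colocation and ordering constraints along these lines (a sketch using the resource names from the crm_mon output above; the actual constraints are in the pastebin configs):

    colocation WebServer-with-drbd-master inf: WebServer ms_r0:Master
    order WebServer-after-drbd-promote inf: ms_r0:promote WebServer:start

If either constraint is missing or wrong, Pacemaker may try to start MySQL before the DRBD master and its filesystem are in place.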
Also Clusters from Scratch has a good walkthrough of a very similar config.
The first thing is to make sure that DRBD is configured and fully synced. On either node, run:

    cat /proc/drbd
The output should show something like the following if DRBD is fully synced and ready for a failover (see p. 45 of CfS):
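Illustratively, a fully synced pair looks like this (an 8.4-style sketch; the version string and counters will differ on your systems):

    version: 8.4.5 (api:1/proto:86-101)
     0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
        ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

The important fields are cs:Connected and ds:UpToDate/UpToDate.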
If cat /proc/drbd instead shows that a sync is still in progress (also on p. 45 of CfS),
then the system isn't in a state where it can successfully fail over. Wait for the sync to complete and then retry your failover test.
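For comparison, a device that is still syncing reports something like this (again an illustrative 8.4-style sketch; the percentage and rates will vary):

     0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
        [=========>..........] sync'ed: 52.1%
        finish: 0:01:23 speed: 12,340 (12,340) K/sec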
Assuming DRBD was synced before the simulated failure of node1, the next thing to try when the DB does not come up on node2 is to log in to node2 and check the following:
- Does cat /proc/drbd show node2 as primary?
- Does mount show /dev/drbd0 mounted at its configured mount point (from the pastebin, this should be '/sync')?
- And most importantly, if both of the above are answered in the affirmative: does /etc/init.d/mysql start (or the systemctl equivalent) bring MySQL up?

If MySQL starts up manually, then there's likely something wrong in its pacemaker config.
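On Ubuntu 16.04 the systemd version of that manual check could look like this (assuming the stock mysql-server-5.7 package, whose unit is named mysql):

    systemctl start mysql
    systemctl status mysql
    journalctl -u mysql --no-pager -n 50

If the manual start fails too, the MySQL error log (/var/log/mysql/error.log on a stock install) usually contains the real reason, e.g. wrong permissions on the datadir or an AppArmor denial on the symlinked /var/lib/mysql.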
In full disclosure: I haven't personally used the ocf::heartbeat:mysql resource; instead I've used the 'lsb' resource 'lsb:mysql'.
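For reference, an lsb-based primitive looks something like this in crm shell (a sketch only, not a drop-in replacement for the pastebin config):

    primitive DBase lsb:mysql \
        op monitor interval=15s

This simply delegates start/stop/status to /etc/init.d/mysql, which sidesteps the OCF agent's parameter handling.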