Ubuntu – pacemaker not able to start MySQL on second node

Tags: corosync, MySQL, pacemaker, Ubuntu

I have a pacemaker/corosync/drbd setup on two physically identical Ubuntu Server 16.04 LTS machines and I am trying to achieve high availability for MySQL 5.7 and Apache 2.4.

Both servers were set up in exactly the same way and have the exact same packages installed. The only differences are the hostnames, IP addresses, and the master/slave configuration in pacemaker/corosync/drbd.

My problem is that pacemaker is able to start the MySQL server and every other service on node 1, but when I simulate a crash of node 1, it is not able to start the MySQL service on node 2.

This is the output of crm_mon (both nodes online):

Last updated: Wed Jan 10 18:57:02 2018          Last change: Wed Jan 10 18:00:19
2018 by root via crm_attribute on Server1
Stack: corosync
Current DC: Server1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 7 resources configured

Online: [ Server1 Server2 ]

 Master/Slave Set: ms_r0 [r0]
 Masters: [ Server1 ]
 Slaves: [ Server2 ]
Resource Group: WebServer
 ClusterIP  (ocf::heartbeat:IPaddr2):       Started Server1
 WebFS      (ocf::heartbeat:Filesystem):    Started Server1
 Links      (ocf::heartbeat:drbdlinks):     Started Server1
 DBase      (ocf::heartbeat:mysql): Started Server1
 WebSite    (ocf::heartbeat:apache):        Started Server1

But when I simulate the crash of node 1, I get:

Last updated: Wed Jan 10 19:05:25 2018          Last change: Wed Jan 10 19:05:17
2018 by root via crm_attribute on Server1
Stack: corosync
Current DC: Server1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 7 resources configured

Node Server1: standby
Online: [ Server2 ]

Master/Slave Set: ms_r0 [r0]
 Masters: [ Server2 ]
Resource Group: WebServer
 ClusterIP  (ocf::heartbeat:IPaddr2):       Started Server2
 WebFS      (ocf::heartbeat:Filesystem):    Started Server2
 Links      (ocf::heartbeat:drbdlinks):     Started Server2
 DBase      (ocf::heartbeat:mysql): Stopped
 WebSite    (ocf::heartbeat:apache):        Stopped

Failed Actions:
* DBase_start_0 on Server2 'unknown error' (1): call=45, status=complete,
  exitreason='MySQL server failed to start (pid=3346) (rc=1), please check your installation',
  last-rc-change='Wed Jan 10 17:58:15 2018', queued=0ms, exec=2202ms

This was my initial Pacemaker configuration: https://pastebin.com/kEYjjgKw

After I noticed that there is a problem with the start of MySQL on node 2, I did some research and read that one should pass some additional parameters to MySQL in the Pacemaker configuration.
That's why I changed the Pacemaker configuration to this: https://pastebin.com/J7Zk1kBA
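
For context, a mysql primitive with such additional parameters typically looks roughly like this in crm syntax (illustrative values with typical Ubuntu 16.04 paths; the actual configuration is in the pastebin):

# sketch of an ocf:heartbeat:mysql primitive with explicit parameters
primitive DBase ocf:heartbeat:mysql \
    params binary="/usr/sbin/mysqld" config="/etc/mysql/my.cnf" \
        datadir="/var/lib/mysql" user="mysql" group="mysql" \
        pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" \
    op start timeout="120s" interval="0" \
    op stop timeout="120s" interval="0" \
    op monitor interval="20s" timeout="30s"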

Unfortunately this did not solve the problem.

From my understanding, Pacemaker uses the same command on both machines to start the MySQL daemon. That's why I find it rather absurd that it is not able to start MySQL on node 2, which was configured in exactly the same way.
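
(As a side note for anyone reproducing this: the start call Pacemaker makes can be replayed by hand with crm_resource, which should behave identically on both nodes; a sketch, using the DBase resource name from the crm_mon output above:)

# run on the node that is supposed to start the resource, with drbd0 mounted and the symlinks in place
crm_resource --resource DBase --force-start --verbose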

drbd0 is mounted by pacemaker, and drbdlinks creates the symbolic links for /var/www and /var/lib/mysql.

I tested this functionality and it seems to work: when node 1 is offline, drbd0 is mounted on node 2 and the symbolic links are created. /var/lib/mysql points to drbd0 and all the files are in the directory.
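
Concretely, these are the kinds of checks that look fine on node 2 while node 1 is offline (a rough sketch; the actual mount point is the one from my pastebin configuration):

cat /proc/drbd                    # node 2 shows itself as Primary
mount | grep drbd0                # /dev/drbd0 is mounted at its configured mount point
ls -ld /var/lib/mysql /var/www    # both are symlinks created by drbdlinks
ls /var/lib/mysql/                # all the MySQL files are in the directory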

If you have any ideas or advice on how to narrow down the cause of this problem, I would be really thankful if you could post them here.

If more information is needed, I am happy to provide it.

Thanks in advance!

Regards,
PAlbrecht

Best Answer

When I have had to work with pacemaker in the past, I have used a few different procedures for troubleshooting this sort of thing. The general idea is to verify each dependency "layer" of the pacemaker config, where the dependency graph is:

mysql -> mounting of filesystem -> DRBD master

Clusters from Scratch also has a good walkthrough of a very similar configuration.
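
In crm syntax that layering usually shows up as a colocation plus an ordering constraint, roughly like this (a sketch using the resource names from the crm_mon output, not necessarily your exact pastebin config):

# the WebServer group must run where DRBD is master, and only after it has been promoted
colocation col_webserver_with_drbd inf: WebServer ms_r0:Master
order ord_drbd_before_webserver inf: ms_r0:promote WebServer:start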

First thing is to make sure that DRBD is configured and synced up. On either node, run:

cat /proc/drbd

The output should show something like the following if DRBD is fully synced and ready for a failover (see p. 45 of CfS):

[root@pcmk-1 ~]# cat /proc/drbd
version: 8.4.6 (api:1/proto:86-101)
GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10
 05:13:52
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:1048508 nr:0 dw:0 dr:1049420 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

If cat /proc/drbd instead outputs something like this (also on p. 45 of CfS):

[root@ovz-node1 ~]# cat /proc/drbd
version: 0.7.17 (api:77/proto:74)
SVN Revision: 2093 build by phil@mescal, 2006-03-06 15:04:12
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:627252 nr:0 dw:0 dr:629812 al:0 bm:38 lo:640 pe:0 ua:640 ap:0
        [=>..................] sync'ed:  6.6% (8805/9418)M
        finish: 0:04:51 speed: 30,888 (27,268) K/sec

then the system isn't in a state where it can successfully fail over. Wait for the sync to complete and then retry your failover test.
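
The sync progress can simply be watched until the sync'ed line disappears and both sides report UpToDate/UpToDate, e.g.:

# re-check /proc/drbd every couple of seconds
watch -n2 cat /proc/drbd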

Assuming DRBD was synced before the simulated failure of node1, the next thing to try, when the DB does not come up on node2 after the failover, is to log in to node2 and check the following:

  • Does cat /proc/drbd show node2 as primary?
  • Does mount show /dev/drbd0 mounted at its configured mount point (from pastebin, this should be '/sync')?
  • Are all your expected symlinks set up?
  • Do you see the same files in /sync on node2 as were present on node1 prior to the failover?

and most importantly, if all these questions are answered in the affirmative:

  • Will MySQL start successfully when started manually on node2 (perhaps using /etc/init.d/mysql start or systemctl equivalent)?
  • If MySQL starts, does the mysql client show that the running server is actually serving up the DB data stored under /sync? Can databases and tables known to be working on node1 be accessed using the mysql client on node2? (A command sketch covering all of these checks follows below.)
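
Put together, the checks above boil down to something like this on node2 (a sketch; paths and resource names taken from the question and the pastebin, adjust to your setup):

cat /proc/drbd                          # the ro: field should start with Primary/
mount | grep /sync                      # /dev/drbd0 mounted at /sync (per the pastebin config)
ls -ld /var/lib/mysql /var/www          # both should be symlinks into the DRBD mount
ls /var/lib/mysql/                      # the MySQL data files from node1 should be here
sudo systemctl start mysql              # or: sudo /etc/init.d/mysql start
sudo systemctl status mysql             # did it come up cleanly?
mysql -u root -p -e 'SHOW DATABASES;'   # are node1's databases visible?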

If MySQL starts up manually, then there's likely something wrong in its pacemaker config.
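
In that case, one way to narrow it down is to exercise the resource agent the same way pacemaker does and read its error output directly (a sketch; ocf-tester ships with the resource-agents package, and the -o parameters should mirror whatever is in your primitive definition):

# exercise the mysql agent directly with the parameters from the primitive definition
ocf-tester -n DBase \
    -o binary="/usr/sbin/mysqld" -o config="/etc/mysql/my.cnf" \
    -o datadir="/var/lib/mysql" \
    /usr/lib/ocf/resource.d/heartbeat/mysql

# and look for the agent's error messages in the cluster log
grep -i 'mysql\|DBase' /var/log/syslog | tail -n 50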

In full disclosure: I haven't personally used the ocf::heartbeat:mysql resource; instead I've used the 'lsb' resource 'lsb:mysql'.
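
For completeness, the lsb variant simply wraps the distribution's init script, so the primitive carries no MySQL-specific parameters at all; a rough sketch (it assumes /etc/init.d/mysql exists and that the data directory already points at the DRBD mount):

# lsb resource: pacemaker calls /etc/init.d/mysql start|stop|status
primitive DBase lsb:mysql \
    op monitor interval="30s" timeout="30s" \
    op start timeout="120s" interval="0" \
    op stop timeout="120s" interval="0"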