Galera bootstrap fails without errors

galeramariadb

I've been pulling my hair out all weekend trying to get a Galera cluster up and running, using MariaDB 10.1.5 (which includes Galera 25.3.9) on CentOS 7. This server runs two database instances, one only local, one as part of a cluster.

I've done this exact setup many times, without problems, but now I'm clueless…

my.cnf :

[mysqld_multi]
mysqld=/usr/bin/mysqld_safe --basedir=/usr
user=mysql

#
# Server local DB
#
[mysqld1]
mysqld=/usr/bin/mysqld_safe --basedir=/usr
user=mysql
port=3307
pid-file=/storage/local/mysqld.pid
socket=/storage/local/mysqld.sock
datadir=/storage/local
log-error=/var/log/mariadb-local.log

#
# Server cluster DB
#
[mysqld2]
wsrep_new_cluster
wsrep_cluster_address=gcomm://
#wsrep_cluster_address="gcomm://172.16.1.91,172.16.1.94,172.16.1.95"
wsrep_debug=1
query_cache_size=0
log-bin=/var/log/mariadb/cluster-bin.log
binlog_format=ROW
default_storage_engine=innodb
innodb_autoinc_lock_mode=2
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name="MyApps"
wsrep_node_address=172.16.1.91
wsrep_node_name="dbnode1"
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth=sstuser:sstpass
mysqld=/usr/bin/mysqld_safe --basedir=/usr
user=mysql
port=3306
pid-file=/storage/cluster/mysqld.pid
socket=/storage/cluster/mysqld.sock
datadir=/storage/cluster
log-error=/var/log/mariadb-cluster.log

When starting this node I get this in the log:

2015-08-02 20:52:52 mysqld_safe Starting mysqld daemon with databases from /storage/cluster
2015-08-02 20:52:52 140344148072576 [Note] /usr/sbin/mysqld (mysqld 10.1.5-MariaDB-log) starting as process 11301 ...
2015-08-02 20:52:52 140344148072576 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2015-08-02 20:52:52 140344148072576 [Note] InnoDB: The InnoDB memory heap is disabled
2015-08-02 20:52:52 140344148072576 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-08-02 20:52:52 140344148072576 [Note] InnoDB: Memory barrier is not used
2015-08-02 20:52:52 140344148072576 [Note] InnoDB: Compressed tables use zlib 1.2.7
2015-08-02 20:52:52 140344148072576 [Note] InnoDB: Using Linux native AIO
2015-08-02 20:52:52 140344148072576 [Note] InnoDB: Not using CPU crc32 instructions
2015-08-02 20:52:53 140344148072576 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2015-08-02 20:52:53 140344148072576 [Note] InnoDB: Completed initialization of buffer pool
2015-08-02 20:52:53 140344148072576 [Note] InnoDB: Highest supported file format is Barracuda.
2015-08-02 20:52:53 140344148072576 [Note] Plugin 'FEEDBACK' is disabled.
2015-08-02 20:52:53 140233078593664 [Note] Recovering after a crash using tc.log
2015-08-02 20:52:53 140233078593664 [Note] Starting crash recovery...
2015-08-02 20:52:53 140233078593664 [Note] Crash recovery finished.
2015-08-02 20:52:53 140232389265152 [Note] InnoDB: not started
2015-08-02 20:52:53 140344148072576 [Note] Server socket created on IP: '::'.
2015-08-02 20:52:53 140344148072576 [Note] Event Scheduler: Loaded 0 events
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Setting wsrep_ready to 0
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Read WSREPXid from InnoDB:  00000000-0000-0000-0000-000000000000:-1
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Read nil XID from storage engines, skipping position init
2015-08-02 20:52:53 140344148072576 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2015-08-02 20:52:53 140344148072576 [Note] WSREP: wsrep_load(): Galera 25.3.9(r3387) by Codership Oy <info@codership.com> loaded successfully.
2015-08-02 20:52:53 140344148072576 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Passing config to GCS: base_host = 172.16.1.91; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /storage/cluster/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /storage/cluster//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.reco
2015-08-02 20:52:53 140343261148928 [Note] WSREP: Service thread queue flushed.
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Start replication
2015-08-02 20:52:53 140344148072576 [Note] WSREP: 'wsrep-new-cluster' option used, bootstrapping the cluster
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2015-08-02 20:52:53 140344148072576 [Note] WSREP: protonet asio version 0
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Using CRC-32C for message checksums.
2015-08-02 20:52:53 140344148072576 [Note] WSREP: backend: asio
2015-08-02 20:52:53 140344148072576 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
2015-08-02 20:52:53 140344148072576 [Note] WSREP: restore pc from disk failed
2015-08-02 20:52:53 140344148072576 [Note] WSREP: GMCast version 0
2015-08-02 20:52:53 140344148072576 [Note] WSREP: (71c20f80, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2015-08-02 20:52:53 140344148072576 [Note] WSREP: (71c20f80, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2015-08-02 20:52:53 140344148072576 [Note] WSREP: EVS version 0
2015-08-02 20:52:53 140344148072576 [Note] WSREP: gcomm: bootstrapping new group 'MyApps'
2015-08-02 20:52:53 140344148072576 [Note] WSREP: start_prim is enabled, turn off pc_recovery
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Node 71c20f80 state prim
2015-08-02 20:52:53 140344148072576 [Note] WSREP: view(view_id(PRIM,71c20f80,1) memb {
    71c20f80,0
} joined {
} left {
} partitioned {
})
2015-08-02 20:52:53 140344148072576 [Note] WSREP: save pc into disk
2015-08-02 20:52:53 140344148072576 [Note] WSREP: gcomm: connected
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2015-08-02 20:52:53 140344148072576 [Note] WSREP: Opened channel 'MyApps'
2015-08-02 20:52:53 140343215384320 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2015-08-02 20:52:53 140343215384320 [Note] WSREP: Starting new group from scratch: 71c26ca7-3958-11e5-aa0e-92c6feb108bf
2015-08-02 20:52:53 140343215384320 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 71c2791f-3958-11e5-9fcc-52d3a70422f3
2015-08-02 20:52:53 140343215384320 [Note] WSREP: STATE EXCHANGE: sent state msg: 71c2791f-3958-11e5-9fcc-52d3a70422f3
2015-08-02 20:52:53 140343215384320 [Note] WSREP: STATE EXCHANGE: got state msg: 71c2791f-3958-11e5-9fcc-52d3a70422f3 from 0 (dbnode1)
2015-08-02 20:52:53 140343215384320 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 0,
    members    = 1/1 (joined/total),
    act_id     = 0,
    last_appl. = -1,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = 71c26ca7-3958-11e5-aa0e-92c6feb108bf
2015-08-02 20:52:53 140343215384320 [Note] WSREP: Flow-control interval: [16, 16]
2015-08-02 20:52:53 140343215384320 [Note] WSREP: Restored state OPEN -> JOINED (0)
2015-08-02 20:52:53 140343215384320 [Note] WSREP: Member 0.0 (dbnode1) synced with group.
2015-08-02 20:52:53 140343215384320 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
2015-08-02 20:52:53 140344147060480 [Note] WSREP: New cluster view: global state: 71c26ca7-3958-11e5-aa0e-92c6feb108bf:0, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3
2015-08-02 20:52:53 140344148072576 [Note] /usr/sbin/mysqld: ready for connections.

This suggests everything is fine, but a check of the wsrep variables shows that WSREP_READY = OFF, and any attempt to access the engine returns

ERROR 1047 (08S01): WSREP has not yet prepared node for application use

The logs looks fine, no errors (except for the crash recovery, which happens because shutdown takes too long, probably because something is hanging), the only thing that worries me is

WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1

Which I find odd, I'm bootstrapping the cluster, should this not be bootstrap = yes?

Best Answer

The systemd unit files that come with mariadb 10.1 don't support bootstraping galera yet.

As a workaround you can use:

[root@galera01 mysql]# systemctl set-environment MYSQLD_OPTS="--wsrep-new-cluster"
[root@galera01 mysql]# systemctl start mariadb
[root@galera01 mysql]# systemctl unset-environment MYSQLD_OPTS

This is the mariadb bug tracking the issue: https://mariadb.atlassian.net/browse/MDEV-7752