Mysql – Fix MySQL group_replication applier module failed to start

MySQLmysql-replication

TL/DR: this question is not about master-slave replication. It is about NEW feature of multi-master group-replication (available starting v5.7.17). We have working workaround, but it takes hours due to large amount of data that needs to be imported twice, so the question is in finding proper way to solve this.

We have setup production MySQL multi-master cluster using group replication (available only 5.7.17 and higher).

However when adding new nodes to this cluster I get the error ERROR 3094 (HY000): The START GROUP_REPLICATION command failed as the applier module failed to start.

Trying to use solutions[1][2] that are used with similar error when setting up conventional master-slave replication do not work. The only working solution I've found is to run reset + source commands TWICE. Which takes hours to complete (the DB is very big). See details below.

The process as we add new node:

On existing node:
# mysqldump --all-databases --triggers --routines --events -u root -p > /home/user/dump.sql
Copy dump to new node
Setup new node completely clean with group replication config (see config example below)
mysql> SET SQL_LOG_BIN=0;
mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'xxxxxxxx' REQUIRE SSL;
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
mysql> FLUSH PRIVILEGES;
mysql> CHANGE MASTER TO MASTER_USER='repl', MASTER_PASSWORD='xxxxxxxx' FOR CHANNEL 'group_replication_recovery';
mysql> INSTALL PLUGIN group_replication SONAME 'group_replication.so';
mysql> SET SQL_LOG_BIN=1; 
mysql> RESET MASTER; #This is required because the server on first start will create new GTID which does not exist on other nodes
mysql> SOURCE /home/user/dump.sql;
mysql> START GROUP_REPLICATION;

In the error log I can see:

2017-07-19T04:00:24.452539Z 7 [ERROR] Failed to open the relay log './hostname1-relay-bin-group_replication_applier.000001' (relay_log_pos 4).
2017-07-19T04:00:24.452549Z 7 [ERROR] Could not find target log file mentioned in relay log info in the index file './hostname1-relay-bin-group_replication_applier.index' during relay log initialization.
2017-07-19T04:00:24.454935Z 7 [ERROR] Plugin group_replication reported: 'Failed to setup the group replication applier thread.'
2017-07-19T04:00:24.454957Z 7 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
2017-07-19T04:00:24.454986Z 4 [ERROR] Plugin group_replication reported: 'Unable to initialize the Group Replication applier module.'
2017-07-19T04:00:24.455347Z 4 [Note] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2017-07-19T04:00:24.455361Z 4 [ERROR] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
2017-07-19T04:00:24.455372Z 0 [Note] Plugin group_replication reported: 'Destroying SSL'
2017-07-19T04:00:24.455380Z 0 [Note] Plugin group_replication reported: 'Success destroying SSL'

Example configuration:

[mysqld]
socket = /var/lib/mysql/mysql.sock
pid-file = /var/run/mysqld/mysqld.pid

log-error = /var/log/mysqld/error.log
symbolic-links = 0
skip_name_resolve

############# GROUP REPLICATION ###############
# General replication settings
gtid_mode = ON
enforce_gtid_consistency = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
binlog_checksum = NONE
log_slave_updates = ON
log_bin = mysql-bin
relay_log = relay-bin
binlog_format = ROW
transaction_write_set_extraction = XXHASH64
loose-group_replication_bootstrap_group = OFF
loose-group_replication_start_on_boot = ON
loose-group_replication_ssl_mode = REQUIRED
loose-group_replication_recovery_use_ssl = 1
loose-group_replication_single_primary_mode = OFF
loose-group_replication_enforce_update_everywhere_checks = ON
group_replication_auto_increment_increment = 3
auto_increment_increment = 3

# Shared replication group configuration
loose-group_replication_group_name = "84c75469-8959-4d4a-beb5-7753684a5161"
loose-group_replication_ip_whitelist = "192.168.25.0/24,127.0.0.0/8"
loose-group_replication_group_seeds = "192.168.25.2:13306,192.168.25.3:13306,192.168.25.4:13306"

# Host specific replication configuration
server_id = 10
bind-address = *
report_host = "192.168.25.5"
loose-group_replication_local_address = "192.168.25.5:13306"
##########  END GROUP REPLICATION ##########

Best Answer

Since loose-group_replication_start_on_boot is enabled at boot. In this case, you may need to stop the replication first and then start the replication. Try the below steps in your setup and let me know if it helps. Also, please post the complete error while this process:

Take the backup of the databases from the other MASTER node and restore it on the current node. Then run the below commands:

mysql> SET SQL_LOG_BIN=0;
mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'xxxxxxxx' REQUIRE SSL;
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
mysql> FLUSH PRIVILEGES;
mysql> CHANGE MASTER TO MASTER_USER='repl', MASTER_PASSWORD='xxxxxxxx' FOR        CHANNEL 'group_replication_recovery';
mysql> INSTALL PLUGIN group_replication SONAME 'group_replication.so';
mysql> SET SQL_LOG_BIN=1; 
mysql> STOP GROUP_REPLICATION;
mysql> START GROUP_REPLICATION;

After this, the new node should come in sync with the other master node. Let me know how it goes for you.

Related Topic