Mysql – MariaDB replication not auto reconnecting

mariadbMySQLreplication

I've got a simple master-slave replication setup amongst two MariaDB 5.5 servers.
The connection between the servers is somewhat flaky, causing disconnections every now and then.

However "master_retry_count=0" is set in my.ini on the slave, which, to my understanding, means that the slave should keep trying to reconnect to the master whenever a disconnection occurs, so there really shouldn't be any major issues with a flaky connection.

I've experienced replication stopping, however, a few times with the following error:

The slave I/O thread stops because SET @master_heartbeat_period on master failed. Error: Lost connection to MySQL server during query

When I've noticed this, replication has often been in the stopped state for hours, and simply issuing a "START SLAVE" query makes things fine again, but I'm not understanding why this issue occurs in the first place.

Would anyone happen to have any experience or be able to shed some light on this? Is there any way to ensure that replication doesn't stop because of a temporary connection issue?

Here's a sample log with the error:

130710  2:42:29 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
130710  2:42:30 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000341' at position 20051884
130710  2:56:26 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
130710  2:56:26 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000341' at position 20523725
130710  2:57:04 [ERROR] Slave I/O: error reconnecting to master 'rep_user@master_server:3306' - retry-time: 60  retries: 0  message: Lost connection to MySQL server at 'reading au
thorization packet', system error: 0, Error_code: 2013
130710  2:58:05 [Note] Slave: connected to master 'rep_user@master_server:3306',replication resumed in log 'mysql-bin.000341' at position 20523725
130710  3:26:31 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
130710  3:26:31 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000341' at position 20893274
130710  3:27:48 [ERROR] Slave I/O: error reconnecting to master 'rep_user@master_server:3306' - retry-time: 60  retries: 0  message: Lost connection to MySQL server at 'reading authorization packet', system error: 0, Error_code: 2013
130710  3:31:01 [Note] Slave: connected to master 'rep_user@master_server:3306',replication resumed in log 'mysql-bin.000341' at position 20893274
130710  3:32:19 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
130710  3:32:19 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000341' at position 21083757
130710  3:48:03 [ERROR] Slave I/O: The slave I/O thread stops because SET @master_heartbeat_period on master failed. Error: Lost connection to MySQL server during query, Error_code: 1593
130710  3:48:03 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000341', position 21083757

(from the log above, the connection is indeed very bad; it's usually not this bad but it happens from time to time. Fixing the connection is obviously ideal, but I'd also prefer that replication shouldn't stop because of this)

Possibly relevant my.ini settings on slave:

master_retry_count=0
slave_sql_verify_checksum=0
slave_net_timeout=1800
net_retry_count=10

I don't know much about the heartbeat, so I've never set it, but according to the status, the heartbeat period is 900 seconds.

Thanks for reading.

Best Answer

Looks like this is a bug in MariaDB and MySQL, fixed in MySQL 5.6.

The fix in MariaDB will be tracked here: https://mariadb.atlassian.net/browse/MDEV-6188