Google Cloud SQL – Fixing MySQL Replica Instances Stuck on ‘System Lock’ State

databasegoogle-cloud-sqlinnodbMySQLreplication

Let me explain the problem in detail,

We have a Google cloud SQL setup(1 master, 1 failover replica and a read replica). From the last two days, we are facing replication delays on both the replica instances which is constantly increasing.(It's up to 16h as I write).

On pulling the logs from the replica instances, we see slave SQL and slave I/O thread being killed very frequently,

2019-07-15T12:35:19.181804Z 1025650 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'mysql-bin.068343' at position 62535096
2019-07-15T12:35:19.184434Z 1025649 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql-bin.068473', position 63572825

Please find the relevant slave status info and processlist output below.

show slave status;

         Slave_IO_State: Waiting for master to send event
         Master_Log_File: mysql-bin.068826
         Read_Master_Log_Pos: 21806289
         Relay_Log_File: relay-log.000025
         Relay_Log_Pos: 16457199
         Relay_Master_Log_File: mysql-bin.068600
         Slave_IO_Running: Yes
         Slave_SQL_Running: Yes
         Seconds_Behind_Master: 52371
         Slave_SQL_Running_State: System lock
         Master_Retry_Count: 86400
show processlist;

| 1504576 | system user         | | NULL            | Connect |  3288 | Waiting for master to send event | NULL                                                                                                 |
| 1504577 | system user         | | NULL            | Connect | 52623 | System lock                      | NULL 

Slave related global variables.

| binlog_cache_size                                        | 32768
| binlog_checksum                                          | CRC32
| binlog_direct_non_transactional_updates                  | OFF
| binlog_error_action                                      | ABORT_SERVER
| binlog_format                                            | ROW
| binlog_group_commit_sync_delay                           | 0
| binlog_group_commit_sync_no_delay_count                  | 0
| binlog_gtid_simple_recovery                              | ON
| binlog_max_flush_queue_time                              | 0
| binlog_order_commits                                     | ON
| binlog_row_image                                         | FULL
| binlog_rows_query_log_events                             | OFF
| binlog_stmt_cache_size                                   | 32768
| slave_allow_batching                                     | OFF
| slave_checkpoint_group                                   | 512
| slave_checkpoint_period                                  | 300
| slave_compressed_protocol                                | OFF
| slave_exec_mode                                          | STRICT
| slave_load_tmpdir                                        | /mysql/tmp
| slave_max_allowed_packet                                 | 1073741824
| slave_net_timeout                                        | 30
| slave_parallel_type                                      | DATABASE
| slave_parallel_workers                                   | 0
| slave_pending_jobs_size_max                              | 16777216
| slave_preserve_commit_order                              | OFF
| slave_rows_search_algorithms                             | TABLE_SCAN,INDEX_SCAN
| slave_skip_errors                                        | OFF
| slave_sql_verify_checksum                                | ON
| slave_transaction_retries                                | 10
| slave_type_conversions                                   |

All three instances have 8 vCPU cores, 30GB RAM, and around 550GB SSD Storage, and are cloud SQL MySQL 2nd generation instances(MySQL version 5.7).
Master has a very stable CPU usage pattern at around 40% usage, and both the failover and read replica are at around 60% usage.

Does anybody have any idea why the Slave SQL thread is constantly in the "System lock" stage and not proceeding with the replication? Any pointers would be great!

Best Answer

Google Cloud Platform Support here!

Could you please confirm the QPS ratio on your master instance? If the master instance gets overloaded with queries, when the CPU usage goes beyond the maximum than it can process, row-based replication helps ease the load but as replication is single-threaded the replica gets blocked.

One good practice would be resharding your master instance to 2 or more instances to split the query load. You can refer to more best practices here.

To further assist you, please refer to the following link in which you can create a case at the public issue tracker. Doing so, we will be able to further investigate it with our internal tools and provide you with more meaningful information.

Related Topic