Mysql – Percona MySQL 5.7 Replication error 1236 “unknown error reading log event on the master”

Tags: mysql, mysql-replication

We are running two instances of Percona Server 5.7.20-18-log (in the meantime updated to 5.7.21-21-log) in a master-master replication setup (dbm1 / dbm2). The application uses only dbm1 for reads and writes, so dbm2 serves purely as a cold standby.
For the past few weeks, replication on dbm2 has been stopping several times per week with the error message "unknown error reading log event on the master".

Here is the full slave status (dbm2) from the last incident:

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 10.6.233.101
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.007091
          Read_Master_Log_Pos: 71137
               Relay_Log_File: mysqld-relay-bin.001603
                Relay_Log_Pos: 71350
        Relay_Master_Log_File: mysql-bin.007091
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 71137
              Relay_Log_Space: 71605
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'unknown error reading log event on the master; the first event 'mysql-bin.007003' at 43445, the last event read from './mysql-bin.007091' at 71137, the last byte read from './mysql-bin.007091' at 71137.'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
                  Master_UUID: 7cb8eccb-e1a0-11e7-9d96-545edb2e572a
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 180308 06:41:25
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

So far we have addressed this with a full restore of dbm2 using XtraBackup (a rough sketch of that procedure is shown below), but after a few days the problem on dbm2 reappears.
Our monitoring shows that both servers have plenty of resources left (and no I/O wait); what we do see is that the error often occurs while the application is running its import jobs.
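
For reference, such a rebuild looks roughly like this – a condensed sketch only, with placeholder credentials and backup paths; the exact options depend on the XtraBackup version in use:

# on dbm1 (master): take and prepare a full backup with Percona XtraBackup 2.4
xtrabackup --backup --user=backup --password=*** --target-dir=/backup/full
xtrabackup --prepare --target-dir=/backup/full

# copy /backup/full to dbm2, stop mysqld there and move the old datadir aside,
# then restore the backup and fix ownership
xtrabackup --copy-back --target-dir=/backup/full
chown -R mysql:mysql /var/lib/mysql

# on dbm2: point replication at the coordinates recorded in xtrabackup_binlog_info
mysql> CHANGE MASTER TO MASTER_HOST='10.6.233.101', MASTER_USER='replicator',
    ->   MASTER_PASSWORD='***', MASTER_LOG_FILE='mysql-bin.0nnnnn', MASTER_LOG_POS=nnn;
mysql> START SLAVE;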

Both servers are located in the same datacenter and communicate over the internal network.

More details:

OS: Ubuntu 14.04.5 LTS
Kernel: OpenVZ VPS (Kernel 2.6.32-042stab127.2)
CPU: 8 cores Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
RAM: 16 GB
Percona: 5.7.20-18-log Percona Server (GPL), Release '18', Revision '7ce54a6deef'

my.cnf:

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[isamchk]
key_buffer_size = 16M

[mysqld]
basedir = /usr
bind_address = *
binlog_cache_size = 1M
binlog_format = mixed
bulk_insert_buffer_size = 64M
datadir = /var/lib/mysql
expire_logs_days = 2
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON
innodb_buffer_pool_size = 12G
innodb_doublewrite = OFF
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity = 1000
innodb_log_file_size = 256M
innodb_read_io_threads = 8
innodb_write_io_threads = 8
join_buffer_size = 244K
key_buffer_size = 100M
log-bin = mysql-bin
log-error = /var/log/mysql/error.log
log_slow_verbosity = full
long_query_time = 1
max_allowed_packet = 16M
max_binlog_files = 200
max_binlog_size = 100M
max_connections = 500
max_heap_table_size = 128M
max_relay_log_size = 256M
max_slowlog_files = 1
max_slowlog_size = 4G
myisam_sort_buffer_size = 64M
optimizer_switch = index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=off,block_nested_loop=off,batched_key_access=on,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on
pid-file = /var/run/mysqld/mysqld.pid
port = 3306
query_cache_limit = 1M
query_cache_size = 128M
query_cache_type = ON
read_buffer_size = 244K
relay_log = mysqld-relay-bin
server_id = 1
skip-external-locking
skip_name_resolve = ON
slow_query_log = ON
slow_query_log_file = /var/log/mysql/mysqld.slowlog
socket = /var/run/mysqld/mysqld.sock
sort_buffer_size = 2047K
sql_mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
ssl = false
ssl-ca = /etc/mysql/cacert.pem
ssl-cert = /etc/mysql/server-cert.pem
ssl-key = /etc/mysql/server-key.pem
thread_cache_size = 8
thread_stack = 256K
tmp_table_size = 128M
tmpdir = /tmp
user = mysql
userstat = ON

[mysqld-5.0]
myisam-recover = BACKUP

[mysqld-5.1]
myisam-recover = BACKUP

[mysqld-5.5]
myisam-recover = BACKUP

[mysqld-5.6]
myisam-recover-options = BACKUP

[mysqld-5.7]
myisam-recover-options = BACKUP

[mysqld_safe]
log-error = /var/log/mysql/error.log
nice = 0
socket = /var/run/mysqld/mysqld.sock

[mysqldump]
max_allowed_packet = 16M
quick
quote-names

Any ideas on how to mitigate the root cause of this type of replication error?

Edit to answer questions from @wilson-hauck:

MySQLTuner report from dbm1

>>  MySQLTuner 1.7.9 - Major Hayden <major@mhtx.net>
>>  Bug reports, feature requests, and downloads at mysqltuner.com
>>  Run with '--help' for additional options and output filtering

[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.7.20-18-log
[OK] Operating on 64-bit architecture

-------- Log file Recommendations ------------------------------------------------------------------
[--] Log file: /var/log/mysql/error.log(0B)
[OK] Log file /var/log/mysql/error.log exists
[OK] Log file /var/log/mysql/error.log is readable.
[!!] Log file /var/log/mysql/error.log is empty
[OK] Log file /var/log/mysql/error.log is smaller than 32 Mb
[OK] /var/log/mysql/error.log doesn't contain any warning.
[OK] /var/log/mysql/error.log doesn't contain any error.
[--] 0 start(s) detected in /var/log/mysql/error.log
[--] 0 shutdown(s) detected in /var/log/mysql/error.log

-------- Storage Engine Statistics -----------------------------------------------------------------
[--] Status: +ARCHIVE +BLACKHOLE +CSV -FEDERATED +InnoDB +MEMORY +MRG_MYISAM +MyISAM +PERFORMANCE_SCHEMA 
[--] Data in MEMORY tables: 46M (Tables: 2)
[--] Data in MyISAM tables: 9G (Tables: 368)
[--] Data in InnoDB tables: 12G (Tables: 131)
[OK] Total fragmented tables: 0

-------- Security Recommendations ------------------------------------------------------------------
[OK] There are no anonymous accounts for any database users
[OK] All database users have passwords assigned
[!!] There is no basic password file list!

-------- CVE Security Recommendations --------------------------------------------------------------
[--] Skipped due to --cvefile option undefined

-------- Performance Metrics -----------------------------------------------------------------------
[--] Up for: 6d 19h 49m 31s (3B q [5K qps], 7M conn, TX: 5877G, RX: 1160G)
[--] Reads / Writes: 94% / 6%
[--] Binary logging is enabled (GTID MODE: OFF)
[--] Physical Memory     : 16.0G
[--] Max MySQL memory    : 13.8G
[--] Other process memory: 160.2M
[--] Total buffers: 12.4G global + 3.0M per thread (500 max threads)
[--] P_S Max memory usage: 72B
[--] Galera GCache Max memory usage: 0B
[OK] Maximum reached memory usage: 13.1G (82.10% of installed RAM)
[!!] Maximum possible memory usage: 13.8G (86.35% of installed RAM)
[OK] Overall possible memory usage with other process is compatible with memory available
[OK] Slow queries: 0% (31K/3B)
[OK] Highest usage of available connections: 53% (266/500)
[OK] Aborted connections: 0.13%  (9830/7378007)
[!!] Query cache may be disabled by default due to mutex contention.
[OK] Query cache efficiency: 53.3% (1B cached / 3B selects)
[!!] Query cache prunes per day: 93123042
[OK] Sorts requiring temporary tables: 0% (261K temp sorts / 112M sorts)
[OK] No joins without indexes
[OK] Temporary tables created on disk: 12% (4M on disk / 35M total)
[OK] Thread cache hit rate: 88% (883K created / 7M connections)
[!!] Table cache hit rate: 0% (2K open / 360K opened)
[OK] Open file limit used: 28% (1K/5K)
[OK] Table locks acquired immediately: 99% (1B immediate / 1B locks)
[OK] Binlog cache memory access: 100.00% (47747694 Memory / 47748535 Total)

-------- Performance schema ------------------------------------------------------------------------
[--] Memory used by P_S: 72B
[--] Sys schema is installed.

-------- ThreadPool Metrics ------------------------------------------------------------------------
[--] ThreadPool stat is enabled.
[--] Thread Pool Size: 8 thread(s).
[!!] thread_pool_size between 16 and 36 when using InnoDB storage engine.

-------- MyISAM Metrics ----------------------------------------------------------------------------
[OK] Key buffer used: 100.0% (104M used / 104M cache)
[OK] Key buffer size / total MyISAM indexes: 100.0M/3.0G
[OK] Read Key buffer hit rate: 99.6% (41B cached / 152M reads)
[!!] Write Key buffer hit rate: 14.2% (1B cached / 242M writes)

-------- InnoDB Metrics ----------------------------------------------------------------------------
[--] InnoDB is enabled.
[--] InnoDB Thread Concurrency: 0
[OK] InnoDB File per table is activated
[!!] InnoDB buffer pool / data size: 12.0G/12.8G
[!!] Ratio InnoDB log file size / InnoDB Buffer pool size (4.16666666666667 %): 256.0M * 2/12.0G should be equal 25%
[!!] InnoDB buffer pool instances: 8
[--] Number of InnoDB Buffer Pool Chunk : 96 for 8 Buffer Pool Instance(s)
[OK] Innodb_buffer_pool_size aligned with Innodb_buffer_pool_chunk_size & Innodb_buffer_pool_instances
[OK] InnoDB Read buffer efficiency: 100.00% (55125068296 hits/ 55125632976 total)
[OK] InnoDB Write log efficiency: 92.59% (140220341 hits/ 151440226 total)
[OK] InnoDB log waits: 0.00% (0 waits / 11219885 writes)

-------- AriaDB Metrics ----------------------------------------------------------------------------
[--] AriaDB is disabled.

-------- TokuDB Metrics ----------------------------------------------------------------------------
[--] TokuDB is disabled.

-------- XtraDB Metrics ----------------------------------------------------------------------------
[--] XtraDB is disabled.

-------- RocksDB Metrics ---------------------------------------------------------------------------
[--] RocksDB is disabled.

-------- Spider Metrics ----------------------------------------------------------------------------
[--] Spider is disabled.

-------- Connect Metrics ---------------------------------------------------------------------------
[--] Connect is disabled.

-------- Galera Metrics ----------------------------------------------------------------------------
[--] Galera is disabled.

-------- Replication Metrics -----------------------------------------------------------------------
[--] Galera Synchronous replication: NO
[--] No replication slave(s) for this server.
[--] Binlog format: MIXED
[--] XA support enabled: ON
[--] Semi synchronous replication Master: Not Activated
[--] Semi synchronous replication Slave: Not Activated
[!!] This replication slave is running with the read_only option disabled.
[OK] This replication slave is up to date with master.

-------- Recommendations ---------------------------------------------------------------------------
General recommendations:
   Reduce your overall MySQL memory footprint for system stability
   Increasing the query_cache size over 128M may reduce performance
   Increase table_open_cache gradually to avoid file descriptor limits
   Read this before increasing table_open_cache over 64:
   This is MyISAM only table_cache scalability problem, InnoDB not affected.
   See more details here: 
   This bug already fixed in MySQL 5.7.9 and newer MySQL versions.
   Beware that open_files_limit (5000) variable 
   should be greater than table_open_cache (2000)
   Thread pool size for InnoDB usage (8)
   Read this before changing innodb_log_file_size and/or innodb_log_files_in_group: 
Variables to adjust:
   query_cache_size (=0)
   query_cache_type (=0)
   query_cache_size (> 128M) [see warning above]
   table_open_cache (> 2000)
   thread_pool_size between 16 and 36 for InnoDB usage
   innodb_buffer_pool_size (>= 12G) if possible.
   innodb_log_file_size should be (=1G) if possible, so InnoDB total log files size equals to 25% of buffer pool size.
   innodb_buffer_pool_instances(=12)

MySQLTuner report from dbm2

>>  MySQLTuner 1.7.9 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at mysqltuner.com
 >>  Run with '--help' for additional options and output filtering

[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.7.20-18-log
[OK] Operating on 64-bit architecture

-------- Log file Recommendations ------------------------------------------------------------------
[--] Log file: /var/log/mysql/error.log(0B)
[OK] Log file /var/log/mysql/error.log exists
[OK] Log file /var/log/mysql/error.log is readable.
[!!] Log file /var/log/mysql/error.log is empty
[OK] Log file /var/log/mysql/error.log is smaller than 32 Mb
[OK] /var/log/mysql/error.log doesn't contain any warning.
[OK] /var/log/mysql/error.log doesn't contain any error.
[--] 0 start(s) detected in /var/log/mysql/error.log
[--] 0 shutdown(s) detected in /var/log/mysql/error.log

-------- Storage Engine Statistics -----------------------------------------------------------------
[--] Status: +ARCHIVE +BLACKHOLE +CSV -FEDERATED +InnoDB +MEMORY +MRG_MYISAM +MyISAM +PERFORMANCE_SCHEMA 
[--] Data in InnoDB tables: 12G (Tables: 131)
[--] Data in MEMORY tables: 46M (Tables: 2)
[--] Data in MyISAM tables: 9G (Tables: 368)
[!!] Total fragmented tables: 2

-------- Security Recommendations ------------------------------------------------------------------
[OK] There are no anonymous accounts for any database users
[OK] All database users have passwords assigned
[!!] There is no basic password file list!

-------- CVE Security Recommendations --------------------------------------------------------------
[--] Skipped due to --cvefile option undefined

-------- Performance Metrics -----------------------------------------------------------------------
[--] Up for: 6d 21h 4m 15s (89K q [0.150 qps], 39K conn, TX: 325M, RX: 52G)
[--] Reads / Writes: 0% / 100%
[--] Binary logging is enabled (GTID MODE: OFF)
[--] Physical Memory     : 20.0G
[--] Max MySQL memory    : 13.8G
[--] Other process memory: 166.5M
[--] Total buffers: 12.4G global + 3.0M per thread (500 max threads)
[--] P_S Max memory usage: 72B
[--] Galera GCache Max memory usage: 0B
[OK] Maximum reached memory usage: 12.4G (61.86% of installed RAM)
[OK] Maximum possible memory usage: 13.8G (69.08% of installed RAM)
[OK] Overall possible memory usage with other process is compatible with memory available
[OK] Slow queries: 2% (1K/89K)
[OK] Highest usage of available connections: 0% (3/500)
[!!] Aborted connections: 24.95%  (9904/39702)
[!!] Query cache may be disabled by default due to mutex contention.
[!!] Query cache efficiency: 0.0% (0 cached / 29K selects)
[OK] Query cache prunes per day: 0
[!!] Sorts requiring temporary tables: 7837% (26K temp sorts / 342 sorts)
[OK] No joins without indexes
[OK] Temporary tables created on disk: 0% (392 on disk / 1M total)
[OK] Thread cache hit rate: 99% (3 created / 39K connections)
[!!] Table cache hit rate: 14% (1K open / 7K opened)
[OK] Open file limit used: 13% (657/5K)
[OK] Table locks acquired immediately: 100% (30M immediate / 30M locks)
[OK] Binlog cache memory access: 0% (0 Memory / 0 Total)

-------- Performance schema ------------------------------------------------------------------------
[--] Memory used by P_S: 72B
[--] Sys schema is installed.

-------- ThreadPool Metrics ------------------------------------------------------------------------
[--] ThreadPool stat is enabled.
[--] Thread Pool Size: 8 thread(s).
[!!] thread_pool_size between 16 and 36 when using InnoDB storage engine.

-------- MyISAM Metrics ----------------------------------------------------------------------------
[OK] Key buffer used: 99.9% (104M used / 104M cache)
[OK] Key buffer size / total MyISAM indexes: 100.0M/3.0G
[OK] Read Key buffer hit rate: 98.8% (6B cached / 73M reads)
[!!] Write Key buffer hit rate: 14.9% (1B cached / 186M writes)

-------- InnoDB Metrics ----------------------------------------------------------------------------
[--] InnoDB is enabled.
[--] InnoDB Thread Concurrency: 0
[OK] InnoDB File per table is activated
[!!] InnoDB buffer pool / data size: 12.0G/12.7G
[!!] Ratio InnoDB log file size / InnoDB Buffer pool size (4.16666666666667 %): 256.0M * 2/12.0G should be equal 25%
[!!] InnoDB buffer pool instances: 8
[--] Number of InnoDB Buffer Pool Chunk : 96 for 8 Buffer Pool Instance(s)
[OK] Innodb_buffer_pool_size aligned with Innodb_buffer_pool_chunk_size & Innodb_buffer_pool_instances
[OK] InnoDB Read buffer efficiency: 99.99% (2954768806 hits/ 2955078519 total)
[OK] InnoDB Write log efficiency: 95.93% (111839144 hits/ 116582132 total)
[OK] InnoDB log waits: 0.00% (0 waits / 4742988 writes)

-------- AriaDB Metrics ----------------------------------------------------------------------------
[--] AriaDB is disabled.

-------- TokuDB Metrics ----------------------------------------------------------------------------
[--] TokuDB is disabled.

-------- XtraDB Metrics ----------------------------------------------------------------------------
[--] XtraDB is disabled.

-------- RocksDB Metrics ---------------------------------------------------------------------------
[--] RocksDB is disabled.

-------- Spider Metrics ----------------------------------------------------------------------------
[--] Spider is disabled.

-------- Connect Metrics ---------------------------------------------------------------------------
[--] Connect is disabled.

-------- Galera Metrics ----------------------------------------------------------------------------
[--] Galera is disabled.

-------- Replication Metrics -----------------------------------------------------------------------
[--] Galera Synchronous replication: NO
[--] This server is acting as master for 1 server(s).
[--] Binlog format: MIXED
[--] XA support enabled: ON
[--] Semi synchronous replication Master: Not Activated
[--] Semi synchronous replication Slave: Not Activated
[!!] This replication slave is not running but seems to be configured.

-------- Recommendations ---------------------------------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
      OPTIMIZE TABLE `REDACTED`.`REDACTED`; -- can free 191.703922271729 MB
      OPTIMIZE TABLE `REDACTED`.`REDACTED`; -- can free 143.067314147949 MB
    Total freed space after theses OPTIMIZE TABLE : 334.771236419678 Mb
    Reduce or eliminate unclosed connections and network issues
    Increase table_open_cache gradually to avoid file descriptor limits
    Read this before increasing table_open_cache over 64:
    This is MyISAM only table_cache scalability problem, InnoDB not affected.
    See more details here: 
    This bug already fixed in MySQL 5.7.9 and newer MySQL versions.
    Beware that open_files_limit (5000) variable 
    should be greater than table_open_cache (2000)
    Thread pool size for InnoDB usage (8)
    Read this before changing innodb_log_file_size and/or innodb_log_files_in_group: 
Variables to adjust:
    query_cache_size (=0)
    query_cache_type (=0)
    query_cache_limit (> 1M, or use smaller result sets)
    sort_buffer_size (> 1M)
    read_rnd_buffer_size (> 256K)
    table_open_cache (> 2000)
    thread_pool_size between 16 and 36 for InnoDB usage
    innodb_buffer_pool_size (>= 12G) if possible.
    innodb_log_file_size should be (=1G) if possible, so InnoDB total log files size equals to 25% of buffer pool size.
    innodb_buffer_pool_instances(=12)

ulimits of dbm1

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514886
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514886
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

ulimits of dbm2

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030587
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1030587
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

SHOW GLOBAL STATUS @ dbm1: https://gist.github.com/adriankirchner/cfffedfcc0699798e6c9276b09094ee1

SHOW GLOBAL STATUS @ dbm2: https://gist.github.com/adriankirchner/b021d8943f59c9a2448caf5dba6f89af

SHOW GLOBAL VARIABLES @ dbm1: https://gist.github.com/adriankirchner/b644072f063f80c2f41c604d5551eb18

SHOW GLOBAL VARIABLES @ dbm2: https://gist.github.com/adriankirchner/fa0ce51eb30a7368a30242fc9acba93b

Unfortunately iostat does not work inside OpenVZ containers.
Just to be on the safe side, I increased the RAM on both containers from 16 GB to 20 GB.

The monitoring history for replication failures can be found here: https://i.stack.imgur.com/18Dff.png

Update 2018-04-30
Wilson provided us with a number of performance tweaks, which we are slowly rolling out to this configuration – thank you very much for that!
In the meantime we have updated both servers to Percona 5.7.21-21-log, but the replication errors still occur.
When I dig into the binlog position given in the error message ([..] the last event read from './mysql-bin.007091' at 71137, the last byte read from './mysql-bin.007091' at 71137.), this is the statement causing the error:

# at 71049
#180308  6:41:25 server id 1  end_log_pos 71137 CRC32 0x8bbeaab8    Query   thread_id=61875722  exec_time=0 error_code=0
SET TIMESTAMP=1520487685/*!*/;
COMMIT
/*!*/;
# at 71137 <---- position from the error message
#180308  6:41:25 server id 1  end_log_pos 71202 CRC32 0x87513662    Anonymous_GTID  last_committed=140  sequence_number=141 rbr_only=no
SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/;
# at 71202
#180308  6:41:25 server id 1  end_log_pos 71289 CRC32 0xda97e104    Query   thread_id=61875722  exec_time=0 error_code=0
SET TIMESTAMP=1520487685/*!*/;
BEGIN
/*!*/;
# at 71289
#180308  6:41:25 server id 1  end_log_pos 71631 CRC32 0x7c566dd1    Query   thread_id=61875722  exec_time=0 error_code=0
SET TIMESTAMP=1520487685/*!*/;
update some_tables set some_col = 'some_value' where some_pk = 'some_value' <-- REDACTED
/*!*/;
# at 71631
#180308  6:41:25 server id 1  end_log_pos 71719 CRC32 0x37a2a3a9    Query   thread_id=61875722  exec_time=0 error_code=0
SET TIMESTAMP=1520487685/*!*/;
COMMIT
/*!*/;

So the statement causing the error is SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/; – I verified this against the binlog of another replication outage. There are very few reports on this:
https://bugs.mysql.com/bug.php?id=85480
https://groups.google.com/forum/#!topic/percona-discussion/u1k85oY5DB8

Both of those reports are operating with GTID_MODE=ON (we use GTID_MODE=OFF) and show a totally different error message in the slave status.
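
For completeness, GTID mode can be verified on both servers with a quick query such as:

mysql> SELECT @@GLOBAL.GTID_MODE, @@GLOBAL.ENFORCE_GTID_CONSISTENCY;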

Best Answer

This error is usually caused by a hardware glitch or a bug so obscure that it hasn't been found in over a decade. You haven't mentioned anything about the hardware, so the obvious first question to ask is whether it has ECC memory, whether you are seeing any MCE errors in your syslog, or whether sensors show overheating on any of the components (a quick sketch of such checks follows). If you are always seeing this on the same server(s) (but not ALL servers), then failing hardware becomes a very likely explanation.
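
A hedged sketch of such checks – log paths and tool availability vary by distribution, and on an OpenVZ guest some of these may only work on the hardware node:

# machine check exceptions reported by the kernel
dmesg | grep -i -e mce -e 'machine check'
grep -i -e mce -e 'hardware error' /var/log/syslog

# corrected/uncorrected ECC error counters, if the EDAC driver is loaded
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null

# component temperatures, if lm-sensors is installed
sensors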

The optimal solution is obviously to fully re-initialize the replica, as that will avoid any risk of anomalies creeping in.

Having said that, this is also clearly unpalatable if you have terabytes of data, so there may be a way you can restore replication without rebuilding the slave completely. Use mysqlbinlog on the master (see the sketch below) to identify where the transactions begin, and look for the transactions adjacent to positions 71137 (your Exec_Master_Log_Pos) and 43445 (where the error says it choked reading the binlog). There is a very good chance that the positions listed are actually erroneous and no transaction in the binlog starts at those positions.
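
A sketch of that inspection, run on the master (the binlog path is an assumption; the start position used here is simply one of the '# at' offsets visible in the dump above):

# list event boundaries and transaction markers around the offset the slave choked on
mysqlbinlog --base64-output=decode-rows -vv --start-position=71049 \
    /var/lib/mysql/mysql-bin.007091 | grep -E '^# at |BEGIN|COMMIT' | less

# do the same for the older file named as "the first event" in the error
mysqlbinlog --base64-output=decode-rows -vv \
    /var/lib/mysql/mysql-bin.007003 | grep -E '^# at |BEGIN|COMMIT' | less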

If that is the case, you can reset your replication coordinates on the slave to the last transaction mysqlbinlog lists before the position the slave thinks it is at (a hedged sketch follows). You can find more details on exactly what to do in this mysql troubleshooting article. There is risk associated with doing this (if the coordinates got corrupted, there is no telling what else got corrupted), but if doing the full re-initialisation immediately is prohibitively expensive, you might be able to buy yourself some time with this approach, at least until you can verify (e.g. using pt-table-checksum) whether you have to rebuild the entire replica data set.
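
A rough sketch of that repositioning on dbm2 – the file name and position are placeholders, to be replaced with the last transaction boundary mysqlbinlog shows before the failing offset – followed by a consistency check run from the master (the DSN and checksum user are assumptions):

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.00nnnn', MASTER_LOG_POS=NNNNN;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G

# later, on dbm1: compare master and slave data per table; differences end up
# in the percona.checksums table and in the DIFFS column of the report
pt-table-checksum --replicate=percona.checksums h=localhost,u=checksum_user,p=***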