MySQL replication issues after a power outage

disaster-recovery, MySQL, mysql-replication, replication

After a power outage at our data centre, the slave MySQL databases are struggling.

This is in the logs for one of the slaves:

100118 10:05:56 [Note] Slave I/O thread: connected to master 'repl@db1:3306',  replication started in log 'bin-log.004712' at position 724207814
100118 10:05:56 [ERROR] Error reading packet from server: Client requested master to start replication from impossible position ( server_errno=1236)
100118 10:05:56 [ERROR] Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log
100118 10:05:56 [Note] Slave I/O thread exiting, read up to log 'bin-log.004712', position 724207814

And the console shows a little more detail:

mysql> show slave status \G;
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: db1
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 10
            Master_Log_File: bin-log.004712
        Read_Master_Log_Pos: 724207814
             Relay_Log_File: mysqld-relay-bin.000105
              Relay_Log_Pos: 98
      Relay_Master_Log_File: bin-log.004712
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
            Replicate_Do_DB: mmplive1,mmpjcr,fui
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 724207814
            Relay_Log_Space: 98
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

ERROR:
No query specified

Looking at the bin logs on the master, we have:

-rw-rw---- 1 mysql mysql  724200412 Jan 18 09:22 bin-log.004712
-rw-rw---- 1 mysql mysql       1904 Jan 18 09:27 bin-log.index
-rw-rw---- 1 mysql mysql    5046830 Jan 18 11:22 slow-log
-rw-rw---- 1 mysql mysql  198249613 Jan 18 11:24 bin-log.004713
  1. The slave status shows that Exec_Master_Log_Pos and Read_Master_Log_Pos are both 724207814, for the binary log bin-log.004712. It is my understanding that this value is the byte position in the binary log file.
  2. That bin-log.004712 file is only 724200412 bytes, so the slaves think they've done 7402 bytes more work than has actually been persisted in the file (which is on an ext3 filesystem, RAID-10, RHEL5). Hence the error message about the impossible log position (the mysqlbinlog check sketched after this list confirms where the on-disk file actually ends).
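
A quick way to confirm where the on-disk binlog actually ends (assuming mysqlbinlog is available on the master and is run from the directory holding the file):

    # On the master: dump the tail of the truncated binlog. The final
    # "# at NNN" / "end_log_pos NNN" markers show the last complete events
    # that actually made it to disk.
    mysqlbinlog bin-log.004712 | tail -n 20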

What's the best way to fix the slaves?

Options I'm considering:

  1. Setting each slave to point at position 0 in the next bin-log file (bin-log.004713) and letting them go, but I'm not sure how safe that is or how much data might be lost.
  2. Do I instead need to do a full backup and restore (with presumed associated downtime due to table locks on the InnoDB tables)? I'd like to avoid that if possible (a non-blocking variant for InnoDB is sketched after this list).
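
If option 2 did turn out to be necessary, a dump taken with --single-transaction would avoid long-held table locks on the InnoDB tables. A rough sketch, with placeholder credentials and output path (the database names are the ones from Replicate_Do_DB above):

    # Hypothetical rebuild dump for one slave -- InnoDB tables only.
    # --single-transaction takes a consistent snapshot without locking tables;
    # --master-data=2 records the binlog coordinates in the dump as a comment,
    # ready for the CHANGE MASTER TO on the rebuilt slave.
    mysqldump -h db1 -u backup_user -p --single-transaction --master-data=2 \
        --databases mmplive1 mmpjcr fui > /tmp/rebuild-slave.sql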

UPDATE:

I missed off another option: to point each slave's executed position back a little, so that it would try replicating commands it had already processed from bin-log.004712 (sketched below).
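
A sketch of that rewind, with a placeholder position (rewinding replays statements the slave may already have applied, so non-idempotent inserts would fail with duplicate-key errors):

    -- On one slave; the position is a placeholder for an earlier offset that
    -- is known to exist inside bin-log.004712 on the master.
    STOP SLAVE;
    CHANGE MASTER TO MASTER_LOG_FILE='bin-log.004712', MASTER_LOG_POS=724000000;
    START SLAVE;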

Best Answer

I went for the first option.

That worked up to the point where the slave started trying to do inserts that conflicted with the primary keys. The slave had done more work than the master bin-log had persisted, as previously mentioned. One aspect I didn't anticipate was that the slave contained data that wasn't in the master; i.e. the slave persisted some transactions before the power outage that the master HAD NOT persisted.

Since, in my case, these transactions weren't payment-related or similarly critical, I chose to delete the data from the slave (thereby losing some data that had happened but didn't exist in the master) and then let replication run again. This brought the slaves completely up to date. If the data had been more important, our auto-increment offsets give us enough wriggle room to manually wrangle the data and ensure referential integrity isn't compromised. Thankfully we didn't need to do that in this case.
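
For anyone needing to do the same, an illustrative way to find the orphaned rows (the table and column names here are hypothetical):

    -- On the master: note the highest persisted auto-increment value.
    SELECT MAX(id) FROM some_table;

    -- On the slave: review the rows beyond that value before deleting them.
    -- Replace 1234567 with the MAX(id) result from the master.
    SELECT * FROM some_table WHERE id > 1234567;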

For a machine in (passive) master-master configuration which was in this predicament, I chose a similar approach. By passive master-master, I mean that we have an active master (serverA) which is where all the writes go, and a passive master (serverB) that is in place to allow for schema updates to take place with zero downtime. The data in active master (serverA) was chosen as the true values, despite knowing that this meant we lost a couple of persisted transactions which weren't deemed important.
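
The auto-increment offsets mentioned above are what leave room for this kind of manual reconciliation; they can be checked on each server like so (the values in the comment are illustrative, not ours):

    -- Each master hands out its own ID sequence, e.g. increment=2 with
    -- offset=1 on serverA and offset=2 on serverB.
    SHOW VARIABLES LIKE 'auto_increment%';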

  • Stopped replication on the passive master (serverB) and changed the log file and position.

    STOP SLAVE; -- on serverB; CHANGE MASTER requires the replication threads to be stopped
    CHANGE MASTER TO MASTER_LOG_FILE='bin-log.004713', MASTER_LOG_POS=0; -- on serverB
    
  • Restarted slave replication on the passive master (serverB) until it failed with primary key constraint violations, as with the other slaves.

     START SLAVE; -- on serverB
    
  • Stopped slave replication from the passive master (serverB) to the active master (serverA).

    STOP SLAVE; -- on serverA
    
  • Deleted the rows on the passive master (serverB) which didn't exist on the active master (serverA).

    DELETE FROM SOME_TABLE WHERE ID IN (???,????); -- on serverB
    SHOW MASTER STATUS\G -- get the new master log position on serverB
    
  • Moved the slave execution position on the active master (serverA) forward to skip those deletes from the passive master (serverB).

    CHANGE MASTER TO MASTER_LOG_POS=???; -- on serverA; use the position just obtained from serverB
    
  • Restarted replication on both the active master (serverA) and the passive master (serverB).

    START SLAVE; -- on both machines. serverA does nothing and serverB starts catching up.
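
As a final check on both machines, the slave status should show both replication threads running and the lag draining back to zero:

    -- On serverA and serverB: expect Slave_IO_Running: Yes, Slave_SQL_Running: Yes,
    -- and Seconds_Behind_Master falling to 0 as serverB catches up.
    SHOW SLAVE STATUS\G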