RDS MySQL – Fixing Point in Time Restore Failures

amazon-rdsamazon-web-servicesdatabase-restore

I have an RDS MySQL instance with MySQL version 5.7.19 for which a point in time restore failed with a status of "incompatible-state".

Through experimentation I have determined that point in time restores fail up until the next snapshot taken and work immediately after the snapshot was taken.

AWS Support tells me that

The restore operation encountered inserts/updates with strings
containing non-encoded binary characters which are a known source of
decoding issues. This issue is a terminal problem as it prevents us
from understanding where exactly the replay broke.

I'm trying to identify how to proactively mitigate the issue. Is there any log I could find such "non-encoded binary characters" proactively without doing a PIT restore? Or do I need to periodically do a restore and snapshot on failure to ensure I have a high amount of time periods covered by PIT restore?

Update

Restoring the snapshot from before the failed PITR shows this error log before running any queries against it:

2018-05-31T07:59:50.880143Z 0 [Warning] 'NO_AUTO_CREATE_USER' sql mode was not set.
2018-05-31T07:59:50.881641Z 0 [Warning] Insecure configuration for --secure-file-priv: Location is accessible to all OS users. Consider choosing a different directory.
2018-05-31T07:59:50.885191Z 0 [Note] /rdsdbbin/mysql/bin/mysqld (mysqld 5.7.19-log) starting as process 3470 ...
2018-05-31T07:59:51.126014Z 0 [Note] InnoDB: PUNCH HOLE support available
2018-05-31T07:59:51.126054Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-05-31T07:59:51.126060Z 0 [Note] InnoDB: Uses event mutexes
2018-05-31T07:59:51.126064Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2018-05-31T07:59:51.126068Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.3
2018-05-31T07:59:51.126071Z 0 [Note] InnoDB: Using Linux native AIO
2018-05-31T07:59:51.150879Z 0 [Note] InnoDB: Number of pools: 1
2018-05-31T07:59:51.182600Z 0 [Note] InnoDB: Using CPU crc32 instructions
2018-05-31T07:59:51.184624Z 0 [Note] InnoDB: Initializing buffer pool, total size = 1G, instances = 8, chunk size = 128M
2018-05-31T07:59:51.245944Z 0 [Note] InnoDB: Completed initialization of buffer pool
2018-05-31T07:59:51.274809Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-05-31T07:59:51.661104Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2018-05-31T07:59:52.079538Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 205460220725
2018-05-31T07:59:53.616196Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 205460333516
2018-05-31T07:59:54.463445Z 0 [Note] InnoDB: Database was not shutdown normally!
2018-05-31T07:59:54.463472Z 0 [Note] InnoDB: Starting crash recovery.
2018-05-31T07:59:56.869876Z 0 [Note] InnoDB: Transaction 1712897897 was in the XA prepared state.
2018-05-31T07:59:57.059082Z 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
2018-05-31T07:59:57.059107Z 0 [Note] InnoDB: Trx id counter is 1712898304
2018-05-31T07:59:57.060227Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
2018-05-31T07:59:57.606094Z 0 [Note] InnoDB: Apply batch completed
2018-05-31T07:59:57.606126Z 0 [Note] InnoDB: Last MySQL binlog file position 0 7766, file name mysql-bin-changelog.196126
2018-05-31T08:01:05.616244Z 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
2018-05-31T08:01:05.616271Z 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2018-05-31T08:01:05.617146Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-05-31T08:01:05.617158Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-05-31T08:01:05.617203Z 0 [Note] InnoDB: Setting file '/rdsdbdata/db/innodb/ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-05-31T08:01:07.653103Z 0 [Note] InnoDB: File '/rdsdbdata/db/innodb/ibtmp1' size is now 12 MB.
2018-05-31T08:01:07.654148Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2018-05-31T08:01:07.654159Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2018-05-31T08:01:07.870099Z 0 [Note] InnoDB: Waiting for purge to start
2018-05-31T08:01:07.920296Z 0 [Note] InnoDB: 5.7.19 started; log sequence number 205460333516
2018-05-31T08:01:07.920583Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 76646ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
2018-05-31T08:01:07.920723Z 0 [Note] InnoDB: Loading buffer pool(s) from /rdsdbdata/db/innodb/ib_buffer_pool
2018-05-31T08:01:07.924167Z 0 [Note] Plugin 'FEDERATED' is disabled.
2018-05-31T08:01:07.938375Z 0 [Note] Recovering after a crash using /rdsdbdata/log/binlog/mysql-bin-changelog
2018-05-31T08:01:07.938453Z 0 [Note] Starting crash recovery...
2018-05-31T08:01:07.938489Z 0 [Note] InnoDB: Starting recovery for XA transactions...
2018-05-31T08:01:07.938496Z 0 [Note] InnoDB: Transaction 1712897897 in prepared state after recovery
2018-05-31T08:01:07.938500Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2018-05-31T08:01:07.938504Z 0 [Note] InnoDB: 1 transactions in prepared state after recovery
2018-05-31T08:01:07.938506Z 0 [Note] Found 1 prepared transaction(s) in InnoDB
2018-05-31T08:01:07.975797Z 0 [Note] Crash recovery finished.
2018-05-31T08:01:08.009971Z 0 [Note] Salting uuid generator variables, current_pid: 3470, server_start_time: 1527753590, bytes_sent: 0, 
2018-05-31T08:01:08.013725Z 0 [Note] Generated uuid: 'c65e1206-64a8-11e8-acdc-0642b77ee370', server_start_time: 976718170713733380, bytes_sent: 47057241759744
2018-05-31T08:01:08.013758Z 0 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: c65e1206-64a8-11e8-acdc-0642b77ee370.
2018-05-31T08:01:08.044320Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2018-05-31T08:01:08.044373Z 0 [Note] IPv6 is available.
2018-05-31T08:01:08.044384Z 0 [Note] - '::' resolves to '::';
2018-05-31T08:01:08.044401Z 0 [Note] Server socket created on IP: '::'.
2018-05-31T08:01:09.042028Z 0 [Note] Event Scheduler: Loaded 0 events
2018-05-31T08:01:09.042160Z 0 [Note] /rdsdbbin/mysql/bin/mysqld: ready for connections.
Version: '5.7.19-log' socket: '/tmp/mysql.sock' port: 3306 MySQL Community Server (GPL)
2018-05-31T08:01:09.042169Z 0 [Note] Executing 'SELECT * FROM INFORMATION_SCHEMA.TABLES;' to get a list of tables using the deprecated partition engine. You may use the startup option '--disable-partition-engine-check' to skip this check. 
2018-05-31T08:01:09.042171Z 0 [Note] Beginning of list of non-natively partitioned tables
2018-05-31T08:01:09.889603Z 0 [Note] InnoDB: Buffer pool(s) load completed at 180531 8:01:09
2018-05-31T08:01:14.839437Z 0 [Note] End of list of non-natively partitioned tables
----------------------- END OF LOG ----------------------

A check for non InnoDB tables shows these:

+---------------+-------------------------+-------------------------------+------------+--------+---------+------------+------------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------------+
| TABLE_CATALOG | TABLE_SCHEMA            | TABLE_NAME                    | TABLE_TYPE | ENGINE | VERSION | ROW_FORMAT | TABLE_ROWS | AVG_ROW_LENGTH | DATA_LENGTH | MAX_DATA_LENGTH  | INDEX_LENGTH | DATA_FREE | AUTO_INCREMENT | CREATE_TIME         | UPDATE_TIME         | CHECK_TIME | TABLE_COLLATION    | CHECKSUM | CREATE_OPTIONS | TABLE_COMMENT |
+---------------+-------------------------+-------------------------------+------------+--------+---------+------------+------------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------------+
| def           | promosolutions_co_nz    | w90h1_finder_tokens           | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            459 |           0 |         15040512 |            0 |         0 |           NULL | 2018-05-31 08:20:13 | NULL                | NULL       | utf8_general_ci    |     NULL |                |               |
| def           | promosolutions_co_nz    | w90h1_finder_tokens_aggregate | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            474 |           0 |         15061350 |            0 |         0 |           NULL | 2018-05-31 08:20:13 | NULL                | NULL       | utf8_general_ci    |     NULL |                |               |
| def           | vc                      | CORE                          | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            777 |           0 |         16627023 |            0 |         0 |           NULL | 2018-05-31 08:20:23 | NULL                | NULL       | latin1_swedish_ci  |     NULL |                |               |
| def           | vc                      | CORErs                        | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            786 |           0 |         16649838 |            0 |         0 |           NULL | 2018-05-31 08:20:23 | NULL                | NULL       | latin1_swedish_ci  |     NULL |                |               |
| def           | vc                      | vw_mygroups                   | BASE TABLE | MyISAM |      10 | Fixed      |          0 |              0 |           0 | 1970324836974591 |         1024 |         0 |           NULL | 2018-05-01 04:44:19 | 2018-05-01 04:44:19 | NULL       | utf8_general_ci    |     NULL |                |               |
| def           | www_automatem_co_nz     | jos_finder_tokens             | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            623 |           0 |         15553818 |            0 |         0 |           NULL | 2018-05-31 08:20:27 | NULL                | NULL       | utf8mb4_general_ci |     NULL |                |               |
| def           | www_automatem_co_nz     | jos_finder_tokens_aggregate   | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            639 |           0 |         15582015 |            0 |         0 |           NULL | 2018-05-31 08:20:27 | NULL                | NULL       | utf8mb4_general_ci |     NULL |                |               |
| def           | www_bayautomotive_co_nz | jos_finder_tokens             | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            623 |           0 |         15553818 |            0 |         0 |           NULL | 2018-05-31 08:20:28 | NULL                | NULL       | utf8mb4_general_ci |     NULL |                |               |
| def           | www_bayautomotive_co_nz | jos_finder_tokens_aggregate   | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            639 |           0 |         15582015 |            0 |         0 |           NULL | 2018-05-31 08:20:28 | NULL                | NULL       | utf8mb4_general_ci |     NULL |                |               |
| def           | www_xpanda_co_nz        | jos_finder_tokens             | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            623 |           0 |         15553818 |            0 |         0 |           NULL | 2018-05-31 08:20:36 | NULL                | NULL       | utf8mb4_general_ci |     NULL |                |               |
| def           | www_xpanda_co_nz        | jos_finder_tokens_aggregate   | BASE TABLE | MEMORY |      10 | Fixed      |          0 |            639 |           0 |         15582015 |            0 |         0 |           NULL | 2018-05-31 08:20:36 | NULL                | NULL       | utf8mb4_general_ci |     NULL |                |               |
+---------------+-------------------------+-------------------------------+------------+--------+---------+------------+------------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+----------------+---------------+

(don't worry, none of these are hosted by this server anymore)

I have run select * on all these tables (all empty) and inserted a singe record into each. The error log shows no additional entries in CloudWatch.

Best Answer

In the instance's parameter group, set binlog_format to ROW. The default on RDS is MIXED even though there is not a compelling reason for this default value under a majority of conditions. In my systems, ROW is always used because it has substantial advantages over MIXED.

One of the more compelling reasons in this scenario is this part of the explanation:

The restore operation encountered inserts/updates with strings containing non-encoded binary characters

While this does not strike me as a complete explanation (because this is not actually invalid, absent a bug) we can easily avoid the problem as described -- the ROW logging format captures changed row images, instead of capturing insert/update(/delete) statements as "strings," so if there is a problem with the way either the RDS infrastructure or the underlying MySQL version handles these, this should eliminate it.

The MIXED mode allows the optimizer to choose whether to log the results of DML as the query that was issued or as images of the rows that were changed, and it will tend to log the statement if nothing internally flags the statement as non-deterministic for, or otherwise incompatible with, replay -- such as using things like the UUID() function (which would not generate the same value when invoked, in the future). (Other functions, like NOW() have add-on mechanisms built into the logging that make them deterministic for replay, counter-intuitively.)

The only meaningful caveat that comes to mind with ROW mode is that your tables must all have primary keys (which should be standard practice, anyway) else restoration and replication (if you have replicas) can be slower, since the entire table is scanned for deletes and updates in that condition... but setting that one consideration aside, the advantages of ROW based logging are such that I have not had occasion to use anything other than ROW for many years.

The second issue is to ensure than you are not using MyISAM or any other storage engines, except for the system tables that use them by default (do not try to change those).

MyISAM does not have InnoDB's crash recovery capability, and RDS relies on this capability to make snapshots usable. It takes precautions on the system tables, so they don't pose an issue, but for your tables, no protection is in place. A non-MyISAM table that is not in a safely coalesced state when the snapshot occurs may prevent the snapshot from working properly.

While support did not identify this as being your issue, it is consistent with your observation that point-in-time does not work until you create a new snapshot.

Although undocumented in any detail, my assumption based on anecdotal evidence is that point-in-time starts with the snapshot before but closest in time to the desired recovery point, and then applies all the relevant archived binary logs to play-forward to the desired recovery point. That's how you do it manually, outside of RDS, and based on the observable actions of the "rdsadmin" user, there's not a reason to believe that RDS reinvented the wheel.

Of course, the trouble with a snapshot that includes mangled MyISAM tables is, this isn't something that presents itself until after you restore the snapshot and then try to access one of the tables.

If this is actually the issue, support may have assumed that you were aware of the following and uncorrectly ruled it out as a possibility:

The MyISAM storage engine does not support reliable recovery and can result in lost or corrupt data when MySQL is restarted after a recovery, preventing Point-In-Time restore or snapshot restore from working as intended.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_MySQL.html

If you are using MyISAM tables, you should convert those to InnoDB.

Related Topic