MySQL – AWS RDS mysqldump hanging

amazon-rds MySQL

I am attempting to migrate data between RDS MySQL instances. I cannot use snapshots because, as part of this migration, I want to encrypt the underlying disk and upgrade versions (5.1.73a to 5.5.41). The data is overwhelmingly concentrated in one table: the DB weighs 24.3 GB overall, and 23.9 GB of that sits in a single user login table.

In an effort to limit downtime, I am backing up the historical data within that one table ahead of time: before the downtime, transfer all rows from the login table where id is less than 89,000,000; during the downtime, transfer the rows where id is greater than or equal to 89,000,000. The command is:

mysqldump -u${source_user} --opt --skip-add-drop-table -p${source_password} --host=${source_host} ${database} ${table_name} --where="${where_clause}" | sed 's/CREATE TABLE/CREATE TABLE IF NOT EXISTS/g' | mysql -C -u${target_user} -p${target_password} --host=${target_host} ${database}

It's hacky but has worked well previously. I run it from a third server.
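
For completeness, ${where_clause} is just the id split described above; roughly:

# pre-downtime pass: the historical rows
where_clause="id < 89000000"
# during-downtime pass: everything that arrived since
where_clause="id >= 89000000"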

However, this time I'm hitting problems. I've run it in a couple of ways. When I run it as one block, the throughput is extremely variable, and eventually the entire process hangs with no network load showing on the coordinating server. I have also tried chunking the rows by id (i.e. seq 0 100000 89000000; that loop is sketched below), which starts off great but hangs on particular chunks: the median 100k-row chunk takes about 8 seconds, but roughly one chunk in ten takes 300+ seconds. I wouldn't even care that some chunks take that long, but then this also happens:

mysqldump: Error 2013: Lost connection to MySQL server during query when dumping table at row: 42158

The target instance shows 100% CPU usage, very busy bin logs, write IOPS spiking close to 600, and queue depth, also spiky, near 20.
I have tried setting pretty much every timeout to 1000 seconds and doubling the bin log size, to very little effect. My coworker suspects a write IOPS problem (for the record, it's an SSD-based instance without provisioned IOPS), but we've taken the same approach with similar servers and not run into this issue. The source is a recent image of the current production server with a magnetic drive.
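
For reference, the chunked variant mentioned above is just the same pipeline wrapped in a loop over id ranges; a rough sketch, with the 100,000-row step and 89,000,000 boundary from this migration and the same variables as the one-shot command:

# walk the pre-downtime id range in 100k chunks
for start in $(seq 0 100000 89000000); do
  end=$((start + 100000))
  mysqldump -u${source_user} --opt --skip-add-drop-table -p${source_password} \
    --host=${source_host} ${database} ${table_name} \
    --where="id >= ${start} AND id < ${end}" \
    | sed 's/CREATE TABLE/CREATE TABLE IF NOT EXISTS/g' \
    | mysql -C -u${target_user} -p${target_password} --host=${target_host} ${database}
done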

What am I missing? Thanks.

Best Answer

Instead of doing a dump and reload, I would temporarily set up replication between the two MySQL servers, with the old server as the master and the new one as the slave. You can then let replication take as long as it needs to catch up; once it has, break replication, unconfigure the new instance as a slave, and start using it in place of the original server. This limits your required downtime to the few moments at the very end of the process when you break replication and switch which MySQL server your application uses.
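
The answer doesn't spell out the mechanics, but since both ends are RDS instances you can't run CHANGE MASTER TO directly; the usual route is RDS's replication stored procedures. A rough sketch, assuming the engine versions in play support mysql.rds_set_external_master, that binary logging and a replication user exist on the source, and that you know the binlog file/position your initial load corresponds to (the host name, credentials, and coordinates below are placeholders):

# On the new (target) instance: point it at the old instance as its master,
# then start replicating from the recorded binlog coordinates.
mysql -u${target_user} -p${target_password} --host=${target_host} -e "
  CALL mysql.rds_set_external_master('old-instance.xxxxxx.us-east-1.rds.amazonaws.com', 3306,
       'repl_user', 'repl_password', 'mysql-bin-changelog.000123', 4, 0);
  CALL mysql.rds_start_replication;"

# Once it has caught up and the application has been cut over:
mysql -u${target_user} -p${target_password} --host=${target_host} -e "
  CALL mysql.rds_stop_replication;
  CALL mysql.rds_reset_external_master;"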