MySQL – DRBD with MySQL

Tags: drbd, heartbeat, mysql

Question about using DRBD to provide HA for MySQL.

I need to be sure that my backup MySQL instance is always going to be in a functional state when the failover occurs. What happens, for example, if the primary dies part way through committing a transaction?

Are we going to end up with data copied to the secondary that MySQL can't handle? Or what if the network goes away while the two are syncing and not all of the data makes it across?

It seems like it's possible to get into a state where incomplete data on the secondary makes it impossible for mysql to start up and read the database.

Am I missing something?

Best Answer

It depends, naturally, on the nature of the failover. It also sounds like you already know the answer to your question.

DRBD is, fundamentally, network RAID mirroring. Blocks in -> blocks out. You can run synchronously or asynchronously, depending on your latency requirements. Which of these you choose tremendously affects whether your replica is crash-consistent or not.
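To make that choice concrete, here is a sketch of a DRBD resource configuration showing where the protocol is selected. The resource name, hostnames, devices, and addresses below are placeholders, not taken from the question:

```ini
# /etc/drbd.d/r0.res -- illustrative sketch (DRBD 8.x-style syntax);
# all names, devices, and addresses are example values
resource r0 {
  protocol C;       # fully synchronous: a write completes only after
                    # the peer has it on disk (crash-consistent replica,
                    # at the cost of network latency on every write)
  # protocol A;     # asynchronous: lowest write latency, but the
                    # secondary can lag and lose the most recent writes

  on db1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on db2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

For database workloads, protocol C is the usual choice: with protocol A you are trading durability on the peer for lower latency, which is exactly the scenario in the question where committed data might not make it across.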

Reduced to that level, your question becomes: "what happens when MySQL starts up and it reads data files?" Either your data is well-formed and quiesced, and it starts without a hitch, or it's crash-consistent, and you might have consistency issues. (There's also the possibility that you have on-disk corruption, of course, and this can also be a problem with DRBD, especially if you somehow end up with a split-brain scenario.) Usually, it can recover itself by replaying logs if you're using a transactional engine, but sometimes you will have more serious issues. This is as true with DRBD as with other shared block storage, like a shared SAN volume or (heaven forbid) database files on NFS.

Hypothetically, an ACID-compliant database should always recover gracefully from incomplete transactions. In practice, and especially with some MySQL versions, this isn't always the case (largely because MySQL doesn't have the greatest legacy of ACID compliance, though things have improved in recent years). Keeping frequent backups is always a sensible thing to do.

There is no way to ensure that any high-availability system will always continue working on a failover. The best you can do is make the right decisions when architecting your HA solution, and test the crap out of them to validate your assumptions about how it's going to behave when things go wrong.

In your case, you may want to consider a standby slave in case you run into a consistency issue on the master's disk. It takes manual work to promote it, of course, but at least you won't be restoring hours- or days-old data.
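The manual promotion of such a standby, on a classic replication setup, looks roughly like this; it assumes you have already confirmed the slave applied everything it received from the failed master:

```sql
-- On the standby slave, after verifying relay-log catch-up:
STOP SLAVE;                  -- stop the replication I/O and SQL threads
RESET SLAVE ALL;             -- discard the old master's replication config
SET GLOBAL read_only = OFF;  -- allow application writes on this host
-- Then repoint the application (or a floating IP) at this server.
```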