Linux – Amazon EC2 Backup Strategy With Restrictions (few or no snapshots can be taken?)

amazon-ec2, backup, linux

Similar questions have been asked, but I need to know what would be recommended under these particular circumstances, and whether I'm missing something in my understanding of using EC2.

A small startup is running their business on EC2 and asked me for some advice on backup options. They're self-funded at the moment and are doing what they can to save costs where feasible. Without delving too much into the configuration of their systems, I'll give a web server as an example: a simple web server with a database. The rub is that they do not want the server taken down.

The person who has been doing the setup believes they should either do periodic dumps of the database and store them on S3, or create scripts that would rebuild a new server on Amazon when needed, backed by copies of select folders holding configuration information. He suggested that creating snapshots of the server would be wasteful: they take a lot of disk space, and data would rot between large dumps, so a snapshot would quickly become outdated.

My thought was to take a snapshot of the VM, and then do periodic dumps of the database and store them in S3. If they were to lose the EC2 instance, or have something like an update render it unusable, they could use the snapshot to rebuild the server relatively quickly with the latest database dump, rather than start from scratch with a completely new AMI.

My understanding is that taking a snapshot of an EC2 instance (or its EBS volume) requires downtime, something they are hesitant to accept. I have also read that the server should be shut down to keep the filesystem consistent when the snapshot is taken. Since they do not yet have a cluster behind a load balancer, this limits the options involving snapshots.

Scripting the building of servers, unless there is something Amazon-specific I'm unaware of, would involve running a Chef or Puppet server that could deploy new servers with their associated roles on EC2. The startup doesn't currently have funding to keep that kind of server in the wings, and right now they don't really need to deploy that many servers.

Ideally they would have the funding to create a number of servers behind their own load balancer or Amazon's load balancing service, then take the servers down one at a time to perform updates or snapshots. Right now I'd be nervous about doing updates at all, because database dumps won't help if a system update alters a library their application relies on and the service goes down.

I also suppose another option is to run a script that creates an EBS volume, mounts it, runs something like rsync to capture most of the filesystem onto the volume, compresses and copies the contents to S3, then detaches and destroys the volume to save on storage costs, followed by a database dump to catch in-flight data that would otherwise be inconsistent. For some of their servers it will most likely become necessary to save to temporary EBS volumes as their database needs grow.
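
Something like the following is what I have in mind. This is purely a sketch: the instance/volume IDs, device name, database name, and bucket are placeholders, and it assumes the AWS CLI is installed on the instance:

```bash
#!/bin/bash
# Sketch only -- IDs, device names, and the bucket are placeholders.
INSTANCE_ID="i-0123456789abcdef0"
BUCKET="s3://my-backup-bucket"

# Create and attach a temporary EBS volume.
VOL=$(aws ec2 create-volume --size 50 --availability-zone us-east-1a \
      --query VolumeId --output text)
aws ec2 attach-volume --volume-id "$VOL" \
    --instance-id "$INSTANCE_ID" --device /dev/xvdf
sleep 30   # crude wait for the device node to appear
mkfs.ext4 /dev/xvdf && mount /dev/xvdf /mnt/backup

# Capture most of the filesystem, then compress and stream to S3.
rsync -a --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/mnt \
    / /mnt/backup/
tar czf - -C /mnt/backup . | aws s3 cp - "$BUCKET/fs-backup.tar.gz"

# Dump the database separately to catch in-flight data.
mysqldump --single-transaction mydb | gzip | aws s3 cp - "$BUCKET/mydb.sql.gz"

# Detach and destroy the volume to save on storage costs.
umount /mnt/backup
aws ec2 detach-volume --volume-id "$VOL"
aws ec2 wait volume-available --volume-ids "$VOL"
aws ec2 delete-volume --volume-id "$VOL"
```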

A VMware sandbox is being created to recreate their network systems in an environment where updates can be pre-tested before applying them to the production systems on Amazon. I am hoping that will minimize the possibility that a system update kills their application.

So… given the restrictions of running one server, with the database and application server on the same system, and looking to have as close to zero downtime as possible (restricting the use of snapshots, and requiring that the backup process be as "hot" as possible, i.e. created live without taking the server down), am I on the wrong track in suggesting that they schedule a time to create a snapshot of the EC2 instance in its working state and from there do database dumps copied to S3? Is there a better strategy for creating a live backup of a server if snapshots will create downtime?

Best Answer

There is something interesting about this question, specifically with regard to the idea of downtime. Part of the idea is that if an application is sensitive to downtime, then recovery time must also be factored in. (As an extreme argument: taking no backups requires no downtime, unless you happen to need those backups, in which case the downtime may approach infinity.)

A bit about EBS

EBS volumes and snapshots operate at the block level, a consequence of which is that snapshots can be taken while an instance is running, even if the EBS volume is in use. However, only data that is actually on the disk (i.e. not in a file cache) will be included in the snapshot. It is that caveat that gives rise to the idea of consistent snapshots.

  • The recommended way is to detach the volume, snapshot it, and reattach it; this is usually not practical.
  • The next best option involves flushing the write caches to disk, freezing the file system, and taking your snapshot (a sketch of this follows below).

An interesting point here is that in both cases above, you do not need to wait for the snapshot to finish before reattaching/unfreezing and resuming writes to the disk: once the snapshot has been initiated, your data will be consistent to that point in time. Typically this requires only a few seconds, during which your disk is write-locked. Also, since most databases structure their files on disk in a reasonable manner, there is a good chance that inserts have a minimal effect on existing blocks, which minimizes the data added to the snapshot.
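
A minimal sketch of the freeze-and-snapshot approach, assuming a non-root data volume, the AWS CLI, and fsfreeze from util-linux (the volume ID and mount point are placeholders):

```bash
#!/bin/bash
# Placeholders -- substitute your own volume ID and mount point.
VOLUME_ID="vol-0123456789abcdef0"
MOUNT_POINT="/data"

sync                                 # flush pending writes to disk
fsfreeze --freeze "$MOUNT_POINT"     # block new writes; FS is now consistent

# Initiate the snapshot; data is consistent as of this instant, so we
# do not need to wait for the snapshot to complete before unfreezing.
aws ec2 create-snapshot --volume-id "$VOLUME_ID" \
    --description "backup $(date +%F)"

fsfreeze --unfreeze "$MOUNT_POINT"   # writes resume; frozen only seconds
```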

Consider the point of the backup

EBS volumes are already replicated within an availability zone - so there is a degree of redundancy built in. If your instance terminates, you can simply attach the EBS volume to a new instance and (after you get past the lack of consistency) resume where you left off. In many regards this makes the EBS volume much like an inconsistent snapshot, provided that you can access it. That said, most EC2 users probably recall the cascading failures of EBS volumes from early 2011 - volumes were inaccessible for multiple days, and some users lost data as well.

RAID1

If you are trying to safeguard against the failure of an EBS disk (it does happen), you may consider a RAID1 setup. Since EBS volumes are block devices, you can easily use mdadm to set them up in your desired configuration. If one of your EBS volumes isn't performing to spec, it is easy enough to manually fail it (and later replace it with another EBS volume). Of course, this has downsides: increased time for every write, greater susceptibility to variable performance, double the I/O cost (monetarily, not performance-wise), no real protection against a more widespread AWS failure (a common problem last year was the inability to detach EBS volumes that were stuck in a locked state), and, of course, the inconsistent state of the disk on failure.
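
If you go this route, a bare-bones mdadm setup over two attached EBS volumes might look like the following (device names vary by instance; /dev/xvdf, /dev/xvdg, and /dev/xvdh are assumptions):

```bash
# Build the mirror, create a filesystem, and mount it.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/xvdf /dev/xvdg
mkfs.ext4 /dev/md0
mount /dev/md0 /data

# If one volume misbehaves, fail it manually and swap in a fresh one
# (assuming the replacement EBS volume is attached as /dev/xvdh).
mdadm /dev/md0 --fail /dev/xvdg --remove /dev/xvdg
mdadm /dev/md0 --add /dev/xvdh
```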

S3FS

An option for certain applications (definitely NOT for databases) is to mount S3 as a local file system (e.g. via s3fs). This is slow, lacks some of the features one would expect from a file system, and may not behave as expected (eventual consistency). That said, for a simple purpose like making uploaded files available across instances, it may have merit. Obviously it isn't for anything that requires good read/write performance.
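
For completeness, mounting a bucket this way is a one-liner (the bucket name is a placeholder; this assumes s3fs-fuse is installed and credentials are stored in ~/.passwd-s3fs):

```bash
# Mount the bucket at /mnt/s3; allow_other lets non-root users read it.
s3fs my-uploads-bucket /mnt/s3 -o passwd_file=~/.passwd-s3fs -o allow_other
```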

MySQL bin-log

One more option, specific to MySQL, is the use of the bin-log. You can set up a second EBS volume to store your bin-log (to minimize the effect of the added writes on your database), and use that in conjunction with whatever database dumps you take. You might even be able to do this with s3fs, which may actually have merit if the performance is tolerable (rsyncing the files to S3 would probably be better than writing through s3fs directly, and you will definitely want to compress what you can).
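
A sketch of what that might look like, assuming a second EBS volume mounted at /binlog and a placeholder bucket name:

```bash
# my.cnf additions (paths are assumptions):
#   [mysqld]
#   log-bin          = /binlog/mysql-bin
#   expire_logs_days = 7

# Rotate so the active bin-log file is closed, then ship the closed
# files to S3, compressed.
mysqladmin -u root -p flush-logs
for f in /binlog/mysql-bin.[0-9]*; do
    gzip -c "$f" > "/tmp/$(basename "$f").gz"
done
aws s3 sync /tmp/ s3://my-backup-bucket/binlogs/ \
    --exclude "*" --include "mysql-bin.*.gz"
```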

Once again, though, we come back to the idea of purpose. Consider what would happen with the above suggestions:

  • EBS volumes inaccessible:
    • RAID1 - useless, since you can't get to the data
    • bin-log - useless, unless you exported it to S3 (and even then there would likely be some delay)
  • Instance terminates unexpectedly:
    • RAID1 - your disks are available but not consistent; your database may recover from the inconsistency on its own
    • bin-log - your data should be accessible, although you may need to review the last few events
  • Someone runs DROP DATABASE as root:
    • RAID1 - you have two perfect copies of a non-existent database
    • bin-log - you should be able to replay the events up to just before the command, so you should be OK (a sketch of such a replay follows this list)
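
For the DROP DATABASE case, a point-in-time replay from the bin-log might look like the following (the file name and stop position are hypothetical; you would find the position by inspecting the log first):

```bash
# Inspect the log to find the position just before the bad statement...
mysqlbinlog /binlog/mysql-bin.000042 | less

# ...then replay events only up to that position.
mysqlbinlog --stop-position=4215 /binlog/mysql-bin.000042 | mysql -u root -p
```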

So really, RAID1 is mostly useless here, and the bin-log takes too long to replay; both may have merit under certain circumstances, but are far from the ideal backup.

Snapshots

It is important to note that snapshots are differential, and only store the blocks that actually contain data (and are compressed). With an EBS volume, if you have a 20GB volume but only use 1GB, you are still charged for the full 'provisioned' storage (20GB); with a snapshot, you are only charged for what you use. If no data changes between snapshots, there is (theoretically) no additional storage charge. (Snapshots are billed for PUTs/GETs and used storage.)

As an aside, I would highly recommend that your application data (including databases) not be stored on your root volume (which you may already have set up). One of the advantages is that, hopefully, your root volume sees a minimum of change, meaning its snapshots can be less frequent (or will contain a minimum of change), reducing cost and simplifying management.

It is also relevant to mention that you can delete old snapshots at any time - even though they are differential they will not affect the other snapshots. That said, each block allocated to a snapshot will not be relinquished until there is no snapshot that still references that block.

The problem with periodic dumps is, firstly, the time between dumps (possibly addressed by using MySQL's bin-log) and, secondly, the difficulty of recovery: it takes time to import a large dump and replay all the events from a bin-log. Creating a dump is not without its performance implications either, and such dumps arguably require far more storage than a snapshot. Setting up an EBS volume solely for the databases and snapshotting that would be preferable in most regards (though taking a snapshot has a bit of a performance impact as well).
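
For reference, the kind of periodic dump being discussed is simple enough (the database and bucket names are placeholders; --single-transaction avoids locking InnoDB tables for the duration of the dump):

```bash
# Dump, compress, and upload; one file per day.
mysqldump --single-transaction --routines mydb | gzip \
    > "/tmp/mydb-$(date +%F).sql.gz"
aws s3 cp "/tmp/mydb-$(date +%F).sql.gz" s3://my-backup-bucket/dumps/
```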

The beauty of snapshots and EBS volumes is that they can be used on other instances. If your instance fails to boot, you can attach the root volume to another instance to diagnose and fix the problem, or just to copy your data off it, and you can switch root volumes with only a couple of minutes of downtime (stop the instance, detach the root volume, attach a new root volume, start the instance). This same idea applies to having your data on a second EBS volume: you just spin up a new instance from your custom AMI and attach your current EBS volume to it, which helps minimize downtime.
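
A sketch of that root-volume swap with the AWS CLI (all IDs are placeholders; the root device name, often /dev/sda1 or /dev/xvda, depends on the AMI):

```bash
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

aws ec2 detach-volume --volume-id vol-0aaaaaaaaaaaaaaaaa      # old root
aws ec2 attach-volume --volume-id vol-0bbbbbbbbbbbbbbbbb \
    --instance-id i-0123456789abcdef0 --device /dev/sda1      # new root

aws ec2 start-instances --instance-ids i-0123456789abcdef0
```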

(One could make the argument, though I probably wouldn't recommend it, that you could set up two copies of MySQL on the same server (master-slave), using two EBS volumes, and then shut down your slave to take a snapshot of its EBS volume. It would be consistent, with no downtime, but the performance costs are likely not worth it.)

AWS does have autoscaling, which can maintain a constant number of instances (even if that number is one). You would be deploying from a snapshot, however, so if your snapshot is not useful, the premise isn't of much use either.
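
A minimal sketch of keeping exactly one instance alive this way, assuming a custom AMI built from your snapshot (the names, IDs, instance type, and zone are all placeholders):

```bash
# Define what to launch...
aws autoscaling create-launch-configuration \
    --launch-configuration-name web-lc \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.micro

# ...and keep exactly one copy of it running at all times.
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name web-asg \
    --launch-configuration-name web-lc \
    --min-size 1 --max-size 1 --desired-capacity 1 \
    --availability-zones us-east-1a
```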

Another couple of points: you can deploy as many instances as you want from a single snapshot (unlike an EBS volume, which can only be attached to a single instance at any given time), and EBS volumes are restricted to a single availability zone, while snapshots can be used anywhere within a region.

Ideally, with a snapshot, if your server goes down you can just launch a new one from the last snapshot. Especially if you separate your root volume from your data, a bad update should result in a minimum of downtime, since you would just transfer the EBS volume containing your data across (taking a snapshot of it first to preserve anything that might be corrupted due to inconsistency).

As a side note, Amazon states the failure rate of EBS volumes increases with the amount of data changed on them since the last snapshot.

Final recommendations

  • Use snapshots - they are great - they reduce downtime much more than they cause it
  • Separate data and the root volume, perhaps even putting the databases on their own volume, and snapshot before updates if necessary
  • Use the bin-log to stay as 'hot' as possible - upload this (compressed) to S3
  • Ensure you actually get the data off the instance (even if the data is intact on an EBS volume, the volume itself might be temporarily inaccessible).
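
Tying those together, a hypothetical crontab might look like this (the script name and bucket are placeholders; the snapshot script is the freeze-and-snapshot sketch from earlier):

```bash
# Nightly consistent snapshot of the data volume at 03:00;
# hourly bin-log rotation and compressed upload to S3.
0 3 * * *  /usr/local/bin/ebs-snapshot.sh
0 * * * *  mysqladmin flush-logs && aws s3 sync /binlog/ s3://my-backup-bucket/binlogs/ --exclude "*.index"
```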

(I do believe I have written too much, but not said enough - but hopefully you find something worth the read).