There are plenty of ways to do backups; here are a few thoughts and options.
Incremental Backups
Ideally your backups should be incremental, meaning each run only stores what has changed since the previous backup. That lets you roll back to earlier versions if you need to, and it can significantly reduce CPU, bandwidth, and sometimes storage. You can also look at differential backups, which are slightly different: each one is a delta against the last full backup rather than against the previous backup.
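As a rough illustration of the idea (rsync isn't mentioned elsewhere in this article, and the paths are made up), a snapshot-style incremental backup on Linux can be done with rsync's --link-dest option: files that haven't changed are hard-linked to the previous snapshot, so each run only consumes space for what changed.
# Hypothetical example: incremental snapshots with rsync
# Unchanged files are hard-linked to yesterday's snapshot; only changed files use new space
rsync -a --delete --link-dest=/backups/2020-08-01 /data/ /backups/2020-08-02/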
On-Premises Backups to AWS
The AWS Storage Gateway virtual appliance could be useful for you. Install the VM and allocate it some disk. There are a few modes, and the names changed not long ago, but in essence it presents a local disk that's backed up to S3. It can keep all your data on-premises, or use the on-premises disk as a cache in front of S3.
Otherwise there are plenty of other options to get data into AWS.
AWS S3 Sync
I use the "aws s3 sync" command line utility to upload data to AWS S3, using the IA storage class. This keeps the backups safe on AWS S3. You can upload the files you want backed up direct to S3, you can upload the repository of a backup program that's stored locally, or your backup software may use S3 natively. Using the command line here's what you do.
aws s3 sync /path/to/backups s3://bucketname/backups --storage-class STANDARD_IA --delete --exclude ".sync/*"
You can turn on encryption and versioning within S3. Versioning keeps each version of a file separately; it's not incremental, so it can use more storage than incremental backup software that also does compression.
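If you prefer to do that from the command line, something like the following should work (the bucket name is a placeholder):
# Enable versioning and default encryption on the bucket
aws s3api put-bucket-versioning --bucket bucketname --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket bucketname --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'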
Destination
S3 is a good place for backups. If your backups are fairly static you can use S3 Infrequent Access class storage, which is cheaper than standard storage class. You can use the S3 Glacier storage class if your backups are immutable (never change), but remember it takes hours to get those backups back.
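If you don't want to set the storage class on every upload, a lifecycle rule can transition objects automatically after a set number of days. Something along these lines should do it (bucket name, prefix, and timings are just examples):
aws s3api put-bucket-lifecycle-configuration --bucket bucketname --lifecycle-configuration '{
  "Rules": [{
    "ID": "archive-old-backups",
    "Status": "Enabled",
    "Filter": {"Prefix": "backups/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER"}
    ]
  }]
}'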
Dropbox is a reasonable place for backups. I've used the Dropbox Uploader script. It's not as flexible though - it doesn't delete files that have been removed locally, wasting disk space. This is how you use it.
/opt/Dropbox-Uploader/dropbox_uploader.sh -s -q upload /path/to/files /dropbox/path
**Borg Backup (no longer using)**
I backup my Linux server (which happens to be on AWS) using Borg Backup. This creates an incremental, de-duplicated backup on a local disk. It has retention policies which tell it how long to save data - eg every night for a week, once a week for a month, monthly for a year. There are plenty of incremental backup programs you can use.
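For reference, a retention policy like that maps onto Borg's prune flags roughly as follows (the repository path and exact retention numbers are just examples):
# Create the repository once, then back up and prune on a schedule
borg init --encryption=repokey /backups/borg-repo
borg create /backups/borg-repo::'{hostname}-{now}' /data
borg prune /backups/borg-repo --keep-daily 7 --keep-weekly 4 --keep-monthly 12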
One thing I don't love about Borg Backup is that each time it runs it renames existing files. I think this behaviour changes for really large backups, but my 500MB backup definitely renames the file each night. You end up with hundreds of tiny files plus one large file that's new or renamed each day. Because of that, if your backups are remote you'll probably use a lot more bandwidth than you'd expect. I stopped using Borg because of this.
Borg also supports remote repositories natively, accessed via SSH. You could have an EC2 instance with an EBS disk come up, sync to it, then shut the instance down - but EBS is a lot more expensive than S3, so it's not a great option.
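A remote repository is just a different repository URL - something like this, with the host and path made up (Borg needs to be installed on the remote machine as well):
borg create ssh://user@backuphost/./borg-repo::'{hostname}-{now}' /data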
**Restic Backup (now using for PC / web server backups)**
I've been using Restic Backup for a couple of years now. It aims to be easy, fast, verifiable, secure, efficient, and free. It works on most platforms (*nix, Windows, etc), and it's compiled to a single binary so installation is easy.
It's block based and efficient with disk / network. I've done multiple restore tests and it's always restored everything perfectly. I'm happy with restic. It's under slow but steady development; features like compression are planned - it does de-duplication but not compression as of August 2020.
Initialise the repository
set RESTIC_PASSWORD=abcdefg
restic_0.9.1_windows_amd64.exe init --repo x:\repository
Back up c:\data to the repository:
set RESTIC_PASSWORD=abcdefg
restic_0.9.1_windows_amd64.exe --repo x:\repository backup c:\data --exclude c:\data\exclude
You can have Restic keep backups for configurable periods - for example keep daily backups for a week, weekly backups for 8 weeks, and monthly backups for 24 months.
restic_0.9.1_windows_amd64.exe --repo x:\repository forget --keep-daily 7 --keep-weekly 8 --keep-monthly 24
restic_0.9.1_windows_amd64.exe --repo x:\repository prune
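The restore tests mentioned above are easy to run the same way - something along these lines, restoring into whatever scratch directory you like:
REM verify the repository, then restore the latest snapshot to a test directory
restic_0.9.1_windows_amd64.exe --repo x:\repository check
restic_0.9.1_windows_amd64.exe --repo x:\repository restore latest --target c:\restore-test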
If you want Restic to back up to S3 you just define your keys and do a backup like this. One thing to consider is that Restic may read data back from S3 even though it has a local cache, so the IA storage class (which charges per GB retrieved) could end up more expensive than standard in some cases - though I suspect those cases would be rare.
Here's the basic setup of Restic:
REM setup S3 (once)
set AWS_ACCESS_KEY_ID=ABCDEFGHIJK
set AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXX/AAA
set RESTIC_PASSWORD=abcdefg
restic_0.9.1_windows_amd64.exe --repo s3:s3.amazonaws.com/s3-bucket-name init
This is how you do the backup:
REM backup
restic_0.9.1_windows_amd64.exe --repo s3:s3.amazonaws.com/s3-bucket-name backup c:\data
Prior to these changes you had to create a vault within Glacier and upload archives into it. The link you referenced details how Glacier is now a storage class of S3. You no longer need to move files into Glacier; you can simply upload them with the storage class GLACIER or DEEP_ARCHIVE. You can also change the storage class of existing files via the Properties tab in the console or from the command line.
From the AWS CLI, you can use a command similar to this:
aws s3 cp /etc/hosts s3://faketest/hosts --storage-class GLACIER
You can see the storage class using s3api:
aws s3api list-objects --bucket faketest
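If the full output is too verbose, the CLI's built-in JMESPath filtering can show just the keys and their storage class:
aws s3api list-objects --bucket faketest --query 'Contents[].[Key,StorageClass]' --output table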
To change the storage class from the console, click on the object's Properties tab and select GLACIER.
You can similarly set the storage class if you upload a file through the console.
For existing files you can change their storage class through the CLI using something similar to:
aws s3api copy-object --copy-source faketest/temp.txt --bucket faketest --storage-class GLACIER --key temp.txt
The above command copies an existing file from the bucket back to the same bucket with a change to storage class. There may be alternative methods to this.
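One alternative that should achieve the same result is the higher-level s3 cp command, copying the object over itself with a new storage class:
aws s3 cp s3://faketest/temp.txt s3://faketest/temp.txt --storage-class GLACIER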
References
Glacier FAQ
S3 CLI cp
S3 Storage Classes
s3api copy-object
Best Answer
Per the Glacier FAQ, each file you upload is assigned a unique ID. Upload the same file twice and each copy of the file gets its own ID. This gives you the ability to restore previous versions of the file if desired.
To avoid the surcharge for deleting data less than 3 months old, this is likely the best approach. But it won't just be the data that no longer exists that you need to track and delete. As mentioned above, any time a file changes and you re-upload it to Glacier, you get a new ID for it. You'll eventually want to delete the older versions of the file as well, assuming you don't want the ability to restore to those older versions.
That's the tradeoff you really have to decide for yourself: do you tar/zip everything and then have to track those files and everything in them, or is it worth it to you to upload files individually so you can purge them individually as they're no longer needed?
A couple of other approaches you might consider:
Having said all that, however, Glacier just may not be the best approach for your needs. Glacier is really meant for data archiving, which is different from just backing up servers. If you just want to do incremental backups of a server, using S3 instead of Glacier might be a better approach. A tool like Duplicity or rdiff-backup (in conjunction with something like s3fs) would let you take incremental backups to an S3 bucket and manage them very easily. I've used rdiff-backup on a few Linux systems over the years and found it worked quite nicely.
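A rough sketch of the rdiff-backup plus s3fs combination might look like this - the bucket name, mount point, and retention window are made up, and s3fs reads credentials from ~/.passwd-s3fs or environment variables by default rather than from the command line:
# Mount the S3 bucket as a filesystem, then back up into it incrementally
s3fs bucketname /mnt/s3-backups
rdiff-backup /data /mnt/s3-backups/data
# Expire increments older than 8 weeks
rdiff-backup --remove-older-than 8W /mnt/s3-backups/data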