> I have read about the versioning feature for S3 buckets, but I cannot seem to find if recovery is possible for files with no modification history. See the AWS docs here on versioning:
I've just tried this. Yes, you can restore from the original version. When you delete the file, S3 creates a delete marker, and you can restore the version before that, i.e., the single existing revision.
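A minimal boto3 sketch of that restore (bucket and key names below are placeholders): listing the object's versions and deleting the delete marker itself brings the single original version back as the current one.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"      # hypothetical bucket name
KEY = "path/to/file.txt"  # hypothetical key

# List all versions and delete markers recorded for this key.
resp = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)

# Permanently deleting the delete marker "undeletes" the object:
# the prior (and only) version becomes current again.
for marker in resp.get("DeleteMarkers", []):
    if marker["Key"] == KEY and marker["IsLatest"]:
        s3.delete_object(Bucket=BUCKET, Key=KEY, VersionId=marker["VersionId"])
```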
> Then, we thought we may just back up the S3 files to Glacier using object lifecycle management:
>
> But, it seems this will not work for us, as the file object is not copied to Glacier but moved to Glacier (more accurately, it seems it is an object attribute that is changed, but anyway...).
Glacier is really meant for long-term storage that is accessed very infrequently. It can also get very expensive to retrieve a large portion of your data in one go, as it's not meant for point-in-time restoration of lots of data (percentage-wise).
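For reference, the lifecycle setup the question describes looks roughly like this in boto3 (bucket name and rule ID are hypothetical). Note that the rule *transitions* matching objects to the Glacier storage class in place; it does not create a second copy.

```python
import boto3

s3 = boto3.client("s3")

# Transition every object to Glacier 30 days after creation.
# This changes the object's storage class; the S3 key remains,
# but there is no separate Glacier copy to fall back on.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": ""},  # match all objects
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```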
> Finally, we thought we would create a new bucket every month to serve as a monthly full backup, and copy the original bucket's data to the new one on Day 1. Then using something like duplicity (http://duplicity.nongnu.org/) we would synchronize the backup bucket every night.
Don't do this: you can only have 100 buckets per account, so in three years you'd use up a third of your bucket allowance on backups alone.
> So, I guess there are a couple questions here. First, does S3 versioning allow recovery of files that were never modified?
Yes
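One caveat: versioning must already be enabled on the bucket at the time the file is deleted. A minimal boto3 sketch of turning it on (bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# Once enabled, every new upload and every delete is versioned,
# so even a never-modified file can be recovered after deletion.
s3.put_bucket_versioning(
    Bucket="my-bucket",  # hypothetical bucket name
    VersioningConfiguration={"Status": "Enabled"},
)
```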
> Is there some way to "copy" files from S3 to Glacier that I have missed?
Not that I know of.
> So what happens if I upload a file/archive and then, later, the file changes locally? The next time I do a backup, how does Glacier deal with this, since it can't overwrite the file with a new version?
Per the Glacier FAQ:
> You store data in Amazon Glacier as an archive. Each archive is assigned a unique archive ID that can later be used to retrieve the data. An archive can represent a single file or you may choose to combine several files to be uploaded as a single archive. You upload archives into vaults. Vaults are collections of archives that you use to organize your data.
So what this means is that each archive you upload is assigned a unique ID. Upload the same file twice and each copy gets its own ID. This gives you the ability to restore previous versions of the file if desired.
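You can see this with a quick boto3 sketch (the vault name is hypothetical and the vault must already exist): uploading the same bytes twice produces two independent archives with distinct IDs.

```python
import boto3

glacier = boto3.client("glacier")
VAULT = "my-backup-vault"  # hypothetical vault name

with open("backup.tar.gz", "rb") as f:
    data = f.read()

# Glacier never overwrites: each upload is a brand-new archive.
first = glacier.upload_archive(vaultName=VAULT, accountId="-", body=data)
second = glacier.upload_archive(vaultName=VAULT, accountId="-", body=data)

print(first["archiveId"])
print(second["archiveId"])  # a different ID, stored (and billed) separately
```

Record those archive IDs locally at upload time; retrieving Glacier's own vault inventory is a multi-hour job, so a local record is what you'd actually prune against.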
> Use the locally stored archive inventory to determine what data doesn't exist anymore and, if it's more than 3 months old, delete it from Glacier? That seems straightforward, but is there a better approach to this?
To avoid the surcharge for deleting data less than 3 months old, this is likely the best approach. But it won't just be the data that no longer exists that you need to track and delete. As mentioned above, any time a file changes and you re-upload it to Glacier, you'll get a new ID for the file. You'll eventually want to delete the older versions of that file as well, assuming you don't want the ability to restore to them.
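A sketch of that pruning decision, assuming a hypothetical local inventory where each record carries the archive ID, an ISO-8601 upload timestamp, and a flag marking it stale (deleted locally or superseded by a re-upload):

```python
from datetime import datetime, timedelta, timezone

import boto3

glacier = boto3.client("glacier")

def prune(record, vault="my-backup-vault"):  # hypothetical vault name
    """Delete a stale Glacier archive only once it is past 3 months old.

    `record` is a hypothetical inventory entry, e.g.:
    {"archive_id": "...", "uploaded_at": "2013-01-01T00:00:00+00:00",
     "stale": True}
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    uploaded = datetime.fromisoformat(record["uploaded_at"])
    if record["stale"] and uploaded < cutoff:
        glacier.delete_archive(vaultName=vault, accountId="-",
                               archiveId=record["archive_id"])
        return True   # safe to drop from the local inventory as well
    return False      # keep for now; deleting early incurs the surcharge
```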
> If a 20 MB zip file is uploaded that contains 10,000 files, and one of those files is changed locally, do I need to upload another 20 MB zip file? Now I'm required to eat the cost of storing 2 copies of almost everything in those zip files... Also, how would I deal with deleting things in a ZIP file that don't exist locally anymore? Since I don't want to delete the whole zip file, now I'm incurring fees to store files that don't exist anymore.
That's the tradeoff you really have to decide for yourself: do you tar/zip everything and accept having to track those archives and everything in them, or is it worth it to upload files individually so you can purge each file as it's no longer needed?
A couple of other approaches you might consider:
- Have two or more tar/zip archives, one that contains files that are highly unlikely to change (like system files) and the other(s) containing configuration files and other things that are more likely to change over time.
- Don't bother with tracking individual files; back everything up in a single tar/zip archive that gets uploaded to Glacier. As each archive reaches the 3-month point (or possibly even later), just delete it. That gives you a very easy way to track and restore from a given point in time (a rough sketch of this rotation follows below).
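Here's that rotation sketched out (vault name, file path, and the local log format are all hypothetical): upload tonight's full archive, log its ID, and delete any logged archive past the 3-month mark.

```python
import json
from datetime import datetime, timedelta, timezone

import boto3

glacier = boto3.client("glacier")
VAULT = "my-backup-vault"  # hypothetical vault name

# Upload tonight's full backup as a single archive.
with open("/backups/full-backup.tar.gz", "rb") as f:  # placeholder path
    result = glacier.upload_archive(vaultName=VAULT, accountId="-",
                                    body=f.read())

# rotation.json is a hypothetical local log of past uploads.
with open("rotation.json") as f:
    log = json.load(f)
log.append({"archive_id": result["archiveId"],
            "uploaded_at": datetime.now(timezone.utc).isoformat()})

# Every entry still in the log is a complete point-in-time restore;
# anything older than 90 days can go without the early-deletion fee.
cutoff = datetime.now(timezone.utc) - timedelta(days=90)
kept = []
for entry in log:
    if datetime.fromisoformat(entry["uploaded_at"]) < cutoff:
        glacier.delete_archive(vaultName=VAULT, accountId="-",
                               archiveId=entry["archive_id"])
    else:
        kept.append(entry)

with open("rotation.json", "w") as f:
    json.dump(kept, f, indent=2)
```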
Having said all that, however, Glacier may just not be the best approach for your needs. Glacier is really meant for data archiving, which is different from just backing up servers. If you just want to do incremental backups of a server, then using S3 instead of Glacier might be a better approach. Using a tool like Duplicity or rdiff-backup (in conjunction with something like s3fs) would give you the ability to take incremental backups to an S3 bucket and manage them very easily. I've used rdiff-backup on a few Linux systems over the years and found it worked quite nicely.
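For what it's worth, duplicity talks to S3 directly, so a nightly incremental can be as simple as wrapping its CLI (the paths, bucket, and credentials below are placeholders, and the exact S3 URL scheme varies between duplicity versions):

```python
import os
import subprocess

# duplicity reads AWS credentials from these environment variables.
env = dict(os.environ)
env["AWS_ACCESS_KEY_ID"] = "..."      # placeholder
env["AWS_SECRET_ACCESS_KEY"] = "..."  # placeholder

# The first run performs a full backup; subsequent runs are incremental.
subprocess.run(
    ["duplicity", "/var/www",
     "s3://s3.amazonaws.com/my-backup-bucket/www"],  # hypothetical target
    env=env,
    check=True,
)
```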
Be aware that there is a cost to transition objects to the Glacier storage class: approximately US$0.05 per 1,000 transition requests, depending on region, so transitioning 1,000,000 objects to Glacier would cost approximately US$50.