Linux server sync to an Amazon S3 bucket

amazon s3, rsync, s3cmd, s3fs

I am looking for a stable solution to replace a classic server-to-server backup done with rsync.
I have to sync a whole filesystem (more than 1 TB) to Amazon S3.

What I have tried so far:

Solution 1:
I mapped the S3 bucket to a mount point on the system using s3fs.
The system becomes unstable and transfers are really slow. This is in no way a solution.

Solution 2:
Using the s3cmd sync command. Everything goes smoothly and at good speed (at least for folders smaller than 2 GB).
The problem comes when I try to sync the whole filesystem on the server (with some exclusions). The process just hangs.

Any hints?

Best Answer

This is a bad way to do backups. You should be separating your OS configuration from your valuable data. None of your file permissions will be transferred, and in the Linux world they are a necessity if you're planning on restoring from backups (which you should be: backups without verified restorations are pointless).

Firstly, you can synchronise your valuable instance data (e.g. /var/www) to S3 using s3cmd sync as you've stated.
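As a rough sketch of that first step, the two helpers below sync one data directory at a time instead of the whole filesystem, and optionally pack a directory into a tar archive first so that file modes and ownership are recorded inside the archive. The bucket name, the `*.tmp` exclusion, and the function names `sync_dir` and `pack_dir` are placeholders chosen for illustration; they are not from the question. Splitting the job per directory makes a stalled run easier to spot and retry, but it is not guaranteed to cure the hang described in the question.

```shell
#!/bin/sh
# Hypothetical bucket name; replace with your own.
BUCKET="${BUCKET:-s3://example-backup-bucket}"

# Sync a single directory to the same path under the bucket.
# S3CMD can be overridden (e.g. S3CMD=echo) to preview the command.
sync_dir() {
    src="${1%/}"                 # strip one trailing slash, if any
    dest="$BUCKET$src/"
    # The trailing slash on the source uploads the directory's contents.
    ${S3CMD:-s3cmd} sync --exclude '*.tmp' "$src/" "$dest"
}

# Pack a directory into a gzip-compressed tar archive. The archive
# stores each file's mode and owner, which a plain S3 upload does not.
pack_dir() {
    src="${1%/}"
    out="$2"
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}

# Example usage (assumed paths):
#   for d in /var/www /home /srv; do sync_dir "$d"; done
#   pack_dir /var/www /tmp/www.tar.gz
```

When restoring such an archive, extract it with `tar -xpzf` so the recorded permissions are applied.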

Secondly, using a configuration management utility such as Puppet or Chef, you can spin up a new instance of your OS with minimal effort, ensuring a fresh and reliable set of configurations.

There are no details of your underlying architecture in your question (EC2? VMware? KVM? Xen? Physical hardware?), so I can't recommend any specific tools (such as architecture-specific snapshotting). If you're running on a virtual platform (e.g. EC2, VMware, KVM), you should be using that platform's snapshotting facility.

Related Solutions

Amazon S3 – How to Get the Size of an Amazon S3 Bucket

The AWS CLI now supports the --query parameter, which takes a JMESPath expression.

This means you can sum the size values given by list-objects using sum(Contents[].Size) and count the objects using length(Contents[]).

This can be run using the official AWS CLI as below; the feature was introduced in February 2014:

 aws s3api list-objects --bucket BUCKETNAME --output json --query "[sum(Contents[].Size), length(Contents[])]"
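The first number returned by that query is a total in bytes, which is hard to read for large buckets. A small helper along these lines can convert it to GiB; the function name `bytes_to_gib` is made up for this example and is not part of the AWS CLI.

```shell
#!/bin/sh
# Convert a byte count to GiB with two decimal places.
# awk does the arithmetic in floating point, so large totals are fine.
bytes_to_gib() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", n / (1024 * 1024 * 1024) }'
}

# Example usage:
#   bytes_to_gib 1610612736    # prints 1.50
```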

Linux filesystem or CDN for millions of files with replication

Millions of files in one directory is bad design and will be slow. Subdivide them into directories with a smaller number of entries each.

Take a look at https://unix.stackexchange.com/questions/3733/number-of-files-per-directory
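One common way to subdivide is to derive the directory from a hash of the file name, so files spread evenly across a fixed set of subdirectories. The sketch below assumes a two-level layout built from the first four hex digits of the MD5 of the name (256 x 256 directories); both the layout and the function name `shard_path` are illustrative choices, not something the answer prescribes.

```shell
#!/bin/sh
# Map a file name to a sharded relative path such as "5d/41/hello".
# Uses md5sum from GNU coreutils; the hash is only used for spreading
# files out, not for security.
shard_path() {
    name="$1"
    hash=$(printf '%s' "$name" | md5sum | cut -c1-4)
    a=$(printf '%s' "$hash" | cut -c1-2)   # first-level directory
    b=$(printf '%s' "$hash" | cut -c3-4)   # second-level directory
    printf '%s/%s/%s\n' "$a" "$b" "$name"
}

# Example usage (assumed target root /data):
#   p=$(shard_path "photo123.jpg")
#   mkdir -p "/data/$(dirname "$p")" && mv photo123.jpg "/data/$p"
```

Because the path is computed from the name alone, any process that knows the file name can find the file again without a lookup table.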

Use RAID and/or SSDs. This will not in itself solve the slow access times, but if you introduce multiple directories and reduce the number of files per directory, say by an order of magnitude or two, it will help to prevent hotspots.

Consider XFS, especially when using multiple drives and multiple directories; it may give you nice gains (see e.g. this thread for options to use. It gives some tips for XFS on md RAID).
