I have to copy 400GB of files from an Elastic Block Store (EBS) volume to an S3 bucket… that's about 300k files of ~1MB each.
I've tried s3cmd and s3fuse; both are really, really slow. s3cmd ran for a full day, said it had finished copying, and when I checked the bucket, nothing was there (I suppose something went wrong, but at least s3cmd never complained about anything).
s3fuse has been running for another full day and has copied less than 10% of the files…
Is there a better solution for this?
I'm running Linux (Ubuntu 12.04), of course.
Best Answer
There are several key factors that determine throughput from EC2 to S3: the size of the objects being uploaded, the number of parallel upload threads, and the instance size (which determines the available network bandwidth).
When transferring large amounts of data, it may be economically practical to use a cluster compute instance, as the effective gain in throughput (>10x) is more than the difference in cost (2-3x).
While the above ideas are fairly logical (although the per-thread cap may not be), it is quite easy to find benchmarks backing them up. One particularly detailed one can be found here.
Using between 64 and 128 parallel (simultaneous) uploads of 1MB objects should saturate the 1Gbps uplink that an m1.xlarge has and should even saturate the 10Gbps uplink of a cluster compute (cc1.4xlarge) instance.
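As a rough illustration of that level of concurrency, here is a minimal sketch that walks a directory and uploads each file with a pool of worker threads. It assumes boto3 is installed and AWS credentials are configured; the source directory, bucket name, and worker count below are placeholders, not part of the original answer.

    #!/usr/bin/env python
    """Upload every file under SRC_DIR to an S3 bucket using parallel threads."""
    import os
    import concurrent.futures

    import boto3

    SRC_DIR = "/mnt/data"        # hypothetical mount point of the EBS volume
    BUCKET = "my-target-bucket"  # hypothetical bucket name
    WORKERS = 64                 # throughput tends to peak around 64-128 threads

    s3 = boto3.client("s3")      # boto3 clients can be shared across threads


    def iter_files(root):
        """Yield the full path of every file under root."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                yield os.path.join(dirpath, name)


    def upload(path):
        """Upload one file; the S3 key mirrors the path relative to SRC_DIR."""
        key = os.path.relpath(path, SRC_DIR)
        s3.upload_file(path, BUCKET, key)
        return key


    if __name__ == "__main__":
        with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
            for key in pool.map(upload, iter_files(SRC_DIR)):
                print("uploaded", key)

Since the workload is network-bound (many small objects), threads are sufficient here; the worker count is the knob to tune against the instance's uplink.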
While it is fairly easy to change instance size, the other two factors may be harder to manage.