AWS S3 Sync options duplicates / weird behaviour

amazon-s3 amazon-web-services

I am trying to sync a folder with 2M+ files to S3. The first run went mostly OK, but about 40,000 files were not uploaded (the server crashed partway through). When I ran the sync command again, it started from zero: even though 2M − 40K images were already on S3, it re-uploaded all 2M images, creating "duplicates".

Why do I say "duplicates"? Because a listing before the re-sync showed the bucket was 40K files short of the origin, but a listing a few minutes into the re-sync showed it was 80K files *over* the origin. How can it have 80K more files than the source? Duplicates? Versioning? History?

So I'm trying to upload only the missing 40K files. Those files are at the end of the folder, so if the sync starts over, it will take another day to re-upload the same 2M files…

I hope I've explained it correctly.

TL;DR: An interrupted sync of 2M files to S3 left 40K files un-uploaded. How can I upload only those 40K files instead of all 2M?

Best Answer

Your scenario is exactly what the s3 sync tool is designed for. `aws s3 sync local_directory s3://your_bucket_location` only copies files that are missing from (or changed on) the destination, which should do exactly what you're asking.

Are you using the AWS CLI tools? If so, can you try a dry run first (note the flag is `--dryrun`, not `--dry-run`) and let us know whether it thinks the difference is ~40K files or actually all 2M+?
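For reference, a dry run would look something like this (the local path and bucket name below are placeholders; substitute your own):

```shell
# Preview what sync WOULD transfer, without uploading anything.
# "/data/images" and "s3://my-bucket/images" are placeholder paths.
aws s3 sync /data/images s3://my-bucket/images --dryrun

# If the preview lists only the missing ~40K files, run it for real:
aws s3 sync /data/images s3://my-bucket/images
```

If the dry run lists all 2M files, something (e.g. changed modification times after the crash) is making sync consider every file different, and that is the real problem to chase.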

EDIT: the s3 sync docs, just in case: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
