The AWS CLI now supports the --query parameter, which takes a JMESPath expression. This means you can sum the size values given by list-objects using sum(Contents[].Size) and count the objects with length(Contents[]).
This can be run using the official AWS CLI as shown below; --query support was introduced in February 2014:
aws s3api list-objects --bucket BUCKETNAME --output json --query "[sum(Contents[].Size), length(Contents[])]"
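If you'd rather not rely on JMESPath, the same sum and count can be computed in plain Python over a saved JSON response. The bucket contents below are invented purely for illustration:

```python
import json

# A trimmed example of what `aws s3api list-objects ... --output json`
# returns; the keys and sizes here are made up.
response = json.loads("""
{
  "Contents": [
    {"Key": "logs/a.gz", "Size": 1024},
    {"Key": "logs/b.gz", "Size": 2048},
    {"Key": "logs/c.gz", "Size": 512}
  ]
}
""")

# Equivalent of the JMESPath expression [sum(Contents[].Size), length(Contents[])]
total_size = sum(obj["Size"] for obj in response["Contents"])
object_count = len(response["Contents"])

print([total_size, object_count])  # → [3584, 3]
```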
This is really bordering on "Do my system architecture for you", but your four ideas are interesting case studies in variable security, so let's run through your options and see how they fare:
4. Checking referrer
The referrer is provided by the client. Trusting the client-provided authentication/authorization data pretty much voids security (I can just claim to have been sent from where you expect me to come from).
Verdict: TERRIBAD idea - trivial to bypass.
3. Download the files through our server
Not a bad idea, as long as you're willing to spend the bandwidth to make it happen, and your server is reliable.
Going on the assumption that you've already solved the security problem for your normal server/app, this is the most secure of the options you've presented.
Verdict: Good solution. Very secure, but possibly suboptimal if bandwidth is a factor.
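The "pass it through your server" pattern can be sketched in a few lines: an HTTP handler that applies your app's normal auth check before handing the bytes over. The token, path, and file contents below are hypothetical stand-ins for your real session check and for data fetched from S3:

```python
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical token and in-memory "file store" standing in for your
# app's real session check and for objects fetched from S3.
VALID_TOKEN = "secret-session-token"
FILES = {"/report.pdf": b"%PDF-1.4 fake report bytes"}

class DownloadProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Authenticate the way your app normally does (here: a bearer token).
        if self.headers.get("Authorization") != f"Bearer {VALID_TOKEN}":
            self.send_error(403, "Forbidden")
            return
        body = FILES.get(self.path)
        if body is None:
            self.send_error(404, "Not Found")
            return
        # In a real setup you would stream the object from S3 here
        # instead of reading it out of a dict.
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), DownloadProxy)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# An authorized request succeeds...
req = urllib.request.Request(
    f"http://127.0.0.1:{port}/report.pdf",
    headers={"Authorization": f"Bearer {VALID_TOKEN}"},
)
with urllib.request.urlopen(req) as resp:
    data = resp.read()

# ...an unauthenticated one is rejected.
denied_code = None
try:
    urllib.request.urlopen(f"http://127.0.0.1:{port}/report.pdf")
except urllib.error.HTTPError as e:
    denied_code = e.code

print(len(data), denied_code)
server.shutdown()
```

The cost, as noted, is that every download byte transits your server; the benefit is that access control stays exactly as strong as the rest of your app.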
2. Obfuscated URLs
Security Through Obscurity? Really? No.
I'm not even going to analyze it. Just no.
Verdict: If #4 was TERRIBAD this is TERRIWORSE, because people don't even have to go through the effort of forging a referrer header. Guess the string and win a prize: all the data!
1. Generating (expiring) signed urls with PHP
This option has a pretty low suck quotient.
Anyone can click on the URL and snarf the data, which is a security no-no, but you mitigate this by making the link expire (as long as the link life is short enough the vulnerability window is small).
The URL expiring may inconvenience some users who want to hang on to the download link for a long time, or who don't get the link in a timely manner -- that's a bit of a User Experience suck, but it may be worth it.
Verdict: Not as good as #3, but if bandwidth is a major concern it's certainly better than #4 or #2.
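The general shape of an expiring signed URL is a timestamp plus an HMAC over the path and expiry, verified server-side. The sketch below is not AWS's actual signing scheme (for real S3 pre-signed URLs, use `aws s3 presign` or boto3's `generate_presigned_url`), just a minimal illustration of the idea, with a made-up secret and hostname:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical server-side secret -- never exposed to clients.
SECRET = b"server-side-secret"

def sign(path, expires):
    msg = f"{path}:{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def make_url(path, ttl_seconds, now=None):
    expires = (now if now is not None else int(time.time())) + ttl_seconds
    qs = urlencode({"expires": expires, "sig": sign(path, expires)})
    return f"https://files.example.com{path}?{qs}"

def is_valid(path, expires, sig, now=None):
    now = now if now is not None else int(time.time())
    if now > expires:
        return False  # link has expired -- the vulnerability window closed
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sig, sign(path, expires))

url = make_url("/report.pdf", ttl_seconds=600, now=1_700_000_000)
print(url)
```

Keeping ttl_seconds small is exactly the trade-off described above: a shorter window means better security but more users hitting dead links.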
What would I do?
Given these options, I would go with #3 -- Pass the files through your own front-end server, and authenticate the way your app normally does. Assuming your normal security is pretty decent this is the best option from a security standpoint.
Yes, this means more bandwidth use on your server, and more resources playing middleman -- but you can always just charge a tiny bit more for that.
I think this may be a case of mismatched expectations regarding what functionality S3 provides.
S3 does not actually have any structure: the bucket just holds a flat set of objects, with the full string that might be seen as the "path" being the key of each object.
The ListObjectsV2 API action, however, provides features like specifying a prefix (only return objects whose keys start with some particular string) and specifying a delimiter (split keys on the provided delimiter and group repeating key segments), which allow you to present the contents of a bucket as if it had structure (this is what the AWS Console does, for instance).
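The prefix/delimiter mechanics can be mimicked over a flat key list. The keys below are invented; the function sketches what the service does server-side when you pass Prefix and Delimiter:

```python
# Invented flat key list -- what a bucket actually stores.
keys = [
    "logs/2023/01/a.gz",
    "logs/2023/02/b.gz",
    "logs/2024/01/c.gz",
    "images/cat.png",
]

def list_objects(keys, prefix="", delimiter=None):
    # Mimics ListObjectsV2's Prefix/Delimiter behaviour over a flat list.
    matches = [k for k in keys if k.startswith(prefix)]
    if delimiter is None:
        return {"Contents": matches, "CommonPrefixes": []}
    contents, common = [], set()
    for k in matches:
        rest = k[len(prefix):]
        if delimiter in rest:
            # Group everything up to the next delimiter, like a "folder".
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(k)
    return {"Contents": contents, "CommonPrefixes": sorted(common)}

result = list_objects(keys, prefix="logs/", delimiter="/")
print(result)
# → {'Contents': [], 'CommonPrefixes': ['logs/2023/', 'logs/2024/']}
```

The "folders" you see in the console are just these CommonPrefixes groupings; no directory objects exist.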
The aws s3 sync utility presumably also starts from the normal ListObjectsV2 API action, but this API has no functionality equivalent to the --exclude (or --include) options of the sync utility, only the option of getting the list filtered by key prefix. Hence it would appear that the sync utility has to apply those more flexible filtering options on the client side as it processes the full list of objects for the specified prefix, which will never really be efficient if a high number of objects under the specified prefix are supposed to be skipped.
What you probably want to do in your scenario is to specify the exact prefix or prefixes you want, rather than specifying a more generic prefix and filtering out what you don't want. If what you want is not identifiable by prefix, you may want to consider changing your key naming so that there is some known prefix you can specify. (Or possibly even use separate buckets for different types of data, if that makes more sense for your situation.)
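The difference between the two approaches can be sketched with a toy key list (the keys and patterns below are made up). The first filter is what --exclude effectively forces: list the broad prefix, then discard on the client. The second is what the API can do for you, so unwanted objects are never listed at all:

```python
from fnmatch import fnmatch

# Invented key list; imagine millions of these in a real bucket.
keys = [
    "data/2024/keep/a.csv",
    "data/2024/keep/b.csv",
    "data/2024/skip/c.tmp",
    "data/2023/skip/d.tmp",
]

# What sync effectively does with --exclude: list everything under
# the broad prefix, then filter each key on the client.
client_filtered = [
    k for k in keys
    if k.startswith("data/") and not fnmatch(k, "*.tmp")
]

# What a well-chosen prefix gives you: only the wanted keys are
# ever returned, so there is nothing to skip client-side.
server_filtered = [k for k in keys if k.startswith("data/2024/keep/")]

print(client_filtered)
print(server_filtered)
# Both print ['data/2024/keep/a.csv', 'data/2024/keep/b.csv']
```

With a real bucket, the first approach pays for listing every skipped object; the second pays only for what you actually want.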