The AWS CLI now supports the --query parameter, which takes a JMESPath expression. This means you can sum the size values given by list-objects using sum(Contents[].Size) and count the objects with length(Contents[]).
This can be run using the official AWS CLI as shown below; --query support was introduced in February 2014:
aws s3api list-objects --bucket BUCKETNAME --output json --query "[sum(Contents[].Size), length(Contents[])]"
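If you'd rather not rely on JMESPath, the same sum and count can be computed in plain Python over a saved JSON response. The bucket contents below are invented purely for illustration:

```python
import json

# A trimmed example of what `aws s3api list-objects ... --output json`
# returns; the keys and sizes here are made up.
response = json.loads("""
{
  "Contents": [
    {"Key": "logs/a.gz", "Size": 1024},
    {"Key": "logs/b.gz", "Size": 2048},
    {"Key": "logs/c.gz", "Size": 512}
  ]
}
""")

# Equivalent of the JMESPath expression [sum(Contents[].Size), length(Contents[])]
total_size = sum(obj["Size"] for obj in response["Contents"])
object_count = len(response["Contents"])

print([total_size, object_count])  # → [3584, 3]
```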
This is really bordering on "Do my system architecture for you", but your four ideas are interesting case studies in variable security, so let's run through your options and see how they fare:
4. Checking referrer
The referrer is provided by the client. Trusting the client-provided authentication/authorization data pretty much voids security (I can just claim to have been sent from where you expect me to come from).
Verdict: TERRIBAD idea - trivial to bypass.
3. Download the files through our server
Not a bad idea, as long as you're willing to spend the bandwidth to make it happen, and your server is reliable.
Going on the assumption that you've already solved the security problem for your normal server/app, this is the most secure of the options you've presented.
Verdict: Good solution. Very secure, but possibly suboptimal if bandwidth is a factor.
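The "pass it through your server" pattern can be sketched in a few lines: an HTTP handler that applies your app's normal auth check before handing the bytes over. The token, path, and file contents below are hypothetical stand-ins for your real session check and for data fetched from S3:

```python
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical token and in-memory "file store" standing in for your
# app's real session check and for objects fetched from S3.
VALID_TOKEN = "secret-session-token"
FILES = {"/report.pdf": b"%PDF-1.4 fake report bytes"}

class DownloadProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Authenticate the way your app normally does (here: a bearer token).
        if self.headers.get("Authorization") != f"Bearer {VALID_TOKEN}":
            self.send_error(403, "Forbidden")
            return
        body = FILES.get(self.path)
        if body is None:
            self.send_error(404, "Not Found")
            return
        # In a real setup you would stream the object from S3 here
        # instead of reading it out of a dict.
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), DownloadProxy)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# An authorized request succeeds...
req = urllib.request.Request(
    f"http://127.0.0.1:{port}/report.pdf",
    headers={"Authorization": f"Bearer {VALID_TOKEN}"},
)
with urllib.request.urlopen(req) as resp:
    data = resp.read()

# ...an unauthenticated one is rejected.
denied_code = None
try:
    urllib.request.urlopen(f"http://127.0.0.1:{port}/report.pdf")
except urllib.error.HTTPError as e:
    denied_code = e.code

print(len(data), denied_code)
server.shutdown()
```

The cost, as noted, is that every download byte transits your server; the benefit is that access control stays exactly as strong as the rest of your app.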
2. Obfuscated URLs
Security Through Obscurity? Really? No.
I'm not even going to analyze it. Just no.
Verdict: If #4 was TERRIBAD this is TERRIWORSE, because people don't even have to go through the effort of forging a referrer header. Guess the string and win a prize: all the data!
1. Generating (expiring) signed urls with PHP
This option has a pretty low suck quotient.
Anyone can click on the URL and snarf the data, which is a security no-no, but you mitigate this by making the link expire (as long as the link life is short enough the vulnerability window is small).
The URL expiring may inconvenience some users who want to hang on to the download link for a long time, or who don't get the link in a timely manner -- that's a bit of a User Experience suck, but it may be worth it.
Verdict: Not as good as #3, but if bandwidth is a major concern it's certainly better than #4 or #2.
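The general shape of an expiring signed URL is a timestamp plus an HMAC over the path and expiry, verified server-side. The sketch below is not AWS's actual signing scheme (for real S3 pre-signed URLs, use `aws s3 presign` or boto3's `generate_presigned_url`), just a minimal illustration of the idea, with a made-up secret and hostname:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical server-side secret -- never exposed to clients.
SECRET = b"server-side-secret"

def sign(path, expires):
    msg = f"{path}:{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def make_url(path, ttl_seconds, now=None):
    expires = (now if now is not None else int(time.time())) + ttl_seconds
    qs = urlencode({"expires": expires, "sig": sign(path, expires)})
    return f"https://files.example.com{path}?{qs}"

def is_valid(path, expires, sig, now=None):
    now = now if now is not None else int(time.time())
    if now > expires:
        return False  # link has expired -- the vulnerability window closed
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sig, sign(path, expires))

url = make_url("/report.pdf", ttl_seconds=600, now=1_700_000_000)
print(url)
```

Keeping ttl_seconds small is exactly the trade-off described above: a shorter window means better security but more users hitting dead links.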
What would I do?
Given these options, I would go with #3 -- Pass the files through your own front-end server, and authenticate the way your app normally does. Assuming your normal security is pretty decent this is the best option from a security standpoint.
Yes, this means more bandwidth use on your server, and more resources playing middleman -- but you can always just charge a tiny bit more for that.
I think this may be a case of mismatched expectations regarding what functionality S3 provides.
S3 does not actually have any structure: the bucket just holds a flat set of objects, with the full string that might be seen as the "path" being the key of each object.
The ListObjectsV2 API action, however, provides features like specifying a prefix (only return objects whose keys start with some particular string) and specifying a delimiter (split keys on the provided delimiter and group repeating key segments), which allow you to present the contents of a bucket as if it had structure (this is what the AWS Console does, for instance).
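The prefix/delimiter mechanics can be mimicked over a flat key list. The keys below are invented; the function sketches what the service does server-side when you pass Prefix and Delimiter:

```python
# Invented flat key list -- what a bucket actually stores.
keys = [
    "logs/2023/01/a.gz",
    "logs/2023/02/b.gz",
    "logs/2024/01/c.gz",
    "images/cat.png",
]

def list_objects(keys, prefix="", delimiter=None):
    # Mimics ListObjectsV2's Prefix/Delimiter behaviour over a flat list.
    matches = [k for k in keys if k.startswith(prefix)]
    if delimiter is None:
        return {"Contents": matches, "CommonPrefixes": []}
    contents, common = [], set()
    for k in matches:
        rest = k[len(prefix):]
        if delimiter in rest:
            # Group everything up to the next delimiter, like a "folder".
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(k)
    return {"Contents": contents, "CommonPrefixes": sorted(common)}

result = list_objects(keys, prefix="logs/", delimiter="/")
print(result)
# → {'Contents': [], 'CommonPrefixes': ['logs/2023/', 'logs/2024/']}
```

The "folders" you see in the console are just these CommonPrefixes groupings; no directory objects exist.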
The aws s3 sync utility presumably also starts from the normal ListObjectsV2 API action, but this API has no functionality equivalent to the --exclude (or --include) options of the sync utility, only the option of getting the list filtered by key prefix. Hence it would appear that the sync utility has to apply those more flexible filtering options on the client side as it processes the full list of objects for the specified prefix, which will never really be efficient if a high number of objects under the specified prefix are supposed to be skipped.
What you probably want to do in your scenario is to specify the exact prefix or prefixes you want, rather than specifying a more generic prefix and filtering out what you don't want. If what you want is not identifiable by prefix, you may want to consider changing your key naming so that there is some known prefix you can specify. (Or possibly even use separate buckets for different types of data, if that makes more sense for your situation.)
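The difference between the two approaches can be sketched with a toy key list (the keys and patterns below are made up). The first filter is what --exclude effectively forces: list the broad prefix, then discard on the client. The second is what the API can do for you, so unwanted objects are never listed at all:

```python
from fnmatch import fnmatch

# Invented key list; imagine millions of these in a real bucket.
keys = [
    "data/2024/keep/a.csv",
    "data/2024/keep/b.csv",
    "data/2024/skip/c.tmp",
    "data/2023/skip/d.tmp",
]

# What sync effectively does with --exclude: list everything under
# the broad prefix, then filter each key on the client.
client_filtered = [
    k for k in keys
    if k.startswith("data/") and not fnmatch(k, "*.tmp")
]

# What a well-chosen prefix gives you: only the wanted keys are
# ever returned, so there is nothing to skip client-side.
server_filtered = [k for k in keys if k.startswith("data/2024/keep/")]

print(client_filtered)
print(server_filtered)
# Both print ['data/2024/keep/a.csv', 'data/2024/keep/b.csv']
```

With a real bucket, the first approach pays for listing every skipped object; the second pays only for what you actually want.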