Forcing CloudFront to pass through the latest HTML file from S3

amazon-s3, amazon-cloudfront, amazon-web-services

Background

I'm hosting a static site on S3, with CloudFront over the top. The issue I have is with my HTML files.

According to CloudFront's FAQ:

Amazon CloudFront uses these cache control headers to determine how frequently it needs to check the origin for an updated version of that file.

What I've done so far

With this in mind, I've set the HTML files in my S3 bucket to include the following headers:

Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Expires: Fri, 01 Jan 1990 00:00:00 GMT
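
If you're setting these headers via the API rather than the console, a sketch along these lines works; this uses present-day boto3, and the bucket/key names are placeholders. S3 object metadata can't be edited in place, so the object is copied onto itself:

import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

# Placeholder names for illustration.
bucket = "my-site-bucket"
key = "samplefile.htm"

# Copying the object onto itself with MetadataDirective="REPLACE"
# rewrites its stored headers.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    MetadataDirective="REPLACE",
    ContentType="text/html",
    CacheControl="no-cache, no-store, max-age=0, must-revalidate",
    Expires=datetime(1990, 1, 1, tzinfo=timezone.utc),
)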

On my first request for samplefile.htm, I see the following response headers (I've excluded the obvious ones, e.g. Content-Type, to keep to the point):

Cache-Control:no-cache, no-store, max-age=0, must-revalidate
Date:Sat, 10 Dec 2011 14:16:51 GMT
ETag:"a5890ace30a3e84d9118196c161aeec2"
Expires:Fri, 01 Jan 1990 00:00:00 GMT
Last-Modified:Sat, 10 Dec 2011 14:16:43 GMT
Server:AmazonS3
X-Cache:Miss from cloudfront

As you can see, my Cache-Control header is in there. The problem is that if I update this file and refresh, I still get the cached content (rather than the latest file), and I can see from the response headers that CloudFront is serving its cached version:

X-Cache:Hit from cloudfront

Summary/question

With the above in mind, how can I achieve automatic retrieval of the latest HTML when using CloudFront?

As per its FAQ, I should be able to do this with Cache-Control headers, but I can't seem to get it working.

Following the answers below

In the end I decided to change my www CNAME to point to my S3 bucket directly, and then added a new CNAME called "static", which points to CloudFront.

This means that the HTML is served directly from S3, and all of its CSS/JS/IMG references point to static.mydomain.com.
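
For reference, the resulting DNS looks roughly like this; the S3 website endpoint and CloudFront domain below are placeholders (note that for S3 website hosting, the bucket has to be named after the host):

www.mydomain.com     CNAME   www.mydomain.com.s3-website-us-east-1.amazonaws.com
static.mydomain.com  CNAME   d111111abcdef8.cloudfront.net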

Best Answer

Firstly, the point of CloudFront is to serve cached content. If you try to serve uncached content from CloudFront, it is slower than serving it directly from S3 in almost all cases (something like streaming content would be the exception). Consider for a moment what needs to happen to serve content from CloudFront: it has to be retrieved from the origin server to a location that is geographically close to the user. This means that for any request where CloudFront must fetch content from the origin, you add extra latency, and the user receives the content more slowly. Only once the content is available at the edge location are subsequent requests faster.

The best approach to this problem is to change your filenames when you update a page; this forces CloudFront to retrieve the new content. Again, keep in mind that CloudFront is typically used for media files (including images) and stylesheets/JavaScript, and not so much for HTML. Essentially, you would have your HTML on S3 and your images on CloudFront, and whenever you make a change, you change the name of the file on CloudFront (e.g. file-v1.jpg, file-v2.jpg, etc.). Another common approach is to include a query string with version information.
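
A minimal sketch of the filename-versioning idea, assuming a build step in Python (the function name is mine; any content hash works):

import hashlib
from pathlib import Path

def fingerprinted_name(path: str) -> str:
    """Return the filename with a short content hash, e.g. style.5f3d2a1c.css."""
    p = Path(path)
    digest = hashlib.md5(p.read_bytes()).hexdigest()[:8]
    return f"{p.stem}.{digest}{p.suffix}"

# Renaming an asset whenever its bytes change means CloudFront sees a
# brand-new object, so stale cached copies are simply never requested.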

Also, keep in mind that CloudFront does not serve gzipped content, which may result in a slower response than from a regular server (although, in your case, S3 doesn't identify gzip-capable browsers either).

Finally, if you want to, you can use invalidation to force CloudFront to discard its existing copy and fetch a new one from the origin server. Note, however, that CloudFront gives you only 1,000 free invalidations per month, after which the cost is $0.005 per invalidation.
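
As a sketch, an invalidation request via present-day boto3 looks like this (the distribution ID is a placeholder):

import time
import boto3

cloudfront = boto3.client("cloudfront")

cloudfront.create_invalidation(
    DistributionId="E1ABCDEF",  # placeholder; use your distribution's ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/samplefile.htm"]},
        # CallerReference must be unique per request to avoid duplicates.
        "CallerReference": str(time.time()),
    },
)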

The minimum time CloudFront will keep content is 1 hour, although the default is 24 hours. I'd therefore try setting max-age to at least 3600. Consider also an s-maxage header (for shared, i.e. proxied, content). Amazon recommends this caching tutorial.
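
For example, a header like the following (values are illustrative) keeps browser caching short while letting CloudFront, as a shared cache, hold the object for longer:

Cache-Control: max-age=3600, s-maxage=86400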

There was a recent problem with this, which was rectified a few days ago.