So, here are a few points that may help you out.
- Provider: For most of the front-end side of things, if cost is your main factor, it is up to you to find out which provider suits your needs. Reliability, cost, and scaling are all factors you will need to consider.
- Note: unless you have the user run some kind of client-side uploader (Flash, JavaScript, etc.), your servers will have to receive each file and then upload it to S3 on the user's behalf. This adds significant load as well as bandwidth costs. However, it also gives you much better control over what can be uploaded and how. Once you hand control over to the client, you can no longer truly control what gets uploaded.
- S3 is great for storing static content, and it will be key to building a site like this while keeping costs in line. Make sure you properly control who has upload permissions to which buckets. For example, if you keep CSS and JavaScript in one bucket, only you should be able to upload to that location; otherwise a malicious user could replace your content with nasty files. On the other hand, if you are going to let users upload content directly to save on bandwidth, make sure that goes to a separate bucket, ideally one per user. This is not trivial to enforce, and nearly impossible if you give the client direct upload access.
Depending on your upload configuration (client-side vs. server-side), your needs will differ. Client-side uploads will be cheaper up front in server costs, but be aware that someone will probably find a way to store arbitrary files, and you will be responsible for moderating that content. With the server-side model, be prepared for server costs to grow with user traffic, as you will need to build out more servers to handle upload requests.
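To make the server-side trade-off concrete: if your servers receive the file before forwarding it to S3, you can reject disallowed content up front. Here is a minimal, stdlib-only sketch of validating an upload by its magic bytes and size; the signatures are real, but `validate_upload` and the size cap are hypothetical names/values for illustration, not part of any library.

```python
# Sketch of server-side upload validation before forwarding a file to S3.
# The magic-byte signatures below are the real ones for these formats;
# validate_upload is a hypothetical helper, and the 5 MB cap is arbitrary.

ALLOWED_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # adjust to taste

def validate_upload(data: bytes) -> str:
    """Return the detected content type, or raise ValueError if rejected."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file too large")
    for magic, content_type in ALLOWED_SIGNATURES.items():
        if data.startswith(magic):
            return content_type
    raise ValueError("file type not allowed")
```

With a client-side (direct-to-S3) design you lose this checkpoint, which is exactly the moderation trade-off described above.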
Once you have the content hosted you will also want to look into a CDN (Content Delivery Network) such as Amazon's CloudFront (if you want to stay on the Amazon stack) or Akamai Networks. These will increase your costs at first, but save you money on high usage content.
Amazon SimpleDB is an interesting style of database. It is 'eventually consistent', which means that data written to the database may not be immediately visible to reads, similar to Amazon S3. If you plan to use the database to keep data synced across multiple nodes for many realtime transactions, I would not recommend it.
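If eventual consistency is unfamiliar, this toy model shows the failure mode: a write is acknowledged after reaching one replica, and a read that lands on another replica can return stale data until replication catches up. This is an illustration only, not SimpleDB's actual API.

```python
# Toy model of an eventually consistent store. A write lands on one
# replica and propagates later, so a read may return stale data.
# Purely illustrative -- not how SimpleDB is implemented or called.

import random

class EventuallyConsistentStore:
    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]

    def put(self, key, value):
        # The write is acknowledged once a single replica has it.
        self.replicas[0][key] = value

    def get(self, key):
        # A read may hit any replica, including a stale one.
        return random.choice(self.replicas).get(key)

    def propagate(self):
        # Background replication eventually copies data everywhere.
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])

store = EventuallyConsistentStore()
store.put("user:1", "avatar.png")
# Right after the write, get() may still return None (stale replica).
store.propagate()
# After propagation, every replica agrees and reads are consistent.
```

This read-after-write window is why I wouldn't build realtime, multi-node transactional syncing on top of it.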
Is GlusterFS a good solution to share data across all the instances in the Autoscaling group?
Possibly. The only way you'll get a definitive answer, however, is with your own tests. In the past, I've set up a 4-node webserver cluster on Linode instances, using GlusterFS to distribute/share the assets directory of images and so on.
We found 2 main problems with this approach:
- GlusterFS is pretty IO-intensive, and works really well on hardware with uncontended IO.
- Occasionally, a Linode server would experience less-than-optimal access to the backend SAN, and IO-wait time would go up dramatically. When this happened, Gluster would copy more data between the remaining nodes, which in turn caused IO performance to suffer on those nodes. The result was that a minor IO blip, caused by a suboptimal SAN configuration or timesharing, would mean the entire webserver cluster went poot, and the shared filesystem might become unavailable.
Purely anecdotal evidence, but I'd not run GlusterFS on a virtual machine with SAN/shared storage ever again.
Does Gluster ensure there is no loss of data?
It can. Gluster 3.0 has better support for "replication pools", where you define how many copies of the data exist throughout the cluster. Setting a replication level of 2 means there are two copies across the entire cluster. This effectively halves your storage capacity, but gives you greater resilience to node failure.
Importantly, it also means you have to add nodes in multiples of the replication level; in this case, pairs of nodes.
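The capacity and node-count arithmetic above can be sketched in a few lines; `plan_pool` is a hypothetical helper for illustration, not a Gluster tool.

```python
# Quick arithmetic for a replicated Gluster-style pool: usable capacity
# shrinks by the replication factor, and nodes must be added in
# multiples of it. plan_pool is a hypothetical illustration.

def plan_pool(node_count: int, node_capacity_gb: int, replica: int = 2) -> int:
    """Return usable capacity in GB, or raise if the node count is invalid."""
    if node_count % replica != 0:
        raise ValueError(
            f"node count must be a multiple of the replica count ({replica})"
        )
    return node_count * node_capacity_gb // replica

# E.g. four 500 GB nodes with replica 2 yield 1000 GB usable,
# and a 3-node pool is rejected because nodes must come in pairs.
```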
What will happen if all the instances in Autoscaling are terminated? Will I lose user data?
If the instances are only using ephemeral instance storage, yes. If they're EBS-backed, or using mounted EBS volumes, then no.
What happens if a user uploads an image and the server processing the request goes down?
That depends greatly on how your application is designed. I strongly suspect the user would lose their data (almost certainly, in a naively architected solution).
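One common way to avoid losing the upload is to persist the raw file first (e.g. straight to S3) and treat the processing step as a retryable job, so a crashed worker can be retried rather than dropping the data. Here is a generic retry-with-backoff sketch under that assumption; `retry` and its parameters are hypothetical names, and your processing function goes where `fn` is called.

```python
# Generic retry-with-exponential-backoff sketch. The idea: persist the
# upload first, then run processing through a retry wrapper so a
# transient worker failure does not lose the user's data.
# retry() is a hypothetical helper, not from any particular library.

import time

def retry(fn, attempts=3, base_delay=0.01):
    """Call fn(), retrying with exponential backoff; re-raise on final failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In a real deployment the same idea is usually implemented with a durable queue (e.g. SQS) rather than an in-process loop, so retries survive the server itself dying.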
Is there an impact on IO if clients go down ?
See above. If a client goes down because of backend storage problems, it can easily destroy the performance of the cluster entirely.
Best Answer
S3 is fast enough for most use cases. Even Amazon serves images from S3.
Having said that, consider putting the CloudFront CDN in front of the bucket and serving the images through CloudFront:
- It will cache them and reduce traffic to S3.
- It will bring the images closer to your visitors in different regions, because CloudFront has points of presence in over 160 datacentres around the world.
- It communicates with S3 over the internal AWS network, which is often faster than the public internet.
So yes, CloudFront is the answer. Hope that helps :)
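Once a distribution exists, the application-side change is often just pointing image URLs at the distribution domain instead of the bucket endpoint. A minimal sketch, assuming a hypothetical bucket and distribution domain (both placeholders, not real endpoints):

```python
# Rewrite S3 object URLs to their CloudFront equivalents.
# Both base URLs below are hypothetical placeholders -- substitute
# your own bucket endpoint and distribution domain.

S3_BASE = "https://my-bucket.s3.amazonaws.com"
CDN_BASE = "https://d1234example.cloudfront.net"

def cdn_url(s3_url: str) -> str:
    """Rewrite an S3 object URL to serve through the CloudFront distribution."""
    if not s3_url.startswith(S3_BASE):
        raise ValueError("not an object in the expected bucket")
    return CDN_BASE + s3_url[len(S3_BASE):]
```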