This answer has been edited after the question was clarified.
What other reasons cause clouds to prefer DAS?
Where "DAS" means Direct Attached Storage, i.e. SATA or SAS hard disk drives.
Cloud vendors all use DAS because it offers order-of-magnitude improvements in price/performance. It is a case of scaling horizontally.
In short, SATA hard disk drives and SATA controllers are cheap commodities. They are mass-market products, and are priced very low. By building a large cluster of cheap PCs with cheap SATA drives, Google, Amazon and others obtain vast capacity at a very low price point. They then add their own software layer on top. Their software does multi-server replication for performance and reliability, monitoring, re-balancing replication after hardware failure, and other things.
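To make the "software layer on top" concrete, here is a minimal sketch of one piece of it: deciding which cheap machines hold the replicas of a given object. This uses rendezvous hashing, which is one of several placement schemes such systems use; the function and server names are illustrative, not taken from any particular vendor's implementation.

```python
import hashlib

def replica_nodes(key: str, servers: list[str], n_replicas: int = 3) -> list[str]:
    """Pick n_replicas distinct servers for a key by ranking every server
    on a hash of (server, key) -- a simple form of rendezvous hashing.
    Any node can recompute the placement without a central lookup table."""
    ranked = sorted(
        servers,
        key=lambda s: hashlib.sha256(f"{s}:{key}".encode()).hexdigest(),
    )
    return ranked[:n_replicas]

servers = [f"node{i:02d}" for i in range(12)]
print(replica_nodes("photos/cat.jpg", servers))
```

When a server dies, only the keys that ranked it in their top three need re-replicating; everything else stays put, which is what makes re-balancing after hardware failure tractable at this scale.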
You could take a look at MogileFS as a simpler representative of the kind of software that Google, Amazon and others use for storage. It's a different implementation of course, but it shares many of the same design goals and solutions as the large-scale systems. If you want to, here is a jumping-off point for learning more about GoogleFS.
As stated later in the paper: clouds should use SAN or NAS, because DAS is not appropriate when a VM moves to another server.
There are 2 reasons why SANs are not used.
1) Price.
SANs are hugely expensive at large scale. While they may be the technically "best" solution, they are typically not used at very large scale installations due to the cost.
2) The CAP Theorem
Eric Brewer's CAP theorem shows that at very large scale you cannot maintain strong consistency while keeping acceptable reliability, fault tolerance, and performance. SANs are an attempt at implementing strong consistency in hardware. That may work nicely for a 5,000-server installation, but it has never been proved to work for Google's 250,000+ servers.
Result:
So far the cloud computing vendors have chosen to push the complexity of maintaining server state to the application developer. Current cloud offerings do not provide consistent state for each virtual machine. Application servers (virtual machines) may crash and their local data be lost at any time.
Each vendor then has their own implementation of persistent storage, which you're supposed to use for important data. Amazon's offerings are nice examples: MySQL, SimpleDB, and Simple Storage Service (S3). These offerings themselves reflect the CAP theorem -- the MySQL instance has strong consistency, but limited scalability. SimpleDB and S3 scale fantastically, but are only eventually consistent.
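"Eventually consistent" is worth seeing in miniature. Below is a toy model (not any real store's API) of why a read right after a write may return stale data until the replicas sync in the background:

```python
import random

class EventuallyConsistentStore:
    """Toy model: a write lands on one replica; a read hits a random
    replica, so it may return stale data until an anti-entropy pass
    brings all replicas back into agreement."""
    def __init__(self, n_replicas: int = 3):
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value):
        random.choice(self.replicas)[key] = value   # accepted by one replica

    def read(self, key):
        return random.choice(self.replicas).get(key)  # may be stale (None)

    def anti_entropy(self):
        merged = {}
        for r in self.replicas:
            merged.update(r)
        for r in self.replicas:
            r.update(merged)   # after the sync, every replica converges

store = EventuallyConsistentStore()
store.write("k", "v1")
# Right after the write, read("k") may return "v1" or None.
store.anti_entropy()
print(store.read("k"))   # after convergence: always "v1"
```

A strongly consistent system (like the MySQL instance) never exposes the stale window, but pays for it in coordination, which is exactly the scalability limit the CAP theorem describes.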
Your questions are non-trivial and there is not enough info to give a good answer.
I can give an answer (clustered filesystem over fibre channel SAN) - but it may well turn out to be more expensive and complex than it needs to be.
So I'll just throw out some comments/thoughts. Really stuff for you to consider.
Perhaps after reading this brain dump, you'll be able to restate your app's intended behaviour and maybe then we can give you a better answer.
NAS devices export file systems (e.g. CIFS, NFS), so you don't really connect them to your servers - your servers mount file systems from them.
That means reads and writes to them need to go over a network connection.
So if you have a 100mbit network connection between your NAS and your server and your read/writes occur at a 1:1 ratio, then the best you'll get is 50mbit reads, because for every byte you read, you also write a byte. If your client and download traffic are on that same network then you can halve it again.
Clearly if you want to use a NAS then you're going to want multiple NICs in your servers and multiple networks/VLANs in your architecture.
Assume there are 4 possible data locations in your app:
- A) Original data source, e.g. the internet.
- B) Your server.
- C) NAS.
- D) Client.
Then there are 4 possible data vectors:
- AB, i.e. the data download from A (the net) to B (your server).
- BC, i.e. writing data from your server to the NAS.
- CB, i.e. reading data from the NAS to your server.
- BD, i.e. writing data from your server to the client.
Depending on how your app works and ignoring protocol overhead you may (worst case) then need 4 100mbit networks to transport 100mbit per second to your clients.
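The worst-case arithmetic above can be checked with a few lines. The assumption (stated in the text) is that every byte served to a client was first downloaded, written to the NAS, and read back, and that protocol overhead is ignored:

```python
# Worst case: every byte delivered to a client (BD) was first downloaded
# from the net (AB), written to the NAS (BC), and read back from it (CB).
client_rate_mbit = 100   # target delivery rate to clients

vectors = {
    "AB (internet -> server)": client_rate_mbit,
    "BC (server -> NAS)":      client_rate_mbit,
    "CB (NAS -> server)":      client_rate_mbit,
    "BD (server -> client)":   client_rate_mbit,
}
total = sum(vectors.values())
print(f"total traffic: {total} mbit/s")   # 400 mbit/s across 4 networks

# If all four vectors instead share one 100 mbit link, they contend for it,
# and effective client throughput drops to 100 / 4 = 25 mbit/s.
print(f"single shared link: {client_rate_mbit / len(vectors):.0f} mbit/s")
```

That is the same halving logic as the 50 mbit example earlier, just with four flows on the wire instead of two.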
So you'll need to consider the read and the write bandwidth to the NAS if you use a NAS.
If you use a FC SAN you can reduce your network needs and you get other advantages.
E.g.
Depending on the OS and the filesystem you end up using, a SAN will allow you to grow your LUNs dynamically and grow your filesystems live, as well as share the LUNs with more hosts, again potentially as a live operation.
You can reduce the cost of the SAN by not using fibre channel, e.g. you could use iSCSI.
In which case you'll again want separate networks for your data, and you'll want dedicated NICs, ideally with TCP/iSCSI offload hardware. That will give you most of the advantages of a SAN at lower cost.
I have not really used iSCSI except for the most basic single LUN to a single host, with simple Linux LVM and ext3, so I am not 100% sure if it is really as good as a FC SAN, but I gather it can be if well implemented.
SAN arrays are probably the better choice if you are going to use a clustered filesystem. The question is do you really need a clustered file system?
That will depend on the characteristics of your app and your architecture.
Now if your app can guarantee that only one node will write to a given file at a given time, then you can probably go with a NAS. But you may have problems if you modify a file with one host while it is being read by another host, so your app would need to detect and deal with that scenario. If that is a scenario you don't want to bother with, then a clustered file system is probably a better choice - they are designed to work with that sort of scenario.
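One common way an app enforces that "only one node writes at a time" rule over a NAS is advisory file locking. The sketch below uses POSIX `flock()`; be warned that its behaviour over NFS depends on the NFS version and server configuration, so you would need to verify it against your actual NAS:

```python
import fcntl
import os
import tempfile

def try_exclusive_write(path: str, data: bytes) -> bool:
    """Take a non-blocking exclusive lock before writing.
    Returns False if another process already holds the lock.
    NOTE: flock() is advisory only, and its semantics over NFS
    vary by NFS version and server -- test on your real NAS."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)          # someone else is writing; back off
        return False
    try:
        os.write(fd, data)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
    return True

path = os.path.join(tempfile.gettempdir(), "demo.lockfile")
print(try_exclusive_write(path, b"hello"))   # True when no other writer holds it
```

A clustered filesystem moves exactly this coordination out of your app and into the storage layer, which is why it is the safer default when you cannot make the single-writer guarantee.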
So questions like some of these listed below might make a big difference to your architecture:
- Does the file need to be reused again after it has been downloaded once and sent to the client - i.e. might it be re-read from storage and served to another client?
- Does a file need to be written completely to storage before it is sent to a client?
- Can a file be stored on local disk on the server and served to the client from local disk, and then be written to NAS/SAN after it has been served to the client?
- Are multiple clients likely to be using the same file at once? E.g. is it likely that 50 clients will access one file, or that 50 clients will access 50 different files?
- If 50 clients each request the same file, will it be downloaded once or 50 times?
- If another client comes along 3 hours later and requests the same file will the file be downloaded again or will it come from disk?
- Is the disk a cache or a slow buffer?
- Will there be any other processing performed on the file before it is returned to the clients, e.g. being security scanned, having URLs re-written, etc.?
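To show how much these questions matter, here is one possible answer to a few of them (serve from local disk, fall back to the NAS, download from the origin only once, archive afterwards) sketched as code. Every name and path here is illustrative; it is one design among many, not a recommendation:

```python
import os
import shutil
import tempfile

def serve_file(name, cache_dir, nas_dir, fetch_from_origin):
    """Serve from local DAS if cached; else pull the NAS copy; else
    download once from the origin, then archive to the NAS. This
    treats local disk as a cache and the NAS as the system of record."""
    os.makedirs(cache_dir, exist_ok=True)
    os.makedirs(nas_dir, exist_ok=True)
    local = os.path.join(cache_dir, name)
    if os.path.exists(local):
        return local                    # cache hit: no NAS traffic at all
    nas = os.path.join(nas_dir, name)
    if os.path.exists(nas):
        shutil.copy(nas, local)         # warm the local cache (vector CB)
        return local
    fetch_from_origin(local)            # one download from A (vector AB)
    shutil.copy(local, nas)             # archive after serving (vector BC)
    return local

# Demo with temp dirs and a fake origin standing in for the internet:
cache, nas = tempfile.mkdtemp(), tempfile.mkdtemp()

def fake_origin(dest):
    with open(dest, "w") as f:
        f.write("payload")

p = serve_file("video.mp4", cache, nas, fake_origin)
print(open(p).read())   # "payload"; a second call now hits the local cache
```

Under this design, popular files never touch the NAS after the first request, so the CB vector shrinks with the cache hit rate - which is why the "is the disk a cache or a slow buffer?" question above changes the bandwidth sums so dramatically.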
Given the limited info we have I'd say the safest architecture is the most expensive and complex architecture as that will deal with most of the worst case problems and be very scalable.
I.e. Fibre channel SAN and clustered file system.
In all cases whatever your storage, DAS, SAN, NAS, all other things being equal more spindles are better.
TL;DR
1) On a cloud, not that many cheap options unless you want to go for an S3-like system. With a centralized system, you can only scale so far before you start running into issues (see scaling up vs. scaling out), so if you are rolling your own solution you'd probably be best starting off with a distributed system that lets you add and remove servers on demand, rather than just getting a big SAN and adding disks to it.
2) They will almost certainly use dedicated hardware, co-located or in private datacenters. If you go to a storage provider and say "hey, I want to buy 2000 disks" they'll give you some pretty decent discounts if you know what you're doing. Storing 100TB of data will always be cheaper (Per GB) than storing 100GB, the more you store the cheaper it gets.
Have a look into a distributed data store like HDFS or Riak. I've never used HDFS, but we're using a Riak cluster on 4 nodes with 10TB of storage. Riak has an HTTP API, so with a little bit of careful configuration you can just point your CDN at your Riak cluster. Alternatively just use S3, Rackspace Cloud Files, Google Storage, etc. and let someone else worry about that for you. Since pre-existing storage providers are already at multi-TB/PB scale, they can most likely do it cheaper than you could rolling your own.
That being said, Backblaze (an online backup company) "open sourced" the designs for their storage "pods", which store ridiculous amounts of data very cheaply. They are more suited to "write once, sit there doing nothing for years" as is the nature of backups, but it's still an interesting read. You could also look into something like the Broadberry storage servers; their top-end model has 36 hot-swap drive bays but costs $5k+ without drives (filling it with 2TB enterprise 7200RPM drives you're looking at more like $25k, or with cheap drives $15k; it entirely depends on your workload). OVH provide some "backup" servers with ~20TB of un-RAIDed storage for around £200/mo if I remember correctly.
You also need to think about tiered storage. Basically, this means you split your data up into "tiers" based on what you need. If some of your objects must be kept at all costs and need to be accessed quickly, they should be on top, or "gold" tier storage, with fast, reliable disks, on servers well equipped to handle the load. This might be the sort of thing you would put on a high-end SAN with lots of lovely SAS or even SSD disks. If you have some objects which are re-generatable and don't need to be accessed quickly (say, thumbnails for images that are normally cached on CDN edges), you can put those on "silver" tier storage: cheaper disks, on slower servers. Then you have your backups: you may never need them, and they might not need to be available immediately, but you want to keep them for as long as possible, as cheaply as possible. You might put those on "bronze" storage, like tapes.
The storage levels I described are for a purely fictional situation, it's entirely possible to have 50 tiers of storage, and you can call them whatever you want. It might be that even your lowest tier of storage requires super-fast access, that all depends on your usage.
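A tiering policy like the gold/silver/bronze example is usually just a small decision function over each object's requirements. Here is a sketch using those same fictional tier names; real policies typically also weigh object size, age, and access frequency:

```python
def choose_tier(must_keep: bool, fast_access: bool, regenerable: bool) -> str:
    """Map an object's requirements to a storage tier. The tier names
    and rules mirror the fictional gold/silver/bronze example above."""
    if must_keep and fast_access:
        return "gold"    # fast SAS/SSD, e.g. on a high-end SAN
    if regenerable and not fast_access:
        return "silver"  # cheaper disks, slower servers
    if must_keep:
        return "bronze"  # backups: tape or other cold storage
    return "silver"      # default for everything else

print(choose_tier(must_keep=True,  fast_access=True,  regenerable=False))  # gold
print(choose_tier(must_keep=False, fast_access=False, regenerable=True))   # silver
print(choose_tier(must_keep=True,  fast_access=False, regenerable=False))  # bronze
```

The point is that the policy is cheap to change; the expensive part is provisioning the hardware behind each tier, so get the requirements right before buying the SAN.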