Best storage server infrastructure? DAS/NAS/SAN or installing GlusterFS/Lustre/HDFS/RBDB


I am trying to design an infrastructure for the project I am working on. It will be a kind of file-sharing/downloading service (like RapidShare), so I will need a lot of storage and good scalability, and I will add new storage nodes as the project grows.

I have come up with a few candidate solutions for my project: Lustre, GlusterFS, HDFS, RDBD.

To start, I would have 2 servers: one server for the GlusterFS client + web server + DB server + a streaming server, and the other as a Gluster storage node. (After some time I would add more storage nodes and client servers; I don't know how many new client servers yet, I will see later.)

So I am thinking of going with GlusterFS. But I really wonder whether I need high-performance servers with large storage, or whether average/slow servers with large storage will do. Or are NAS/DAS/SAN solutions better for GlusterFS storage nodes? I might buy a NAS and install GlusterFS on it. I would be happy to hear your recommendations for the server specs (for both clients and nodes). I really don't know whether the nodes need a lot of RAM and good CPUs. I am sure the client servers do.

The files would be streamed as well, so automatic file replication is important. My system should work like a cloud: when traffic is high, the storage nodes should copy the most-demanded files so they can be served from several nodes. That would help me avoid scalability problems, and my visitors would still be able to stream/download those files.

Also, I am open to your experiences/thoughts about any other good solution. Lustre, HDFS and RBDB are the other options, and I would be happy to hear your thoughts on them. I would be very happy to hear from anyone with a comment on anything I have said here.

Thanks


Edit:

I know IOPS is the critical variable that I have to count on in every calculation of my network design; that's why I say random requests. But unfortunately, I don't have any statistics at all. That's why I am here 🙂

My project works like this: you enter a download URL on my website, my server downloads it, and you then download it from my own server, like a proxy downloader.

So I have a server with a 100mbit connection and a 2TB HDD for now. I am thinking of adding NAS servers. I really don't know whether I have to add duplicated storage nodes on the NAS. And is there a limit on how many NAS devices I can connect? I mean, can I connect at most 2 NAS servers to my main server?

Best Answer

Your questions are non-trivial and there is not enough info to give a good answer. I can give an answer (clustered filesystem over Fibre Channel SAN), but it may well turn out to be more expensive and complex than it needs to be.

So I'll just throw out some comments/thoughts, really just things for you to consider. Perhaps after reading this brain dump you'll be able to restate your app's intended behaviour, and then we may be able to give you a better answer.

NAS devices export file systems (e.g. CIFS, NFS), so you don't really connect them to your servers; your servers mount file systems from them. That means reads and writes to them have to go over a network connection. So if you have a 100mbit network connection between your NAS and your server and your reads/writes occur at a 1:1 ratio, then the best you'll get is 50mbit reads, because for every byte you read, you also write a byte. If your client and download traffic are on that same network then you can halve it again. Clearly, if you want to use a NAS then you are going to want multiple NICs in your servers and multiple networks/VLANs in your architecture.

Assume there are 4 possible data locations in your app:

  • A) Original data source, e.g. the internet.
  • B) Your server.
  • C) NAS.
  • D) Client.

Then there are 4 possible data vectors:

  • AB, i.e. downloading the data from A (the net) to B (your server).
  • BC, i.e. writing data from your server to the NAS.
  • CB, i.e. reading data from the NAS to your server.
  • BD, i.e. writing data from your server to the client.

Depending on how your app works, and ignoring protocol overhead, you may (worst case) then need four 100mbit networks to deliver 100mbit per second to your clients.
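The arithmetic above can be sketched in a few lines of Python. This is a rough model only: it assumes every data vector in use carries the same rate, that vectors placed on the same network split its capacity evenly, and that protocol overhead and duplex effects are ignored, as in the reasoning above. The function names are made up for illustration.

```python
import math

# The four data vectors from the list above.
VECTORS = ("AB", "BC", "CB", "BD")


def rate_per_vector_mbit(link_mbit, vectors_sharing_link):
    """Rate each vector gets when this many vectors share one link evenly."""
    if vectors_sharing_link < 1:
        raise ValueError("at least one vector must use the link")
    return link_mbit / vectors_sharing_link


def networks_needed(target_mbit, link_mbit, vectors=VECTORS):
    """Worst-case count of links so that every vector can run at target_mbit."""
    total_mbit = target_mbit * len(vectors)
    return math.ceil(total_mbit / link_mbit)


if __name__ == "__main__":
    # NAS reads and writes at 1:1 on one 100mbit link.
    print(rate_per_vector_mbit(100, 2))
    # All four vectors on the same 100mbit link.
    print(rate_per_vector_mbit(100, 4))
    # Links needed to give clients a full 100mbit in the worst case.
    print(networks_needed(100, 100))
```

If your app can skip a vector (for example, serving from local disk instead of reading back from the NAS), pass a shorter `vectors` tuple and the estimate drops accordingly.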

So you'll need to consider both the read and the write bandwidth to the NAS if you use a NAS. If you use an FC SAN you can reduce your network needs, and you get other advantages.

E.g. depending on the OS and the filesystem you end up using, a SAN will allow you to grow your LUNs dynamically and grow your filesystems live, as well as share the LUNs with more hosts, again potentially as a live operation.

You can reduce the cost of the SAN by not using Fibre Channel, e.g. you could use iSCSI. In that case you'll again want separate networks for your data and you'll want dedicated NICs, ideally with TCP/iSCSI offload hardware. That will give you most of the advantages of a SAN at lower cost.

I have not really used iSCSI except for the most basic setup of a single LUN to a single host, with simple Linux LVM and ext3, so I am not 100% sure whether it is really as good as an FC SAN, but I gather it can be if well implemented.

SAN arrays are probably the better choice if you are going to use a clustered filesystem. The question is do you really need a clustered file system? That will depend on the characteristics of your app and your architecture.

Now if your app can guarantee that only one node will write to a given file at a given time, then you can probably go with a NAS. But you may have problems if you modify a file on one host while it is being read on another host, so your app would need to detect and deal with that scenario. If that is a scenario you don't want to bother with, then a clustered file system is probably the better choice; they are designed to work with that sort of scenario.
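One common way for an app to deal with the "being read while it is being written" case on a shared file system is to write each download to a temporary name and rename it into place only when it is complete. The sketch below is an illustration, not something from your setup: `publish_atomically` is a hypothetical helper, and it assumes a POSIX file system where a rename within one directory is atomic. How readers on other hosts see the rename over NFS or CIFS depends on the mount and its caching, so test it on the actual NAS.

```python
import os
import tempfile


def publish_atomically(data: bytes, final_path: str) -> None:
    """Write data to a temp file next to final_path, then rename it into place.

    Readers opening final_path see either the old complete file or the new
    complete file, never a partially written one (assuming atomic rename).
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temp file must live on the same file system as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to storage before renaming
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This only helps with the single-writer case described above. If several hosts may write the same file at once, you are back to needing locking or a clustered file system.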

So questions like the ones below might make a big difference to your architecture:

  • Does the file need to be reused again after it has been downloaded once and sent to the client - i.e. might it be re-read from storage and served to another client?
  • Does a file need to be written completely to storage before it is sent to a client?
  • Can a file be stored on local disk on the server and served to the client from local disk, and then be written to NAS/SAN after it has been served to the client?
  • Are multiple clients likely to be using the same file at once? E.g. is it likely that 50 clients will access one file, or that 50 clients will access 50 different files?
  • If 50 clients each request the same file, will it be downloaded once or 50 times?
  • If another client comes along 3 hours later and requests the same file will the file be downloaded again or will it come from disk?
  • Is the disk a cache or a slow buffer?
  • Will there be any other processing performed on the file before it is returned to the clients, e.g. security scanning, URL rewriting, etc.?
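To make the "downloaded once or 50 times" and "cache or buffer" questions concrete, here is a minimal sketch of the cache variant, where a file is fetched from the origin only if it is not already on disk. Everything in it is an assumption for illustration: the names `cache_path` and `get_file`, the choice to key files by a SHA-256 hash of the URL, and the `fetch` callback standing in for whatever actually downloads the file. It does not handle two requests for the same new URL arriving at the same time, nor expiry of old files.

```python
import hashlib
import os
from typing import Callable


def cache_path(cache_dir: str, url: str) -> str:
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def get_file(cache_dir: str, url: str, fetch: Callable[[str], bytes]) -> str:
    """Return a local path for url, calling fetch only on a cache miss."""
    path = cache_path(cache_dir, url)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        data = fetch(url)  # hypothetical downloader supplied by the caller
        with open(path, "wb") as out:
            out.write(data)
    return path
```

With this behaviour, the second and later requests for a URL cost only storage reads, so the storage read path matters far more than the download path. If every request triggers a fresh download instead, the disk is just a buffer and the sizing changes accordingly.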

Given the limited info we have, I'd say the safest architecture is the most expensive and complex one, as that will deal with most of the worst-case problems and be very scalable, i.e. a Fibre Channel SAN and a clustered file system.

In all cases, whatever your storage (DAS, SAN or NAS), all other things being equal, more spindles are better.