Linux – Distributed, Parallel, Fault-Tolerant File System

filesystems linux

There are so many choices that it's hard to know where to start. My requirements are these:

  • Runs on Linux
  • Most of the files will be between 5 and 9 MB in size. There will also be a significant number of small-ish JPEGs (100 x 100 px).
  • All of the files need to be available over http.
  • Redundancy: ideally it would provide space efficiency similar to RAID 5, about 75% (in RAID 5 with 4 identical disks, one disk's worth of space, 25%, is used for parity, so the array is 75% efficient)
  • Must support several petabytes of data
  • Scalable
  • Runs on commodity hardware
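
The RAID 5 arithmetic in the redundancy requirement can be sketched as below. This is only an illustration, assuming identical disks with a fixed number of disks' worth of space given over to parity; the function name `parity_efficiency` is made up for this example.

```python
from fractions import Fraction


def parity_efficiency(disks: int, parity_disks: int = 1) -> Fraction:
    """Fraction of raw capacity left for data.

    Assumes `disks` identical disks, with `parity_disks` disks' worth
    of space holding parity (1 for a RAID 5 style layout).
    """
    if disks <= parity_disks:
        raise ValueError("need more disks than parity disks")
    return Fraction(disks - parity_disks, disks)


# The 4-disk case from the requirement: 3 of 4 disks hold data.
print(float(parity_efficiency(4)))  # 0.75
```

Adding disks to the same single-parity group raises the efficiency (for example 7/8 with 8 disks), so 75% corresponds specifically to the 4-disk case.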

In addition, I look for these qualities, though they are not "requirements":

  • Stable, mature file system
  • Lots of momentum and support
  • etc

I would like some input as to which file system works best for the given requirements. Some people at my organization are leaning towards MogileFS, but I'm not convinced of the stability and momentum of that project. GlusterFS and Lustre, based on my limited research, appear to be better supported…

Thoughts?

Best Answer

If it were me, I would be using GlusterFS. The current release is pretty solid, and I know people at some very large installations in both the HPC and Internet space who rely on it in their production systems. You can basically tailor it to your needs by laying out the components as you need them. Unlike Lustre, it has no dedicated metadata servers, so central points of failure are minimized and the setup is easier to scale.

Unfortunately, I don't think there's an easy way to meet your 75% space-efficiency requirement without throwing performance down the drain.
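
To put rough numbers on that trade-off, here is a small sketch comparing the raw capacity needed for the same usable capacity under whole-file replication versus a 75%-efficient parity scheme. Everything in it is an assumption for illustration: the 3 PB figure stands in for "several petabytes", the helper names are hypothetical, and replication with N copies is just one common way of getting redundancy, not a description of any specific GlusterFS layout.

```python
def raw_for_replication(usable_pb: float, copies: int) -> float:
    """Raw capacity needed when every file is stored `copies` times."""
    return usable_pb * copies


def raw_for_efficiency(usable_pb: float, efficiency: float) -> float:
    """Raw capacity needed at a given space efficiency (0 < efficiency <= 1)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return usable_pb / efficiency


usable_pb = 3.0  # assumed stand-in for "several petabytes"

scenarios = {
    "2 copies (50% efficient)": raw_for_replication(usable_pb, 2),
    "3 copies (33% efficient)": raw_for_replication(usable_pb, 3),
    "parity at 75% efficiency": raw_for_efficiency(usable_pb, 0.75),
}

for label, raw_pb in scenarios.items():
    print(f"{label}: {raw_pb:.1f} PB raw for {usable_pb:.1f} PB usable")
```

Under these assumptions, two copies need 6 PB of raw disk and the 75% target needs 4 PB, which is the gap the requirement is trying to avoid.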

It does run on commodity hardware; however, the performance really shines when using an InfiniBand interconnect. Fortunately, the price of IB is really quite low these days.

You might want to check out the guys at Scalable Informatics and their Jackrabbit products as a solution. They support GlusterFS on their hardware, and the price of their solution certainly rivals the cost of putting something together from scratch.