How to partition a directory structure for GlusterFS

We have 3 folders on an Ubuntu 14.04 machine, each containing 250K pictures of 2KB-30KB, and we expect each folder to grow to 1M files.

While trying to scale the application to several servers, we are looking into GlusterFS for shared storage. Although 250K files in a directory are not a problem on ext4, they seem to be problematic for GlusterFS: trying to copy the files crashes the machine entirely.

I am looking to partition the files into two levels of directories:

mkdir -p {000..255}/{000..255}

/000/000/filename
/001/000/filename
/001/001/filename
...
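
To keep placement deterministic, I am thinking of deriving the two directory levels from a hash of the filename, roughly like this (just a sketch; the use of md5sum and taking the first two hash bytes are my own assumptions):

f="example.jpg"                              # hypothetical filename
h=$(printf '%s' "$f" | md5sum | cut -c1-4)   # first 4 hex chars = 2 bytes
d1=$(printf '%03d' $((16#${h:0:2})))         # first byte  -> 000..255
d2=$(printf '%03d' $((16#${h:2:2})))         # second byte -> 000..255
mv "$f" "/$d1/$d2/$f"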

Does this sound like a reasonable approach? The entire structure will contain millions of files later on. Would this allow GlusterFS to run reliably in production, with good performance, while hosting millions of files?

Best Answer

Using GlusterFS to store and access lots of very small files is a difficulty many deployments face, and it seems you're already on a good path to solving the problem: breaking the files up into separate directories.

You could implement a solution like that: create a bunch of directories, choose a limit on how many files go in each one, and hope you don't run out of places to put files. Your example creates 65,536 directories, so that's unlikely to be a problem any time soon.

Another option is to create directories based on the date a file is created. For example, if the file cust_logo_xad.png was created today, it would be stored here:

/gluster/files/2015/08/24/cust_logo_xad.png
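
A minimal sketch of that layout in bash (assuming the /gluster/files prefix from the example; everything else is illustrative):

base=/gluster/files
dir="$base/$(date +%Y/%m/%d)"   # e.g. /gluster/files/2015/08/24
mkdir -p "$dir"
mv cust_logo_xad.png "$dir/"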

If you're hosting data for different entities (customers, departments, etc.), you could separate files based on ownership, assigning each entity a unique ID of some sort. For example:

/gluster/files/ry/ry7eg4k/cust_logo_xad.png
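
A sketch of that scheme, assuming (as the example path suggests) that the first two characters of the ID pick the parent directory; the ID itself is hypothetical:

base=/gluster/files
id=ry7eg4k                      # hypothetical entity ID
dir="$base/${id:0:2}/$id"       # -> /gluster/files/ry/ry7eg4k
mkdir -p "$dir"
mv cust_logo_xad.png "$dir/"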

Beyond that, it would be a good idea to look at the GlusterFS documentation on tuning the storage cluster for small files. At the very least, make sure that:

  1. The file systems on the GlusterFS storage servers have enough free inodes available (an mkfs-time option; see the sketch after this list).
  2. The drives on the GlusterFS storage servers can handle lots of IOPS.
  3. You use an appropriate file system for the task (either ext4 or XFS).
  4. Your application / staff doesn't frequently scan directories containing lots of small files.
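
For point 1, a sketch of checking and provisioning inodes (device and mount point names are placeholders; note that XFS allocates inodes dynamically, while ext4 fixes the count at mkfs time):

df -i /gluster/brick1           # check used/free inodes on a brick
mkfs.ext4 -i 4096 /dev/sdb1     # ext4: one inode per 4KB of space instead of the 16KB default
mkfs.xfs -i maxpct=50 /dev/sdb1 # xfs: let inodes use up to 50% of the file system's space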

If you can (and if you haven't already), it's a good idea to create a database to act as an index for the files, rather than having to scan (e.g. ls) or search (e.g. find) for them all of the time.
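
For instance, a minimal sketch using SQLite (the database location, schema, and column names are all assumptions):

db=/var/lib/fileindex.db
sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, path TEXT, created TEXT);'
sqlite3 "$db" "INSERT INTO files VALUES ('cust_logo_xad.png', '/gluster/files/2015/08/24/cust_logo_xad.png', date('now'));"
sqlite3 "$db" "SELECT path FROM files WHERE name = 'cust_logo_xad.png';"   # look up a path without scanning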