We have 3 folders on an Ubuntu 14.04 machine, each containing 250K pictures of 2KB-30KB, and we expect each folder to grow to 1M files.
While trying to scale the application to several servers, we are looking into GlusterFS for shared storage. Although 250K files per directory are not a problem on ext4, they seem to be problematic for GlusterFS: trying to copy the files crashes the machine entirely.
I am looking to partition the files into two levels of directories:
mkdir -p {000..255}/{000..255}
/000/000/filename
/001/000/filename
/001/001/filename
...
Does this sound like a feasible approach? The entire structure will contain millions of files later on. Would this allow GlusterFS to be reliable in production with good performance while hosting millions of files?
Best Answer
Using GlusterFS to store and access lots and lots of very small files is a difficulty many implementations face, and it seems you're already on a good path to solving the problem: breaking the files up into separate directories.
You could implement a solution like that: create a bunch of directories, choose a limit for how many files can go in each directory, and hope you don't run out of places to put files. In your example you're creating 65k+ directories, so that's not likely to be a problem any time soon.
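One way to pick a bucket deterministically is to hash the file name and use two bytes of the hash as the two directory levels. A minimal sketch, assuming the {000..255}/{000..255} layout from your question and /data as the GlusterFS mount point (md5sum here is only used to spread files evenly, not for security):
# Sketch: derive a two-level bucket from the file name.
# /data is a placeholder for wherever the volume is mounted.
file=cust_logo_xad.png
hash=$(printf '%s' "$file" | md5sum)
lvl1=$(printf '%03d' $((16#${hash:0:2})))   # first hash byte  -> 000-255
lvl2=$(printf '%03d' $((16#${hash:2:2})))   # second hash byte -> 000-255
mkdir -p "/data/$lvl1/$lvl2"
cp "$file" "/data/$lvl1/$lvl2/$file"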
Another option is to create directories based on the date a file is created. For example, if the file cust_logo_xad.png was created today, it would be stored under a path built from today's date.
If you're hosting data for different entities (customers, departments, etc.) you could instead separate files based on ownership, assigning each entity a unique ID of some sort and using that ID as the directory name.
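As a rough sketch of both layouts (the date, the customer ID and the /data mount point are just placeholders):
# Date-based: put the file under its creation (modification) date
d=$(date -r cust_logo_xad.png +%Y/%m/%d)    # e.g. 2014/06/01
mkdir -p "/data/$d"
mv cust_logo_xad.png "/data/$d/"
# Ownership-based: put the file under the owning entity's ID
# (1042 is a made-up customer ID)
mkdir -p /data/1042
mv cust_logo_xad.png /data/1042/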
Beyond that, it would be a good idea to take a look at the GlusterFS documentation on tuning the storage cluster for hosting small files. At the very least, make sure the brick filesystems are created with a larger inode size (an mkfs option).
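For example, the GlusterFS documentation recommends XFS bricks with a 512-byte inode size. A minimal sketch, where /dev/sdb1 and /export/brick1 are placeholders for your own brick device and mount point:
# Sketch: create and mount a brick filesystem with larger inodes
mkfs.xfs -i size=512 /dev/sdb1
mkdir -p /export/brick1
mount /dev/sdb1 /export/brick1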
If you can (and if you haven't already), it's a good idea to create a database to act as an index for the files, rather than having to scan (e.g. ls) or search (e.g. find) for them all of the time.
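A minimal sketch of such an index using the sqlite3 command line tool (the database location, table schema and stored path are made up for illustration; any database will do):
# Sketch: record where each file lives so the application can look it up
# directly instead of walking the directory tree.
db=/var/lib/app/files.db
sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, path TEXT NOT NULL);'
# the path below is a placeholder for wherever the file was actually stored
sqlite3 "$db" "INSERT INTO files VALUES ('cust_logo_xad.png', '/data/171/064/cust_logo_xad.png');"
sqlite3 "$db" "SELECT path FROM files WHERE name = 'cust_logo_xad.png';"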