Performance associated with storing millions of files on NTFS

performance, storage

Does anyone have a method or formula that I could use, ideally based on both current and projected numbers of files, to work out the 'right' length of each split and the number of nested folders?

Please note that, although similar, this isn't quite the same as "Storing a million images in the filesystem". I'm looking for a way to make the theories outlined there more generic.

Assumptions

  • I have 'some' initial number of files. This number would be arbitrary but large. Say 500k to 10m+.
  • I have considered the underlying physical hardware disk IO requirements that would be necessary to support such an endeavor.

Put another way

As time progresses this store will grow. I want the best balance between current performance and performance as my needs increase, say if I double or triple my storage. I need to plan ahead for projected growth without sacrificing too much current performance.

What I've come up with

I'm already thinking about naming files by a hash and splitting it every so many characters to spread things across multiple directories and keep the trees even, much as outlined in the comments on the question above. Using a hash also avoids duplicate files, which would be critical over time.

I'm sure the right initial folder structure depends on what I've outlined and on the initial scale. As far as I can figure there isn't a one-size-fits-all solution here, and it would be horrendously time-consuming to work something out experimentally.
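As a minimal sketch of why a hash-based name avoids duplicates, assuming the hash is taken over the file contents (the post doesn't say which input is hashed) and using SHA-256 purely as an example: identical content always produces the same name, so a second copy lands on the same entry. The `content_key` and `store` helpers and the in-memory `index` dict are hypothetical stand-ins for a real on-disk store.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return a hex SHA-256 digest to use as the stored file name (assumed scheme)."""
    return hashlib.sha256(data).hexdigest()


def store(index: dict, data: bytes) -> str:
    """Record data under its content key; identical content maps to one entry."""
    key = content_key(data)
    # In-memory stand-in for "write the file only if it does not exist yet".
    index.setdefault(key, data)
    return key


index = {}
first = store(index, b"hello")
second = store(index, b"hello")   # same content, same key, no new entry
third = store(index, b"world")    # different content, different key
```

The hex digest can then be split into directory levels in the same way as any other hex name.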

Best Answer

Some years ago I started writing a storage system similar to Ceph. Then I discovered Ceph, and what they had worked better, so I dumped my development.

During the development process I asked a similar question to yours, but on SA. I did a lot of calculation on handling lots of small files and found that naming files (assuming they can be anything) by UUID and splitting it 3 levels deep was ample for my needs.

From memory I used the first 3 letters to form the top level, then the next 3 to form level 2 and then used the whole uuid for the file name.
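A rough sketch of that layout, in case it helps. The function name `shard_path`, the `D:/store` root, and the choice to drop the dashes from the UUID are my assumptions for illustration, not necessarily what the original system did.

```python
import uuid
from pathlib import PureWindowsPath


def shard_path(root: str, file_id: uuid.UUID, width: int = 3, levels: int = 2) -> PureWindowsPath:
    """Build root/<first 3 chars>/<next 3 chars>/<whole uuid> from a UUID."""
    name = file_id.hex  # 32 lowercase hex characters, dashes removed (assumption)
    # Slice consecutive `width`-character chunks off the front for each level.
    parts = [name[i * width:(i + 1) * width] for i in range(levels)]
    return PureWindowsPath(root, *parts, name)


example = shard_path("D:/store", uuid.UUID("12345678-1234-5678-1234-567812345678"))
print(example.as_posix())
```

Changing `width` and `levels` lets you try other splits (for example one character per level, four levels deep) without touching the rest of the code.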

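My calculation was based on the number of files I wanted, the amount of data stored per drive, and the limits of the filesystem type.

For a UUID in its usual hex form you only get 0-9 and a-f, so 16 possible values per character, and a 3-character directory name gives 16*16*16 = 4,096 combinations per level. If you instead name files with a mixed-case alphanumeric string you get A-Z, a-z, 0-9, so 26+26+10 or 62 values, and 3 characters gives 62*62*62 = 238,328 combinations per level (bear in mind that Windows treats names as case-insensitive by default, so upper and lower case would collide). I figured that many directory combinations was ample. For XFS this is fine. But for NTFS I'm not sure, so you had better find out what the real limits are. Just listing that many directories by opening up Explorer might cause your server to grind somewhat. So you may want to come up with a scheme that doesn't have as many folders at the top level, perhaps using a single letter and going 4 levels deep or something.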
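The kind of calculation described above can be sketched as a small function. This is only an illustration under stated assumptions: names are evenly distributed (random UUIDs or a hash), the alphabet is lowercase hex (16 characters), and `max_files_per_dir` is a comfort threshold you pick yourself, not an NTFS limit. The name `plan_levels` and its parameters are hypothetical.

```python
import math


def plan_levels(total_files: int, alphabet: int = 16, width: int = 2,
                max_files_per_dir: int = 10_000):
    """Return (levels, dirs_per_level, leaf_dirs, files_per_leaf).

    alphabet          -- distinct characters in the name (16 for hex; assumption)
    width             -- characters used per directory level
    max_files_per_dir -- self-chosen target, NOT a filesystem limit
    """
    fanout = alphabet ** width  # directories created under each parent
    levels = 0
    # Add levels until the average files per leaf directory fits the target.
    while total_files / fanout ** levels > max_files_per_dir:
        levels += 1
    leaf_dirs = fanout ** levels
    return levels, fanout, leaf_dirs, math.ceil(total_files / leaf_dirs)


current = plan_levels(500_000, width=3)       # today's file count
projected = plan_levels(10_000_000, width=2)  # projected file count
print(current, projected)
```

Running it for both the current and the projected file count shows whether a layout chosen today still keeps leaf directories small after growth, which is cheaper than finding out experimentally.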