The best filesystem for storing thousands of files in one dictionary-like id-blob structure

filesystemsstorage

What filesystem best suits my needs?

Thousands or even millions of files in one directory.
Good (ext4 & ntfs level or close) reliability (incl. fault tolerance) and access speed.
No directories actually needed, as well as descriptive names, just a dictionary-like structure of id-blob pairs is all I need.
No links, attributes, and access control features needed.
Having storage-level closed-key data encryption (so that if the hard drive is lost or stolen, no one can make any use of its content) would be cool.

The purpose is a file storage where all the metadata (data describing all the facts about what the file actually contains and who can access it) is stored in a MySQL database.

As far as I know common filesystems like NTFS and ext3/4 can go dead-slow if there are too many files placed in one directory – that's why I ask.

Best Answer

The problem is not so much the file system (NTFS actually is ok with some hundred thousands) - it is all the tools around. Don't even DARE opening the directory in windows explorer. Shell scripts will take ages if a dir returns 2 million etc.

Better in a folder hierarchy.

Give every file a 16 byte hex code

Make folders / File names in 4 char segments,

So a file would maybe be in affc/2548/2224.... etc.

Keeps the directories shorter

AND you may be able to implement mount points here (though a 4 symbol file level is too wide for that).

DO not forget, too, you need to possibly backup/restore that

Related Solutions

Linux – the best Linux filesystem for MySQL (InnoDB)

How much do you value the data?

Seriously, each filesystem has its own tradeoffs. Before I go much further, I am a big fan of XFS and Reiser both, although I often run Ext3. So there isn't a real filesystem bias at work here, just letting you know...

If the filesystem is little more than a container for you, then go with whatever provides you with the best access times.

If the data is of any significant value, you will want to avoid XFS. Why? Because if it can't recover a portion of a file that is journaled it will zero out the blocks and make the data un-recoverable. This issue is fixed in Linux Kernel 2.6.22.

ReiserFS is a great filesystem, provided that it never crashes hard. The journal recovery works fine, but if for some reason you loose your parition info, or the core blocks of the filesystem are blown away, you may have a quandry if there are multiple ReiserFS partitions on a disk - because the recovery mechanism basically scans the entire disk, sector by sector, looking for what it "thinks" is the start of the filesystem. If you have three partitions with ReiserFS but only one is blown, you can imagine the chaos this will cause as the recovery process stitches together a Frankenstein mess from the other two systems...

Ext3 is "slow", in a "I have 32,000 files and it takes time to find them all running ls" kinda way. If you're going to have thousands of small temporary tables everywhere, you will have a wee bit of grief. Newer versions now include an index option that dramatically cuts down the directory traversal but it can still be painful.

I've never used JFS. I can only comment that every review of it I've ever read has been something along the lines of "solid, but not the fastest kid on the block". It may merit investigation.

Enough of the Cons, let's look at the Pros:

XFS:

screams with enormous files, fast recovery time
very fast directory search
Primitives for freezing and unfreezing the filesystem for dumping

ReiserFS:

Highly optimal small-file access
Packs several small files into same blocks, conserving filesystem space
fast recovery, rivals XFS recovery times

Ext3:

Tried and true, based on well-tested Ext2 code
Lots of tools around to work with it
Can be re-mounted as Ext2 in a pinch for recovery
Can be both shrunk and expanded (other filesystems can only be expanded)
Newest versions can be expanded "live" (if you're that daring)

So you see, each has its own quirks. The question is, which is the least quirky for you?

Storing a million images in the filesystem

I'd recommend using a regular file system instead of databases. Using file system is easier than a database, you can use normal tools to access files, file systems are designed for this kind of usage etc. NTFS should work just fine as a storage system.

Do not store the actual path to database. Better to store the image's sequence number to database and have function that can generate path from the sequence number. e.g:

 File path = generatePathFromSequenceNumber(sequenceNumber);

It is easier to handle if you need to change directory structure some how. Maybe you need to move the images to different location, maybe you run out of space and you start storing some of the images on the disk A and some on the disk B etc. It is easier to change one function than to change paths in database.

I would use this kind of algorithm for generating the directory structure:

First pad you sequence number with leading zeroes until you have at least 12 digit string. This is the name for your file. You may want to add a suffix:
- 12345 -> 000000012345.jpg
Then split the string to 2 or 3 character blocks where each block denotes a directory level. Have a fixed number of directory levels (for example 3):
- 000000012345 -> 000/000/012
Store the file to under generated directory:
- Thus the full path and file filename for file with sequence id 123 is 000/000/012/00000000012345.jpg
- For file with sequence id 12345678901234 the path would be 123/456/789/12345678901234.jpg

Some things to consider about directory structures and file storage:

Above algorithm gives you a system where every leaf directory has maximum of 1000 files (if you have less that total of 1 000 000 000 000 files)
There may be limits how many files and subdirectories a directory can contain, for example ext3 files system on Linux has a limit of 31998 sub-directories per one directory.
Normal tools (WinZip, Windows Explorer, command line, bash shell, etc.) may not work very well if you have large number of files per directory (> 1000)
Directory structure itself will take some disk space, so you'll do not want too many directories.
With above structure you can always find the correct path for the image file by just looking at the filename, if you happen to mess up your directory structures.
If you need to access files from several machines, consider sharing the files via a network file system.
The above directory structure will not work if you delete a lot of files. It leaves "holes" in directory structure. But since you are not deleting any files it should be ok.

Best Answer

Related Solutions

Linux – the best Linux filesystem for MySQL (InnoDB)

Storing a million images in the filesystem

Related Topic