Linux Disk I/O – How to Minimize Reads for File Access

filesystems, hard drive, linux, performance

According to this paper on Facebook's Haystack:

"Because of how the NAS appliances manage directory metadata, placing
thousands of files in a directory was extremely inefficient as the
directory’s blockmap was too large to be cached effectively by the
appliance. Consequently it was common to incur more than 10 disk
operations to retrieve a single image. After reducing directory sizes
to hundreds of images per directory, the resulting system would still
generally incur 3 disk operations to fetch an image: one to read the
directory metadata into memory, a second to load the inode into
memory, and a third to read the file contents."

I had assumed the OS would always cache the filesystem directory metadata and inodes in RAM, so that a file read would usually require just 1 disk IO.

Is this "multiple disk IOs to read a single file" problem described in the paper unique to NAS appliances, or does Linux have the same problem?

I'm planning to run a Linux server for serving images. Is there any way I can minimize the number of disk IOs – ideally by making sure the OS caches all the directory and inode data in RAM, so that each file read requires no more than 1 disk IO?

Best Answer

This depends on the filesystem being used. Some filesystems handle the large-directory problem better than others, and yes, caching affects how often you hit disk at all.

Older versions of ext3 had a very bad problem handling directories with thousands of files in them, which was fixed when the dir_index feature was introduced. Without dir_index, retrieving a file from a directory with thousands of entries can be quite expensive. Without knowing the details, I suspect that's what the NAS appliance in the paper was using.
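If you want to check whether your ext3/ext4 filesystem has dir_index turned on, something along these lines should work (a sketch that shells out to tune2fs; /dev/sda1 is a placeholder device, and reading the superblock usually needs root):

```python
import subprocess

# Placeholder device; substitute your actual ext3/ext4 partition.
device = "/dev/sda1"

# tune2fs -l dumps the superblock, including the feature list.
out = subprocess.run(
    ["tune2fs", "-l", device],
    capture_output=True, text=True, check=True,
).stdout

features = []
for line in out.splitlines():
    if line.startswith("Filesystem features"):
        features = line.split(":", 1)[1].split()
        break

if "dir_index" in features:
    print("dir_index is enabled; large directories use hashed b-tree lookups")
else:
    print(f"dir_index is off; enable it with: tune2fs -O dir_index {device}")
    # Pre-existing directories only get indexed after a rebuild:
    #   e2fsck -D -f <device>   (run on an unmounted filesystem)
```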

Modern filesystems (recent ext3, ext4, XFS) handle the large-directory problem a lot better than in olden days. Directory inodes can still get large, but the b-trees commonly used for indexing directory entries make for very speedy fopen times.
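As for keeping the metadata in RAM: one best-effort approach is to walk the image tree and stat() everything at startup, so the kernel pulls the directory entries and inodes into its dentry/inode caches and a later read only needs the one IO for file contents. A sketch (IMAGE_ROOT is a placeholder, and the caches can still be evicted under memory pressure):

```python
import os

# Placeholder root of the image tree; substitute your own.
IMAGE_ROOT = "/srv/images"

def warm_metadata_caches(root: str) -> int:
    """stat() every file under root so the kernel loads the
    directory entries and inodes into its dentry/inode caches."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # file vanished between the walk and the stat
    return count

if __name__ == "__main__":
    n = warm_metadata_caches(IMAGE_ROOT)
    print(f"touched metadata for {n} files under {IMAGE_ROOT}")
```

This doesn't pin anything in memory; if the box is short on RAM the kernel will evict these caches like any other, so it works best when there is enough memory to hold the metadata for the whole tree.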