FreeBSD – Why does ZFS fail to cache this workload when “normal” filesystems cache it completely?

freebsd, zfs, zfsonlinux

Update: because recordsize defaults to 128k, the amount of data read by the test program is much larger than the ARC on an 8GB system and still slightly larger than the ARC on a 16GB system. Reducing the recordsize means far less data is read, so it fits in the ARC. I had underestimated the size of the data being read and the effect of recordsize on it, and so drew some poor conclusions. So far, disabling prefetch doesn't seem to make much difference in this case, although I am going to try all the recordsize options with and without prefetch enabled.
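
A rough sanity check of that, purely as an illustrative assumption (the scanner touching roughly two 128K records per file, one at the head for ID3v2 tags and one at the tail), already lands just above a 12GB ARC and far above the roughly 4GB default on the 8GB box. The prefetch toggles mentioned above are included for reference:

    # Back-of-the-envelope only, assuming ~2 x 128K records touched per file
    echo $(( 49000 * 2 * 128 / 1024 ))    # => 12250 MB of record data, before any metadata

    # Prefetch toggles referred to above:
    # FreeBSD, in /boot/loader.conf:
    #   vfs.zfs.prefetch_disable="1"
    # ZFS on Linux, at runtime (as root):
    echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable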

This load is similar to an IMAP/Maildir scenario: many directories, many files, and potentially only a small amount of data read from each file.

I have tested using FreeBSD 10 and Fedora 19 with zfsonlinux. I have
tested various Linux-native filesystems such as extX/xfs/jfs and even
btrfs. On FreeBSD I tested the native ufs filesystem as well.
My workload is simply scanning a largish music collection using
amarok/winamp/etc. My test program is amarok_collectionscanner
because it can easily be run from the command line. The pattern is
always the same. An initial run of the collection scanner takes
around 10 minutes depending on the filesystem, and on that first
cold-cache run ZFS performs similarly to the non-ZFS filesystems.
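
The runs are simply timed back to back; the invocation below is only a sketch (the path is a placeholder and the exact scanner arguments may differ):

    time amarok_collectionscanner /tank/music > /dev/null   # initial (cold) scan
    time amarok_collectionscanner /tank/music > /dev/null   # repeat scan, ideally served from cache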

Subsequent runs of a scan are incredibly fast on a non-ZFS
filesystem, usually around 30 seconds. ZFS makes only marginal
improvements on subsequent runs. It's also obvious from watching
iostat that after an initial run on a non-ZFS filesystem the OS
doesn't touch the disk at all: it's all in the filesystem cache.
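
That observation is easy to reproduce: run a repeat scan in one terminal and watch the disk in another (the 5-second interval is arbitrary); on the non-ZFS filesystems the device columns sit near idle during repeat scans.

    iostat -x 5        # Linux (sysstat)
    iostat -x -w 5     # FreeBSD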

Using an SSD cache for ZFS improves the time, but it never gets
anywhere near 30 seconds.
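
For reference, the SSD cache here means an L2ARC vdev; the pool and device names below are placeholders:

    zpool add tank cache /dev/ada1    # attach the SSD as an L2ARC device
    zpool iostat -v tank 5            # watch how much the cache device actually serves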

Why doesn't ZFS cache this load? One possibility I explored was that
the ARC was capped at less than what a non-ZFS filesystem is allowed
to use for its filesystem cache. I tested again on a machine where the
ARC was allowed more memory than the entire RAM of the first test
system, and the numbers stayed the same.
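
The ARC ceiling and its current size can be checked directly on both platforms (the loader.conf and modprobe values are examples, not recommendations):

    # FreeBSD
    sysctl vfs.zfs.arc_max                      # configured ceiling
    sysctl kstat.zfs.misc.arcstats.size         # current ARC size
    #   raise it in /boot/loader.conf, e.g. vfs.zfs.arc_max="12G"

    # ZFS on Linux
    cat /sys/module/zfs/parameters/zfs_arc_max  # 0 = default (half of RAM)
    awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats
    #   set it in /etc/modprobe.d/zfs.conf, e.g. options zfs zfs_arc_max=8589934592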

I hope to find/create a fio recipe that duplicates this kind of
load. Basically it would need to create thousands of smallish files,
scan all the directories looking for the files, open each file and
read a small amount of data from each one. It is like the world's
worst database! I will probably test OpenIndiana next but I expect
the results to be the same.
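
Until a proper fio job exists, a plain shell loop is a rough stand-in for the access pattern (the path and the 64K read size are placeholders):

    # Open every file under the collection and read only its first 64K.
    # Run it twice; the second pass shows whether the data is served from cache.
    time sh -c 'find /tank/music -type f | while read -r f; do
        head -c 65536 "$f" > /dev/null
    done'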

The data set is 353GB spread across 49,000 files. The test systems had 8GB-16GB of RAM.
The zpool configuration made very little difference, but the tests I care about always used a single whole disk. I used ST3500630AS and WDC WD20EZRX-00D8PB0 drives among others.
The drives made almost no difference, and the amount of RAM and the speed of the CPUs made little or no difference. Only the filesystem in use changed the results appreciably, and those differences were quite substantial as noted above.
I have mountains of data points for the various filesystem parameters I tried; these are some of the variables I checked (a sketch of the recordsize sweep follows the list):
mdadm raid configurations (0 and 1)
zpool configurations, mirror and stripe
zfs recordsize
mdadm chunk size
filesystem blocksize
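
For the recordsize runs, a sweep along these lines would do it (dataset name, paths, and the scanner invocation are placeholders); recordsize only applies to blocks written after the property changes, so the collection has to be rewritten for each value:

    for rs in 8K 16K 32K 64K 128K; do
        zfs set recordsize=$rs tank/music
        rsync -a --delete /backup/music/ /tank/music/   # rewrite so the new recordsize takes effect
        time amarok_collectionscanner /tank/music > /dev/null
    done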

On a single ST3500630AS drive, with the default options for each filesystem, I got the numbers below.
This was on Fedora 19 with 8GB of RAM, kernel 3.11.10-200, ZFS 0.6.2-1. The values are scan times in seconds; subsequent scans were run without any attempt to clear the cache.

ZFS: 900, 804, 748, 745, 743, 752, 741
btrfs: 545, 33, 31, 30, 31, 31
ext2: 1091, 30, 30, 30, 30, 30...
ext3: 1014, 30, 30, 30, 30, 30...
ext4: 554, 31, 31, 32, 32, 31, 31...
jfs: 454, 31, 31, 31, 31...
xfs: 480, 32, 32, 32, 32, 31, 32, etc.

On FreeBSD 10, single drive WD20EZRX-00D8PB0, faster machine, 16GB of memory, ARC allowed to grow to 12GB:

ufs: 500, 18, 18...
zfs: 733, 659, 673, 786, 805, 657

Although the above variables sometimes had an effect on the initial cold-cache scan of
the data, it's the subsequent runs that all look the same. Standard filesystems cache everything and, as long as nothing else thrashes the cache, subsequent scans are lightning fast. ZFS doesn't exhibit that behaviour.

Best Answer

Start with disabling atime if not already done.

You might also investigate the impact of setting primarycache=metadata.
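
Roughly, with tank/music standing in for the actual dataset (primarycache=metadata keeps only metadata in the ARC, so measure its effect rather than assuming it helps):

    zfs set atime=off tank/music
    zfs set primarycache=metadata tank/music
    zfs get atime,primarycache tank/music    # confirm the settings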