Priming a ZFS L2ARC cache on Solaris 11.3

Tags: solaris, zfs, zfs-l2arc

Is there a good way to prime a ZFS L2ARC cache on Solaris 11.3?

The L2ARC is designed to ignore blocks that have been read sequentially from a file. This makes sense for ongoing operation but makes it hard to prime the cache for initial warm-up or benchmarking.

In addition, highly fragmented files may benefit greatly from having sequential reads cached in the L2ARC (because on disk they are effectively random reads), but with the current heuristics these files will never be cached, even if the L2ARC is only 10% full.

In previous releases of Solaris 10 and 11, I had success running dd twice in a row on each file. The first dd read the file into the ARC, and the second dd seemed to tickle the buffers so that they became eligible for L2ARC caching. The same technique does not appear to work in Solaris 11.3.
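For reference, the double read was essentially the following (the file name is just a placeholder):

    # First pass pulls the file into the ARC; on older releases the second pass
    # seemed to make the now-cached blocks eligible for the L2ARC.
    dd if=/path/to/datafile.bin of=/dev/null bs=1024k
    dd if=/path/to/datafile.bin of=/dev/null bs=1024k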

I have confirmed that the files in question have an 8k recordsize, and I have tried setting zfs_prefetch_disable, but this had no impact on the L2ARC behaviour. UPDATE: zfs_prefetch_disable turns out to be important after all; see my answer below.
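(For what it's worth, the recordsize check is just the standard property lookup; the dataset name below is a placeholder.)

    # Confirm that the dataset uses an 8k recordsize
    zfs get recordsize tank/mydataset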

If there is no good way to do it, I would consider using a tool that generates random reads over 100% of a file. This might be worth the time now that the cache is persistent in 11.3. Do any tools like this exist?

Best Answer

With a bit of experimentation I've found four possible solutions.

With each approach, you need to perform the steps and then continue reading more data to fill the ZFS ARC cache and trigger the feed from the ARC to the L2ARC. Note that if the data is already cached in memory, or if the compressed on-disk size of each block is greater than 32kB, these methods generally won't do anything.
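As a rough illustration of that warm-up step (paths are placeholders, and the kstat name is as observed on my systems), the extra reads and the cache monitoring can look like this:

    # Keep reading additional data to put pressure on the ARC and trigger the
    # feed from the ARC into the L2ARC
    for f in /tank/otherdata/*; do
        dd if="$f" of=/dev/null bs=1024k
    done

    # Meanwhile, poll the L2ARC size every 10 seconds to watch it fill
    kstat -p zfs:0:arcstats:l2_size 10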

1. Set the documented kernel flag zfs_prefetch_disable

The L2ARC by default refuses to cache data that has been automatically prefetched. We can bypass this by disabling the ZFS prefetch feature. This flag is often a good idea for database workloads anyway.

echo "zfs_prefetch_disable/W0t1" | mdb -kw

...or, to set it permanently, add the following to /etc/system:

set zfs:zfs_prefetch_disable = 1

Now when files are read using dd, they will still be eligible for the L2ARC.
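A quick sanity check (the file name is a placeholder, and mdb output formatting varies slightly between releases):

    # Verify the tunable took effect -- the value shown should be 1
    echo "zfs_prefetch_disable/D" | mdb -k

    # Sequentially read a data file; these blocks are now L2ARC-eligible,
    # although they only reach the L2ARC once the ARC comes under pressure
    dd if=/path/to/datafile.bin of=/dev/null bs=1024k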

Operationally, this change also improves read behaviour in my testing. Normally, when ZFS detects a sequential read, it balances throughput between the data vdevs and the cache vdevs instead of reading only from the cache, which hurts performance if the cache devices offer significantly lower latency or higher throughput than the data devices.

2. Re-write the data

As data is written to a ZFS filesystem it is cached in the ARC and, if it meets the block-size criteria, is eligible to be fed into the L2ARC. It's not always easy to re-write data, but some applications and databases can do it live, e.g. through application-level file mirroring or by moving the data files (a crude copy-and-rename sketch follows the list of problems below).

Problems:

  • Not always possible depending on the application.
  • Consumes extra space if there are snapshots in use.
  • (But on the bright side, the resulting files are defragmented.)
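Where a file can be taken out of use briefly, a crude copy-and-rename forces the rewrite. This is only a sketch with placeholder paths; the copy is not consistent if the application still has the file open, and the old blocks remain pinned by any snapshots:

    # Rewrite a file by copying it and renaming the copy over the original.
    # The file must not be in active use while this runs.
    cp /tank/db/datafile.dbf /tank/db/datafile.dbf.rewrite
    mv /tank/db/datafile.dbf.rewrite /tank/db/datafile.dbf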

3. Unset the undocumented kernel flag l2arc_noprefetch

This is based on reading the OpenSolaris source code and is no doubt completely unsupported. Use at your own risk.

  1. Disable the l2arc_noprefetch flag:

    echo "l2arc_noprefetch/W0" | mdb -kw
    

    Data read into the ARC while this flag is disabled will be eligible for the L2ARC even if it's a sequential read (as long as the blocks are at most 32k on disk).

  2. Read the file from disk:

    dd if=filename.bin of=/dev/null bs=1024k
    
  3. Re-enable the l2arc_noprefetch flag:

    echo "l2arc_noprefetch/W1" | mdb -kw
    

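At any point the flag can be read back to confirm its current value (again, entirely unsupported):

    # Display the current value of the undocumented flag (1 = default, 0 = disabled)
    echo "l2arc_noprefetch/D" | mdb -k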
4. Read the data randomly

I wrote a Perl script to read files in 8kB chunks pseudorandomly (based on the ordering of a Perl hash). It may also work with larger chunks but I haven't tested that yet.

#!/usr/bin/perl
#
# Read each file named on the command line in pseudorandom block order.
# The shuffle comes from the ordering of the keys of a Perl hash.
use strict;
use warnings;

my $BLOCK_SIZE = 8 * 2**10;   # 8kB blocks, matching the dataset recordsize
my $MAX_ERRS   = 5;           # give up on a file after this many read errors

foreach my $file (@ARGV) {
    print "Reading $file...\n";

    my $size = (stat($file))[7];
    unless($size) { print STDERR "Unable to stat file $file.\n"; next; }

    my $fh;
    unless(open($fh, '<', $file)) { print STDERR "Unable to open file $file.\n"; next; }

    # Build a hash keyed by block number; iterating over its keys visits the
    # blocks in an effectively random order.
    my %blocks;
    for(my $i = 0; $i < $size / $BLOCK_SIZE; $i++) { $blocks{"$i"} = 0; }

    my $buf;
    my $errs = 0;
    foreach my $block (keys %blocks) {
        unless(sysseek($fh, $block * $BLOCK_SIZE, 0) && sysread($fh, $buf, $BLOCK_SIZE)) {
            print STDERR "Error reading $BLOCK_SIZE bytes from offset " . $block * $BLOCK_SIZE . "\n";
            if(++$errs == $MAX_ERRS) { print STDERR "Giving up on this file.\n"; last; }
            next;
        }
    }
    close($fh);
}
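Usage is straightforward, assuming the script is saved as (say) readrandom.pl; the data file names below are placeholders:

    # Read every 8kB block of each file in pseudorandom order
    perl readrandom.pl /tank/db/datafile1.dbf /tank/db/datafile2.dbf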

Problems:

  • This takes a long time and puts a heavy workload on the disk.

Remaining issues

  • The above methods will get the data into main memory, eligible for feeding into the L2ARC, but they don't trigger the feed. The only way I know to trigger writing to the L2ARC is to continue reading data to put pressure on the ARC.
  • On Solaris 11.3 with SRU 1.3.9.4.0, the L2ARC only rarely grows to the full size I expect. The evict_l2_eligible kstat increases even when the SSD cache devices are under no pressure, indicating that eligible data is being dropped before it reaches the L2ARC (see the kstat sample below). This remaining rump of uncached data has a disproportionate effect on performance.
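The kstat check mentioned above is just a periodic sample of the L2ARC size against the eligible-but-evicted counter (statistic names as observed on my system):

    # Sample the L2ARC size and the evict_l2_eligible counter every 10 seconds
    kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:evict_l2_eligible 10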