Priming a ZFS L2ARC cache on Solaris 11.3

Tags: solaris, zfs, zfs-l2arc

Is there a good way to prime a ZFS L2ARC cache on Solaris 11.3?

The L2ARC is designed to ignore blocks that have been read sequentially from a file. This makes sense for ongoing operation but makes it hard to prime the cache for initial warm-up or benchmarking.

In addition, highly fragmented files may benefit greatly from having sequential reads cached in the L2ARC (because on disk they are effectively random reads), but with the current heuristics these files will never be cached, even if the L2ARC is only 10% full.

In previous releases of Solaris 10 and 11, I had success running dd twice in a row on each file. The first dd read the file into the ARC, and the second dd seemed to tickle the buffers so that they became eligible for L2ARC caching. The same technique does not appear to work in Solaris 11.3.
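For reference, the double read was essentially the following (the file name is just a placeholder):

    # First pass pulls the file into the ARC; on older releases the second pass
    # seemed to make the now-cached blocks eligible for the L2ARC.
    dd if=/path/to/datafile.bin of=/dev/null bs=1024k
    dd if=/path/to/datafile.bin of=/dev/null bs=1024k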

I have confirmed that the files in question have an 8k recordsize, and I have tried setting zfs_prefetch_disable, but this had no impact on the L2ARC behaviour. UPDATE: zfs_prefetch_disable turns out to be important after all; see my answer below.
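(For what it's worth, the recordsize check is just the standard property lookup; the dataset name below is a placeholder.)

    # Confirm that the dataset uses an 8k recordsize
    zfs get recordsize tank/mydataset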

If there is no good way to do it, I would consider using a tool that generates random reads over 100% of a file. This might be worth the time now that the cache is persistent in 11.3. Do any tools like this exist?

Best Answer

With a bit of experimentation I've found four possible solutions.

With each approach, you need to perform the steps and then continue reading more data to fill the ZFS ARC cache and trigger the feed from the ARC to the L2ARC. Note that if the data is already cached in memory, or if the compressed on-disk size of each block is greater than 32kB, these methods generally won't do anything.
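As a rough illustration of that warm-up step (paths are placeholders, and the kstat name is as observed on my systems), the extra reads and the cache monitoring can look like this:

    # Keep reading additional data to put pressure on the ARC and trigger the
    # feed from the ARC into the L2ARC
    for f in /tank/otherdata/*; do
        dd if="$f" of=/dev/null bs=1024k
    done

    # Meanwhile, poll the L2ARC size every 10 seconds to watch it fill
    kstat -p zfs:0:arcstats:l2_size 10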

1. Set the documented kernel flag zfs_prefetch_disable

The L2ARC by default refuses to cache data that has been automatically prefetched. We can bypass this by disabling the ZFS prefetch feature. This flag is often a good idea for database workloads anyway.

echo "zfs_prefetch_disable/W0t1" | mdb -kw

...or, to set it permanently, add the following to /etc/system:

set zfs:zfs_prefetch_disable = 1

Now when files are read using dd, they will still be eligible for the L2ARC.
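A quick sanity check (the file name is a placeholder, and mdb output formatting varies slightly between releases):

    # Verify the tunable took effect -- the value shown should be 1
    echo "zfs_prefetch_disable/D" | mdb -k

    # Sequentially read a data file; these blocks are now L2ARC-eligible,
    # although they only reach the L2ARC once the ARC comes under pressure
    dd if=/path/to/datafile.bin of=/dev/null bs=1024k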

Operationally, this change also improves read behaviour in my testing. Normally, when ZFS detects a sequential read, it balances throughput between the data vdevs and the cache vdevs instead of reading only from the cache, which hurts performance if the cache devices offer significantly lower latency or higher throughput than the data devices.

2. Re-write the data

As data is written to a ZFS filesystem it is cached in the ARC and, if it meets the block-size criteria, is eligible to be fed into the L2ARC. It's not always easy to re-write data, but some applications and databases can do it live, e.g. through application-level file mirroring or by moving the data files (a crude copy-and-rename sketch follows the list of problems below).

Problems:

  • Not always possible depending on the application.
  • Consumes extra space if there are snapshots in use.
  • (But on the bright side, the resulting files are defragmented.)
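Where a file can be taken out of use briefly, a crude copy-and-rename forces the rewrite. This is only a sketch with placeholder paths; the copy is not consistent if the application still has the file open, and the old blocks remain pinned by any snapshots:

    # Rewrite a file by copying it and renaming the copy over the original.
    # The file must not be in active use while this runs.
    cp /tank/db/datafile.dbf /tank/db/datafile.dbf.rewrite
    mv /tank/db/datafile.dbf.rewrite /tank/db/datafile.dbf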

3. Unset the undocumented kernel flag l2arc_noprefetch

This is based on reading the OpenSolaris source code and is no doubt completely unsupported. Use at your own risk.

  1. Disable the l2arc_noprefetch flag:

    echo "l2arc_noprefetch/W0" | mdb -kw
    

    Data read into the ARC while this flag is disabled will be eligible for the L2ARC even if it's a sequential read (as long as the blocks are at most 32k on disk).

  2. Read the file from disk:

    dd if=filename.bin of=/dev/null bs=1024k
    
  3. Re-enable the l2arc_noprefetch flag:

    echo "l2arc_noprefetch/W1" | mdb -kw
    

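At any point the flag can be read back to confirm its current value (again, entirely unsupported):

    # Display the current value of the undocumented flag (1 = default, 0 = disabled)
    echo "l2arc_noprefetch/D" | mdb -k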
4. Read the data randomly

I wrote a Perl script to read files in 8kB chunks pseudorandomly (based on the ordering of a Perl hash). It may also work with larger chunks but I haven't tested that yet.

#!/usr/bin/perl
#
# Read each file named on the command line in pseudorandom block order.
# The shuffle comes from the ordering of the keys of a Perl hash.
use strict;
use warnings;

my $BLOCK_SIZE = 8 * 2**10;   # 8kB blocks, matching the dataset recordsize
my $MAX_ERRS   = 5;           # give up on a file after this many read errors

foreach my $file (@ARGV) {
    print "Reading $file...\n";

    my $size = (stat($file))[7];
    unless($size) { print STDERR "Unable to stat file $file.\n"; next; }

    my $fh;
    unless(open($fh, '<', $file)) { print STDERR "Unable to open file $file.\n"; next; }

    # Build a hash keyed by block number; iterating over its keys visits the
    # blocks in an effectively random order.
    my %blocks;
    for(my $i = 0; $i < $size / $BLOCK_SIZE; $i++) { $blocks{"$i"} = 0; }

    my $buf;
    my $errs = 0;
    foreach my $block (keys %blocks) {
        unless(sysseek($fh, $block * $BLOCK_SIZE, 0) && sysread($fh, $buf, $BLOCK_SIZE)) {
            print STDERR "Error reading $BLOCK_SIZE bytes from offset " . $block * $BLOCK_SIZE . "\n";
            if(++$errs == $MAX_ERRS) { print STDERR "Giving up on this file.\n"; last; }
            next;
        }
    }
    close($fh);
}
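Usage is straightforward, assuming the script is saved as (say) readrandom.pl; the data file names below are placeholders:

    # Read every 8kB block of each file in pseudorandom order
    perl readrandom.pl /tank/db/datafile1.dbf /tank/db/datafile2.dbf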

Problems:

  • This takes a long time and puts a heavy workload on the disk.

Remaining issues

  • The above methods will get the data into main memory, eligible for feeding into the L2ARC, but they don't trigger the feed. The only way I know to trigger writing to the L2ARC is to continue reading data to put pressure on the ARC.
  • On Solaris 11.3 with SRU 1.3.9.4.0, the L2ARC only rarely grows to the full size I expect. The evict_l2_eligible kstat increases even when the SSD cache devices are under no pressure, indicating that eligible data is being dropped before it reaches the L2ARC (see the kstat sample below). This remaining rump of uncached data has a disproportionate effect on performance.
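The kstat check mentioned above is just a periodic sample of the L2ARC size against the eligible-but-evicted counter (statistic names as observed on my system):

    # Sample the L2ARC size and the evict_l2_eligible counter every 10 seconds
    kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:evict_l2_eligible 10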