ZFS on Linux – L2ARC not being read

Today I did some tests on L2ARC using the latest ZFS on Linux 0.7.10. I have seen that the L2ARC gets filled with data, but with the default module settings the data residing in the L2ARC cache is never touched; instead, the data is read from the vdevs of the main pool. I have also seen this behaviour in 0.7.9, and I am not sure whether it is expected.
Even if it were the expected behaviour, I think it is odd to pollute the L2ARC with data that is never read.


The test installation is a VM:

  • CentOS 7.5 with latest patches
  • ZFS on Linux 0.7.10
  • 2GB RAM

I did some ZFS settings:

  • l2arc_headroom=1024 to speed up the L2ARC population (see the sketch right below for how it was applied)
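
Such module parameters can be changed on the fly via sysfs, the same mechanism used further down for zfs_arc_max. A minimal sketch, assuming the value from the list above:

[root@host ~]# echo 1024 > /sys/module/zfs/parameters/l2arc_headroom
[root@host ~]# cat /sys/module/zfs/parameters/l2arc_headroom    # verify the value took effect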

Here is how the pool was created and the layout. I know it is rather odd for a real-world setup, but this was intended for L2ARC testing only.

[root@host ~]# zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc cache sdd -f
[root@host ~]# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  2.95G   333K  2.95G         -     0%     0%  1.00x  ONLINE  -
  raidz2  2.95G   333K  2.95G         -     0%     0%
    sda      -      -      -         -      -      -
    sdb      -      -      -         -      -      -
    sdc      -      -      -         -      -      -
cache      -      -      -         -      -      -
  sdd  1010M    512  1009M         -     0%     0%

Now write some data to a file and look at the device usage.

[root@host ~]# dd if=/dev/urandom of=/tank/testfile bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 9.03607 s, 59.4 MB/s

[root@host ~]# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  2.95G  1.50G  1.45G         -    10%    50%  1.00x  ONLINE  -
  raidz2  2.95G  1.50G  1.45G         -    10%    50%
    sda      -      -      -         -      -      -
    sdb      -      -      -         -      -      -
    sdc      -      -      -         -      -      -
cache      -      -      -         -      -      -
  sdd  1010M   208M   801M         -     0%    20%

Alright, some of the data has already been moved to the L2ARC, but not all of it. So, read the file a few more times to get it into the L2ARC completely.

[root@host ~]# dd if=/tank/testfile of=/dev/null bs=512 # until L2ARC is populated with the 512MB testfile
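
Instead of re-running the command by hand, a small loop can repeat the read and show the cache device filling up. Just a convenience sketch; the five iterations are an arbitrary choice, stop once ALLOC on the cache device roughly matches the file size:

[root@host ~]# for i in 1 2 3 4 5; do dd if=/tank/testfile of=/dev/null bs=512 2>/dev/null; zpool list -v | grep sdd; done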

[root@host ~]# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  2.95G  1.50G  1.45G         -    11%    50%  1.00x  ONLINE  -
  raidz2  2.95G  1.50G  1.45G         -    11%    50%
    sda      -      -      -         -      -      -
    sdb      -      -      -         -      -      -
    sdc      -      -      -         -      -      -
cache      -      -      -         -      -      -
  sdd  1010M   512M   498M         -     0%    50%

Okay, the L2ARC is populated and ready to be read. But one needs to get rid of the L1ARC first. I did the following, which seemed to work.

[root@host ~]# echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; arc_summary.py -p1

------------------------------------------------------------------------
ZFS Subsystem Report                        Sun Sep 09 17:03:55 2018
ARC Summary: (HEALTHY)
    Memory Throttle Count:                  0

ARC Misc:
    Deleted:                                20
    Mutex Misses:                           0
    Evict Skips:                            1

ARC Size:                           0.17%   1.75    MiB
    Target Size: (Adaptive)         100.00% 1.00    GiB
    Min Size (Hard Limit):          6.10%   62.48   MiB
    Max Size (High Water):          16:1    1.00    GiB

ARC Size Breakdown:
    Recently Used Cache Size:       96.06%  1.32    MiB
    Frequently Used Cache Size:     3.94%   55.50   KiB

ARC Hash Breakdown:
    Elements Max:                           48
    Elements Current:               100.00% 48
    Collisions:                             0
    Chain Max:                              0
    Chains:                                 0

Alright, now we are ready to read from the L2ARC (sorry for the long preface, but I thought it was important).
While running the dd if=/tank/testfile of=/dev/null bs=512 command again, I was watching zpool iostat -v 5 in a second terminal.

To my surprise, the file was read from the normal vdevs instead of the L2ARC, although it sits entirely in the L2ARC. This is the only file in the filesystem, and no other activity was going on during my tests.

              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.50G  1.45G    736     55  91.9M  96.0K
  raidz2    1.50G  1.45G    736     55  91.9M  96.0K
    sda         -      -    247     18  30.9M  32.0K
    sdb         -      -    238     18  29.8M  32.0K
    sdc         -      -    250     18  31.2M  32.0K
cache           -      -      -      -      -      -
  sdd        512M   498M      0      1  85.2K  1.10K
----------  -----  -----  -----  -----  -----  -----
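
The same conclusion can be cross-checked in the ARC statistics: l2_hits only increases when a read is actually served from the cache device, so it barely moves during such a read. A quick sketch using counters from arcstats:

[root@host ~]# grep -E '^l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats    # note the values
[root@host ~]# dd if=/tank/testfile of=/dev/null bs=512
[root@host ~]# grep -E '^l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats    # l2_hits has not increased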

I then fiddled around with some settings like zfetch_array_rd_sz, zfetch_max_distance, zfetch_max_streams, l2arc_write_boost and l2arc_write_max, setting them to absurdly high values. But nothing changed.

After changing either of the following (i.e. toggling it away from its default)

  • l2arc_noprefetch=0 (default is 1)
  • or zfs_prefetch_disable=1 (default is 0)

the reads are served from the L2ARC. Again I got rid of the L1ARC, ran dd if=/tank/testfile of=/dev/null bs=512, and watched zpool iostat -v 5 in a second terminal.

[root@host ~]# echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch 
[root@host ~]# echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; arc_summary.py -p1
...
[root@host ~]# dd if=/tank/testfile of=/dev/null bs=512 

And the result:

              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.50G  1.45G      0     57    921   102K
  raidz2    1.50G  1.45G      0     57    921   102K
    sda         -      -      0     18      0  34.1K
    sdb         -      -      0     18      0  34.1K
    sdc         -      -      0     19    921  34.1K
cache           -      -      -      -      -      -
  sdd        512M   497M    736      0  91.9M   1023
----------  -----  -----  -----  -----  -----  -----

Now data is read from L2ARC, but only after toggling the module parameters mentioned above.
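
To keep such a toggle across reboots, it can go into a modprobe option file; a sketch, assuming the commonly used file name /etc/modprobe.d/zfs.conf:

[root@host ~]# echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf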

I have also read that an L2ARC can be sized too big, but the threads I found on that topic were referring to performance problems or to the space map for the L2ARC polluting the L1ARC.
Performance is not my problem here, and as far as I can tell the space map for the L2ARC is also not that big.

[root@host ~]# grep hdr /proc/spl/kstat/zfs/arcstats 
hdr_size                        4    279712
l2_hdr_size                     4    319488
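
As a rough sanity check, that l2_hdr_size matches expectations: assuming the default recordsize of 128K, the 512 MiB test file consists of 4096 blocks, so each cached block costs on the order of 78 bytes of L1ARC.

[root@host ~]# echo $(( 512 * 1024 * 1024 / (128 * 1024) ))    # blocks in the test file
4096
[root@host ~]# echo $(( 319488 / 4096 ))                       # header bytes per cached block
78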

As already mentioned, I am not sure whether this is the intended behaviour or whether I am missing something.

Best Answer

So after reading up on this topic, mainly this post, it seems this is the default behaviour of ZFS.

What happens is that the file makes its way into the L1ARC after being read, and because its blocks are being accessed they are considered for the L2ARC.
On a second read of the file, ZFS prefetches it, and prefetch reads bypass the L2ARC even though the blocks of the file are stored there.

By disabling prefetching completely with zfs_prefetch_disable=1, or by letting the L2ARC cache and serve prefetched buffers with l2arc_noprefetch=0, reads will make use of the blocks of the file residing in the L2ARC.
This might be desired if your L2ARC is large compared to the sizes of the files being read.
But one might want to put only metadata into the L2ARC with zfs set secondarycache=metadata tank. This prevents big files from ending up in the L2ARC and never being read, since they would pollute the L2ARC and might evict blocks of smaller (non-prefetched) files and metadata, which you do want to keep in the L2ARC.
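
A minimal sketch of that alternative (a dataset-level property, applied here to the pool's root dataset):

[root@host ~]# zfs set secondarycache=metadata tank
[root@host ~]# zfs get secondarycache tank    # should now report "metadata"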

I haven't found a way to tell ZFS to put only small files into the L2ARC while keeping prefetch candidates out of it. So for now, depending on file sizes and L2ARC size, one has to make that tradeoff.
A different approach seems to be available in the ZoL 0.8.0 release, where it is possible to use Allocation Classes, which should make it possible to e.g. put metadata on fast SSDs while leaving data blocks on slow rotating disks. This will still leave the small-files-versus-big-files contention for the L2ARC, but it will solve the issue of fast access to metadata.
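
For completeness, a sketch of how that could look with 0.8.0 allocation classes; the NVMe device names are placeholders, and special_small_blocks is optional, additionally sending small data blocks to the special vdev:

[root@host ~]# zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1    # metadata now goes to the fast mirror
[root@host ~]# zfs set special_small_blocks=32K tank                      # optional: also store data blocks <= 32K there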