Linux – Slow performance due to txg_sync for ZFS 0.6.3 on Ubuntu 14.04

linux, performance-tuning, zfs

I am using native ZFS with "ZFS on Linux" installed from the PPA here. Setup was not a problem, and I am using it in a mirrored configuration with two WD 4TB Red HDDs. Unfortunately, I am having performance issues when writing to the disk array. Read performance is OK.

During large writes to the array, the copy process stalls to ~5-10MB/s every ~5 seconds, as reported by rsync. The speed between stalls is ~75MB/s, which is in line with other filesystems and with what I would expect from the system (I tried btrfs, which gets ~85MB/s). Looking at iotop, I found that the stalls coincide with the txg_sync process hogging I/O. This appears to be the "bursty" I/O problem that seems to be common with ZFS (see here and here). I have applied the option from the first link:

options zfs zfs_prefetch_disable=1

which helped somewhat with the performance issues but did not solve them. The 5s interval of txg_sync appears to match vfs.zfs.txg.timeout="5" (i.e. 5s), which is the default setting of ZFS on Linux, where the corresponding module parameter is named zfs_txg_timeout.
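ZFS on Linux exposes its module parameters under /sys/module/zfs/parameters, which is also where zfs_prefetch_disable is checked further down. Below is a minimal sketch for reading the transaction group timeout the same way, assuming the Linux counterpart of vfs.zfs.txg.timeout is named zfs_txg_timeout. The helper zfs_param and the PARAM_DIR override are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# Print the value of a ZFS module parameter, or "unknown" if it is not exposed.
# PARAM_DIR is a hypothetical override for testing; the default is the sysfs
# directory used by ZFS on Linux.
zfs_param() {
    name="$1"
    dir="${PARAM_DIR:-/sys/module/zfs/parameters}"
    if [ -r "$dir/$name" ]; then
        cat "$dir/$name"
    else
        echo "unknown"
    fi
}

echo "zfs_txg_timeout:      $(zfs_param zfs_txg_timeout)"
echo "zfs_prefetch_disable: $(zfs_param zfs_prefetch_disable)"
```

On a machine without the zfs module loaded, both lines simply report "unknown".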

Is this normal behaviour, or are there other settings that I can try? If so, any suggestions? Note that I couldn't find many of the options mentioned in those links…

EDIT 2: To follow up a little: the system I am using is an HP ProLiant Microserver N36L, which I upgraded to 8GB ECC RAM. The commands I used to create the ZFS volume are given below. Note that I am using -o ashift=12, as I found on the zfsonlinux FAQ that this should get ZFS to play nicely with the 4096-byte sectors of Advanced Format disks.

$ zpool create -o ashift=12 -m /zpools/tank tank mirror ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0871252 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E3PKP1R0
$ zfs set relatime=on tank
$ zfs set compression=lz4 tank
$ zfs create -o casesensitivity=mixed tank/data

I added the zfs_prefetch_disable option to /etc/modprobe.d/zfs.conf to make the change permanent:

options zfs zfs_prefetch_disable=1

So that:

$ cat /sys/module/zfs/parameters/zfs_prefetch_disable 
1

EDIT 1: As requested, I added the zpool get all output. Note that I forgot to mention that I turned on compression on the pool…

$ zpool get all
NAME  PROPERTY               VALUE                  SOURCE
tank  size                   3.62T                  -
tank  capacity               39%                    -
tank  altroot                -                      default
tank  health                 ONLINE                 -
tank  guid                   12372923926654962277   default
tank  version                -                      default
tank  bootfs                 -                      default
tank  delegation             on                     default
tank  autoreplace            off                    default
tank  cachefile              -                      default
tank  failmode               wait                   default
tank  listsnapshots          off                    default
tank  autoexpand             off                    default
tank  dedupditto             0                      default
tank  dedupratio             1.00x                  -
tank  free                   2.21T                  -
tank  allocated              1.42T                  -
tank  readonly               off                    -
tank  ashift                 12                     local
tank  comment                -                      default
tank  expandsize             0                      -
tank  freeing                0                      default
tank  feature@async_destroy  enabled                local
tank  feature@empty_bpobj    active                 local
tank  feature@lz4_compress   active                 local

Best Answer

Pacoman, it seems that because you have two WD Red drives in a mirror, writing the ZIL consistency group to disk is causing high I/O. There is always a ZIL (write cache). If you do not have any LOG devices, then the log lives on the pool itself and can be as large as maximum write speed * 5 seconds. You're probably reading from the ZIL and committing the data to permanent storage every 5 seconds. Questions:

  1. Do you have a SLOG device? This is ideally a DRAM Drive (HGST ZeusRAM, etc...).
  2. Do you have any cache devices to read from? Ideally, a bunch of Flash, like a 480GB PCIe card.
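To put a number on the "maximum write speed * 5 seconds" bound mentioned above, here is a back-of-the-envelope sketch using the ~75MB/s between-stall speed and the 5 second interval from your question. It is only an estimate under those two figures, not a measurement of your pool.

```shell
#!/bin/sh
# Upper bound from the answer: maximum write speed * sync interval.
write_speed_mb_s=75   # speed between stalls, taken from the question
txg_interval_s=5      # txg_sync interval, taken from the question

burst_mb=$((write_speed_mb_s * txg_interval_s))
echo "Up to ${burst_mb} MB may need to be flushed per interval"
```

With these inputs the bound works out to 375 MB per 5 second interval.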

My recommendation would be to create a SLOG somewhere other than the pool (even the boot device is better than nowhere, assuming it is NOT flash). This way you aren't reading and writing to the mirror intensively every 5 seconds.
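A separate log device is attached with zpool add and the log keyword. The sketch below only prints the command for review rather than running it. The pool name tank comes from your question, the device path is a hypothetical placeholder, and slog_add_cmd is a helper name made up for this example.

```shell
#!/bin/sh
# Build (but do not run) the command that adds a separate log device to a pool.
slog_add_cmd() {
    pool="$1"
    log_dev="$2"
    echo "zpool add $pool log $log_dev"
}

POOL="tank"                                    # pool name from the question
LOG_DEV="/dev/disk/by-id/EXAMPLE-LOG-DEVICE"   # hypothetical; replace with a real device

# Print the command so it can be checked before being run manually as root.
slog_add_cmd "$POOL" "$LOG_DEV"
```

Printing first is deliberate: adding the wrong device to a pool is awkward to undo, so it is worth confirming the device ID before executing anything.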