ZFS copy-on-write

zfs, zfsonlinux

I am planning to use ZFS for backups. 5-10 servers will "stream" updates via DRBD to very large files (500 GB each) on a ZFS file system.

Each server will generate about 20 MB/s of random writes, roughly 100 MB/s in total. I don't read these files, so the access pattern should be almost 100% writes.

Copy-on-write is a very important feature for me.

As I understand it, COW should transform random writes into sequential writes, but this is not happening.

I tested on a server with 12 SAS drives, an E5520 Xeon (4 cores) and 24 GB RAM, and random write performance was very low.

I decided to debug it first on a single SAS HDD in the same server.

I created an ext4 file system and ran some tests:

 
root@zfs:/mnt/hdd/test# dd if=/dev/zero of=tempfile bs=1M count=4096 conv=fdatasync,notrunc
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 30.2524 s, 142 MB/s

So sequential write speed is about 140 MB/s.

Random writes are ~500 KB/s, ~100-150 IOPS, which is normal.

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
bs: 1 (f=1): [w(1)] [0.6% done] [0KB/548KB/0KB /s] [0/137/0 iops] [eta 02h:02m:57s]

Then I created a ZFS pool on the same drive:

zpool create -f -m /mnt/data bigdata scsi-35000cca01b370274

I set the record size to 4K because I will have 4K random writes. A 4K record size performed better than the default 128K when I was testing.

zfs set recordsize=4k bigdata
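
To confirm the settings took effect, something like this can be used:

zfs get recordsize,compression,dedup bigdata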

I tested random writes to a 4 GB file:

fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=./test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randwrite
./test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
./test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/115.9MB/0KB /s] [0/29.7K/0 iops] [eta 00m:00s]

It looks like COW did well here: 115.9 MB/s.

Then I tested random writes to a 16 GB file:

fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=./test16G --bs=4k --iodepth=1 --size=16G --readwrite=randwrite

test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
bs: 1 (f=1): [w(1)] [0.1% done] [0KB/404KB/0KB /s] [0/101/0 iops] [eta 02h:08m:55s]

Very poor results: about 400 KB/s.

I tried the same with an 8 GB file:

fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=./test8G --bs=4k --iodepth=1 --size=8G --readwrite=randwrite

test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 8192MB)

bs: 1 (f=1): [w(1)] [14.5% done] [0KB/158.3MB/0KB /s] [0/40.6K/0 iops] [eta 00m:53s]

At the beginning COW was fine, around 136 MB/s.
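
I was watching the disk with iostat in extended, per-second mode, something like:

iostat -x -m 1 sda sdg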

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00 1120.00     0.00   136.65   249.88     9.53    8.51    0.00    8.51   0.89  99.24

But by the end, when the test reached about 90%, the write speed dropped to around 5 MB/s.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00  805.90     0.00     5.33    13.54     9.95   12.34    0.00   12.34   1.24 100.00

So 4 GB files are fine, 8 GB files are almost fine, but 16 GB files don't seem to get any benefit from COW.

I don't understand what is happening here. Maybe memory caching plays a role.
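
If caching is involved, I assume it can be watched via the ZFS on Linux kstats and module parameters, roughly like this (paths may vary with the ZFS version):

# current ARC size vs. its maximum
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats

# how much dirty data ZFS buffers before throttling writers
cat /sys/module/zfs/parameters/zfs_dirty_data_max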

OS: Debian 8
ZFS pool version 5000.
No compression or deduplication.


zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bigdata  1.81T  64.4G  1.75T         -     2%     3%  1.00x  ONLINE  -


root@zfs:/mnt/data/test# zdb
bigdata:
    version: 5000
    name: 'bigdata'
    state: 0
    txg: 4
    pool_guid: 16909063556944448541
    errata: 0
    hostid: 8323329
    hostname: 'zfs'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 16909063556944448541
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 8547466691663463210
            path: '/dev/disk/by-id/scsi-35000cca01b370274-part1'
            whole_disk: 1
            metaslab_array: 34
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data



zpool status bigdata
  pool: bigdata
 state: ONLINE
  scan: none requested
config:

    NAME                      STATE     READ WRITE CKSUM
    bigdata                   ONLINE       0     0     0
      scsi-35000cca01b370274  ONLINE       0     0     0
errors: No known data errors

fio doesn't work with O_DIRECT on ZFS, so I had to run without it. As I understand it, running buffered should produce even better results, but that is not happening.

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=./test --filename=test16G --bs=4k --iodepth=1 --size=16G --readwrite=randwrite
./test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
fio: looks like your file system does not support direct=1/buffered=0
fio: destination does not support O_DIRECT

Best Answer

This isn't a fair fight:

  • You are writing zeros with dd, whereas fio defaults to generating a single random block and reusing it when possible. Since all-zero data is (slightly) more compressible, this can skew the numbers.
  • The block size being used by your dd is 1 MByte, whereas the block size being used by your fio line is 4 KBytes.
  • You are limiting the iodepth of your fio run to only 1 (this compounds the smaller-than-dd block size choice above).
  • Your dd is always allowed to use the writeback cache but one of your fio runs was not (because you were setting direct=1)!
  • Because your version of ZFS doesn't "allow" O_DIRECT, even if you were to try to use a higher depth with libaio the submission wouldn't be asynchronous (see SSD IOPS on linux, DIRECT much faster than buffered, fio), but as the maximum outstanding I/O is limited to one (see above) this is a moot issue.
  • Your dd is writing so little data (4 GBytes) in comparison to the size of your memory (24 GBytes) that Linux could have significant amounts of write data still in its page cache when it finishes. You would have to do at least some sort of file sync to ensure it had really reached non-volatile storage... As pointed out in the comments, conv=fdatasync has been set on dd, so there will be a final fdatasync before it exits, which ensures the data isn't sitting only in volatile caches.
  • You're comparing sequential on an Ext4 volume to random on a ZFS volume (you changed workload AND filesystems between comparisons).

At the bare minimum I'd suggest starting over: do all your tests on ZFS, use fio for both the sequential and random tests, and leave the rest of the line the same. I'd also consider using something like end_fsync to ensure the data had actually hit the disk and wasn't just sitting in volatile caches (but I can see an argument for skipping this bit). A sketch of such a pair of runs is below.
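
As a rough illustration only (the file name and size are placeholders, not a prescription), keeping everything identical except readwrite might look like:

# sequential baseline on the ZFS dataset
fio --name=seq --filename=/mnt/data/test/fiofile --ioengine=libaio --iodepth=1 --bs=4k --size=16G --readwrite=write --end_fsync=1

# random counterpart: only --readwrite changes
fio --name=rand --filename=/mnt/data/test/fiofile --ioengine=libaio --iodepth=1 --bs=4k --size=16G --readwrite=randwrite --end_fsync=1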

TL;DR: I'm afraid your comparison methodology is flawed; perhaps it's better to change less between comparisons?