I am planning to use ZFS for backups. 5-10 servers will "stream" updates via DRBD to very large files (500 gigabytes each) on a ZFS file system.
The servers will generate about 20 megabytes per second of random writes each, about 100 MB/s in total. I don't read these files, so the pattern should be almost 100% writes.
For me, copy-on-write (COW) is a very important feature.
As I understand it, COW should transform random writes into sequential writes. But this is not happening.
I tested on a server with 12 SAS drives, an E5520 Xeon (4 cores) and 24 GB RAM, and random write performance was very low.
I decided to debug it first on a single SAS HDD on the same server.
I created an EXT4 file system and ran some tests:
root@zfs:/mnt/hdd/test# dd if=/dev/zero of=tempfile bs=1M count=4096 conv=fdatasync,notrunc
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 30.2524 s, 142 MB/s
So sequential write speed is about 140 MB/s.
Random writes: ~500 KB/s at ~100-150 IOPS, which is normal for a single spinning disk.
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
bs: 1 (f=1): [w(1)] [0.6% done] [0KB/548KB/0KB /s] [0/137/0 iops] [eta 02h:02m:57s]
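As a quick sanity check, fio's two figures agree with each other: at a 4 KiB block size, the reported IOPS should multiply out to the reported throughput.

```shell
# ~137 IOPS of 4 KiB writes should match fio's ~548 KiB/s figure
iops=137
bs_kib=4
echo $(( iops * bs_kib ))   # KiB/s
```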
Then, on the same drive, I created a ZFS pool:
zpool create -f -m /mnt/data bigdata scsi-35000cca01b370274
I set the record size to 4K because I will have 4K random writes. A 4K record size worked better than the default 128K when I was testing.
zfs set recordsize=4k bigdata
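One plausible reason 4K beat 128K here: with a recordsize larger than the write size, every 4 KiB overwrite forces ZFS to read-modify-write a whole record. A back-of-envelope sketch of that amplification:

```shell
# With the default 128 KiB recordsize, each 4 KiB application write
# rewrites an entire 128 KiB record on disk:
recordsize_kib=128
write_kib=4
echo $(( recordsize_kib / write_kib ))   # x-fold write amplification
```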
Then I tested random writes to a 4 GB file:
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=./test --filename=test --bs=4k --iodepth=1 --size=4G --readwrite=randwrite
./test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
./test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/115.9MB/0KB /s] [0/29.7K/0 iops] [eta 00m:00s]
It looks like COW did well here: 115.9 MB/s.
Then I tested random writes to a 16 GB file:
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=./test16G --bs=4k --iodepth=1 --size=16G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
bs: 1 (f=1): [w(1)] [0.1% done] [0KB/404KB/0KB /s] [0/101/0 iops] [eta 02h:08m:55s]
Very poor results: about 400 kilobytes per second.
I tried the same with an 8 GB file:
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=./test8G --bs=4k --iodepth=1 --size=8G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 8192MB)
bs: 1 (f=1): [w(1)] [14.5% done] [0KB/158.3MB/0KB /s] [0/40.6K/0 iops] [eta 00m:53s]
At the beginning, COW was fine: 136 megabytes per second.
Device:  rrqm/s  wrqm/s  r/s   w/s      rMB/s  wMB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda      0.00    0.00    0.00  0.00     0.00   0.00    0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdg      0.00    0.00    0.00  1120.00  0.00   136.65  249.88    9.53      8.51   0.00     8.51     0.89   99.24
But at the end, when the test reached ~90%, write speed went down to 5 megabytes per second.
Device:  rrqm/s  wrqm/s  r/s   w/s     rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda      0.00    0.00    0.00  0.00    0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00
sdg      0.00    0.00    0.00  805.90  0.00   5.33   13.54     9.95      12.34  0.00     12.34    1.24   100.00
So 4 GB files are fine and 8 GB files are almost fine, but 16 GB files are not getting any benefit from COW.
I don't understand what is happening here. Maybe memory caching plays a role.
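A back-of-envelope check of that caching hypothesis. On ZFS on Linux of this era the ARC defaults to roughly half of physical RAM (an assumption; it is tunable via the zfs_arc_max module parameter), which lines up with the observed 4G/8G/16G behavior:

```shell
ram_gb=24
arc_gb=$(( ram_gb / 2 ))   # assumed default ARC ceiling: ~12 GB
for size_gb in 4 8 16; do
  if [ "$size_gb" -le "$arc_gb" ]; then
    echo "${size_gb}G file: working set fits in ARC"
  else
    echo "${size_gb}G file: working set exceeds ARC"
  fi
done
```

A working set that fits in the ARC can be aggregated into large sequential transaction-group writes; one that doesn't cannot, which would match the 16 GB result.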
OS: Debian 8
ZFS pool version: 5000.
No compression or deduplication.
zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bigdata  1.81T  64.4G  1.75T         -     2%     3%  1.00x  ONLINE  -

root@zfs:/mnt/data/test# zdb
bigdata:
    version: 5000
    name: 'bigdata'
    state: 0
    txg: 4
    pool_guid: 16909063556944448541
    errata: 0
    hostid: 8323329
    hostname: 'zfs'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 16909063556944448541
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 8547466691663463210
            path: '/dev/disk/by-id/scsi-35000cca01b370274-part1'
            whole_disk: 1
            metaslab_array: 34
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

zpool status bigdata
  pool: bigdata
 state: ONLINE
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        bigdata                   ONLINE       0     0     0
          scsi-35000cca01b370274  ONLINE       0     0     0

errors: No known data errors
fio doesn't work with O_DIRECT on ZFS, so I had to run it without that flag. As I understand it, going through the cache should produce even better results. But that is not happening:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=./test --filename=test16G --bs=4k --iodepth=1 --size=16G --readwrite=randwrite
./test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
fio: looks like your file system does not support direct=1/buffered=0
fio: destination does not support O_DIRECT
Best Answer
This isn't a fair fight:

- dd writes all zeroes, whereas fio defaults to generating a single random block and reusing it when possible. Since all zeroes are (slightly) more compressible, this can skew the numbers.
- The block size used by dd is 1 MByte, whereas the block size used by your fio line is 4 KBytes.
- You restricted the iodepth of your fio run to only 1 (this choice compounds the smaller-than-dd block size choice above).
- dd is always allowed to use the writeback cache, but one of your fio runs was not (because you were setting direct=1)!
- Because of O_DIRECT, even if you were to try and use a higher depth with libaio, the submission wouldn't be asynchronous (see "SSD IOPS on linux, DIRECT much faster than buffered, fio"), but as the maximum outstanding I/O is limited to one (see above) this is a moot point.
- Your dd is writing so little data (4 GBytes) in comparison to the size of your memory (24 GBytes) that Linux could still have significant amounts of write data in its page cache when it finishes; you would have to do at least some sort of file sync to ensure it had really reached non-volatile storage. However, as pointed out in the comments, conv=fdatasync has been set on dd, so there will be a final fdatasync before it exits, which ensures the data isn't sitting only in volatile caches.

At the bare minimum I'd suggest starting over: do all your tests on ZFS, use fio for both the sequential and the random tests, and leave the rest of the fio line the same between runs. I'd also consider using something like end_fsync to ensure data had actually hit the disk and wasn't just sitting in volatile caches (but I can see an argument for skipping this bit).

TL;DR: I'm afraid your comparison methodology is flawed; perhaps it's better to change less between comparisons?