MongoDB and ZFS bad performance: disk always busy with reads while doing only writes

mongodb, zfs, zfsonlinux

I have huge performance issues using MongoDB (I believe it is an mmap-based DB) with ZFS on Linux.

Our MongoDB workload is almost exclusively writes. On replicas without ZFS, the disk is completely busy in ~5 s spikes when the app writes into the DB every 30 s, with no disk activity in between, so I take that as the baseline behaviour to compare against.
On replicas with ZFS, the disk is completely busy all the time, and the replicas struggle to keep up to date with the MongoDB primary. I have lz4 compression enabled on all replicas, and the space savings are great, so much less data should be hitting the disk.

On these ZFS servers, I first had the default recordsize=128k. Then I wiped the data and set recordsize=8k before resyncing the Mongo data. Then I wiped again and tried recordsize=1k. I also tried recordsize=8k with checksums disabled.
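
For reference, recreating a dataset between tests boils down to something like this (an illustrative sketch, using the dataset name from the zfs list output below):

zfs destroy -r zfs/mongo_data-rum_a
zfs create -o recordsize=8k zfs/mongo_data-rum_a     # compression=lz4 is inherited from the parent
zfs set checksum=off zfs/mongo_data-rum_a            # only for the "no checksums" run
# then restart mongod on this replica and let it resync from the primary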

Nevertheless, none of this solved anything; the disk was always kept 100% busy.
Only once, on one server with recordsize=8k, was the disk much less busy than on any non-ZFS replica, but after trying different settings and going back to recordsize=8k, the disk was 100% busy again. I could not reproduce that earlier good behaviour, and could not see it on any other replica either.

Moreover, there should be almost nothing but writes, yet on all replicas, under all these different settings, the disk is completely busy with 75% reads and only 25% writes.
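
The read/write split can be watched directly on the device and on the pool, with something like (illustrative, 5-second interval):

iostat -x 5               # %util, r/s and w/s on the underlying device
zpool iostat -v zfs 5     # read/write ops and bandwidth as the pool sees them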

(Note: I believe MongoDB is an mmap-based DB. I was told to try MongoDB in AIO mode, but I could not find how to set it, and on another server running MySQL InnoDB I realised that ZFS on Linux did not support AIO anyway.)

My servers run CentOS 6.5 with kernel 2.6.32-431.5.1.el6.x86_64:
spl-0.6.2-1.el6.x86_64
zfs-0.6.2-1.el6.x86_64

#PROD 13:44:55 root@rum-mongo-backup-1:~: zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
zfs                      216G  1.56T    32K  /zfs
zfs/mongo_data-rum_a    49.5G  1.56T  49.5G  /zfs/mongo_data-rum_a
zfs/mongo_data-rum_old   166G  1.56T   166G  /zfs/mongo_data-rum_old

#PROD 13:45:20 root@rum-mongo-backup-1:~: zfs list -t snapshot
no datasets available

#PROD 13:45:29 root@rum-mongo-backup-1:~: zfs list -o atime,devices,compression,copies,dedup,mountpoint,recordsize,casesensitivity,xattr,checksum
ATIME  DEVICES  COMPRESS  COPIES          DEDUP  MOUNTPOINT               RECSIZE         CASE  XATTR   CHECKSUM
  off       on       lz4       1            off  /zfs                        128K    sensitive     sa        off
  off       on       lz4       1            off  /zfs/mongo_data-rum_a         8K    sensitive     sa        off
  off       on       lz4       1            off  /zfs/mongo_data-rum_old       8K    sensitive     sa        off

What could be going on here? What should I look at to figure out what ZFS is doing, or which setting is misconfigured?
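
In case it is relevant, the ARC hit/miss counters can be polled while the resync runs (ZFS on Linux exposes them in /proc/spl/kstat/zfs/arcstats); a minimal sketch:

awk '$1 ~ /^(hits|misses|c_max|size)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats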

EDIT1:
Hardware: these are rented servers with 8 vcores on a Xeon 1230 or 1240, 16 or 32 GB RAM, zfs_arc_max=2147483648, and HP hardware RAID1. So the ZFS zpool sits on /dev/sda2 and does not know that there is an underlying RAID1. Even though this is a suboptimal setup for ZFS, I still do not understand why the disk is choking on reads while the DB does only writes.
I understand the many reasons, which we do not need to go over here again, why this is bad … for ZFS, and I will soon have a JBOD/no-RAID server on which I can run the same tests with ZFS's own RAID1 implementation on the sda2 partition, with /, /boot and swap on software RAID1 via mdadm.
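
For completeness, the 2 GB ARC cap is applied through the standard module parameter; a sketch, assuming the usual modprobe.d location:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=2147483648

# or, at runtime:
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max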

Best Answer

This may sound a bit crazy, but I support another application that benefits from ZFS volume management attributes, yet does not perform well on the native ZFS filesystem.

My solution?!?

XFS on top of ZFS zvols.

Why?!?

Because XFS performs well and eliminates the application-specific issues I was facing with native ZFS. ZFS zvols allow me to thin-provision volumes, add compression, enable snapshots and make efficient use of the storage pool. More important for my app, the ARC caching of the zvol reduced the I/O load on the disks.

See if you can follow this output:

# zpool status
  pool: vol0
 state: ONLINE
  scan: scrub repaired 0 in 0h3m with 0 errors on Sun Mar  2 12:09:15 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol0                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243223  ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243264  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243226  ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243185  ONLINE       0     0     0

The ZFS zvol was created with: zfs create -o volblocksize=128K -s -V 800G vol0/pprovol (note that auto-snapshots are enabled).
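
The auto-snapshot flag is the usual zfs-auto-snapshot user property (visible at the bottom of the property list below), set with something like:

zfs set com.sun:auto-snapshot=true vol0/pprovol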

# zfs get all vol0/pprovol
NAME          PROPERTY               VALUE                  SOURCE
vol0/pprovol  type                   volume                 -
vol0/pprovol  creation               Wed Feb 12 14:40 2014  -
vol0/pprovol  used                   273G                   -
vol0/pprovol  available              155G                   -
vol0/pprovol  referenced             146G                   -
vol0/pprovol  compressratio          3.68x                  -
vol0/pprovol  reservation            none                   default
vol0/pprovol  volsize                900G                   local
vol0/pprovol  volblocksize           128K                   -
vol0/pprovol  checksum               on                     default
vol0/pprovol  compression            lz4                    inherited from vol0
vol0/pprovol  readonly               off                    default
vol0/pprovol  copies                 1                      default
vol0/pprovol  refreservation         none                   default
vol0/pprovol  primarycache           all                    default
vol0/pprovol  secondarycache         all                    default
vol0/pprovol  usedbysnapshots        127G                   -
vol0/pprovol  usedbydataset          146G                   -
vol0/pprovol  usedbychildren         0                      -
vol0/pprovol  usedbyrefreservation   0                      -
vol0/pprovol  logbias                latency                default
vol0/pprovol  dedup                  off                    default
vol0/pprovol  mlslabel               none                   default
vol0/pprovol  sync                   standard               default
vol0/pprovol  refcompressratio       4.20x                  -
vol0/pprovol  written                219M                   -
vol0/pprovol  snapdev                hidden                 default
vol0/pprovol  com.sun:auto-snapshot  true                   local

Properties of ZFS zvol block device. 900GB volume (143GB actual size on disk):

# fdisk -l /dev/zd0

Disk /dev/zd0: 966.4 GB, 966367641600 bytes
3 heads, 18 sectors/track, 34952533 cylinders
Units = cylinders of 54 * 512 = 27648 bytes
Sector size (logical/physical): 512 bytes / 131072 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disk identifier: 0x48811e83

    Device Boot      Start         End      Blocks   Id  System
/dev/zd0p1              38    34952534   943717376   83  Linux
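
For reference, partitioning and formatting boil down to something like this (illustrative; mkfs.xfs typically derives sunit/swidth from the zvol's 128K optimal I/O size on its own):

# parted -s /dev/zd0 mklabel msdos
# parted -s /dev/zd0 mkpart primary 1MiB 100%
# mkfs.xfs -f /dev/zd0p1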

XFS information on ZFS block device:

# xfs_info /dev/zd0p1
meta-data=/dev/zd0p1             isize=256    agcount=32, agsize=7372768 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=235928576, imaxpct=25
         =                       sunit=32     swidth=32 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=65536, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

XFS mount options:

# mount
/dev/zd0p1 on /ppro type xfs (rw,noatime,logbufs=8,logbsize=256k,nobarrier)
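
To make that persistent across reboots, the matching /etc/fstab entry would look something like (illustrative):

/dev/zd0p1  /ppro  xfs  rw,noatime,logbufs=8,logbsize=256k,nobarrier  0 0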

Note: I also do this on top of HP Smart Array hardware RAID in some cases.

The pool creation looks like:

zpool create -o ashift=12 -f vol1 wwn-0x600508b1001ce908732af63b45a75a6b

With the result looking like:

# zpool status  -v
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 0h14m with 0 errors on Wed Feb 26 05:53:51 2014
config:

        NAME                                      STATE     READ WRITE CKSUM
        vol1                                      ONLINE       0     0     0
          wwn-0x600508b1001ce908732af63b45a75a6b  ONLINE       0     0     0