Why do large deletes, copies and moves on the ZFS NAS block all other IO?

performance, solaris, zfs

I've got a Solaris 11 ZFS-based NAS device with 12x1TB 7.2k rpm SATA drives in a mirrored configuration.

It provides two services from the same pool – an NFS server for a small VM farm and a CIFS server for a small team hosting shared files. The CIFS filesystem has dedup on, while the NFS filesystem has dedup off. Compression is off everywhere. I'm snapshotting each filesystem daily and keeping the last 14 snapshots.
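For context, the relevant settings can be checked like this (the pool and dataset names below are placeholders for my real ones):

    # Placeholder pool/dataset names
    zfs get dedup,compression tank/cifs tank/nfs
    zfs list -t snapshot -o name,used,creation -s creation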

I've run into a performance issue in cases where I'm either moving, copying or deleting a large amount of data while directly SSH'd into the NAS. Basically, the process seems to block all other IO operations, even to the point of VMs stalling because they receive disk timeouts.

I've a couple of theories as to why this should be the case, but would appreciate some insight into what I might do next.

Either:

1) the hardware isn't good enough. I'm not so convinced of this – the system is an HP X1600 (single Xeon CPU) with 30GB RAM. Although the drives are only 7.2k SATA, each should manage around 80 random IOPS, which across the mirror vdevs ought to be more than enough for this workload. Happy to be proven wrong though.

2) I've configured it wrong – more than likely. Is it worth turning dedup off everywhere? I'm working on the assumption that more RAM = good for dedup, hence giving it a reasonable splodge of RAM. (A sketch of how I'd check and disable it follows this list.)

3) Solaris being stupid about scheduling IO. Is it possible that a local rm command completely blocks IO to the nfsd? If so, how do I change this?
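For what it's worth, this is roughly how I'd check whether dedup is in play and switch it off on a dataset – the pool and dataset names are placeholders, and turning dedup off only affects newly written blocks, not data already on disk:

    # Placeholder pool/dataset names
    zpool list tank                     # the DEDUP column shows the pool-wide dedup ratio
    zfs get dedup tank/cifs tank/nfs    # per-dataset setting
    zfs set dedup=off tank/cifs         # only affects blocks written from now on
    zpool iostat -v tank 5              # watch per-vdev IO while a big rm/cp/mv is running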

Best Answer

Option #2 is most likely the reason. Dedup performs best when the dedup table (DDT) fits entirely in memory. If it doesn't, it spills over onto disk, and DDT lookups that have to hit disk are very slow – that's what produces the blocking behavior you're seeing.

I would have thought 30GB of RAM is plenty, but the size of the DDT depends directly on the amount of data being deduped and on the dedup ratio you actually achieve. The dedup property is set at the dataset level, but lookups are done across the entire pool, so there is just one pool-wide DDT.

See this zfs-discuss thread on calculating the DDT size. Essentially there is one DDT entry per unique block in the pool, so a large amount of data with a low dedup ratio means more unique blocks and a larger DDT. The system tries to keep the DDT in RAM, but parts of it can be evicted when that memory is needed for applications. Adding an L2ARC cache device can help keep DDT lookups off the main pool disks, since entries evicted from main memory (ARC) land in L2ARC instead.
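As a rough sketch (the pool name and figures below are placeholders, and the ~320 bytes per entry is just the rule of thumb commonly quoted on zfs-discuss), you can inspect the DDT and estimate its RAM footprint like this:

    # Placeholder pool name
    zdb -DD tank     # prints DDT entry counts and in-core/on-disk sizes per table

    # Rule-of-thumb estimate: ~320 bytes of ARC per DDT entry.
    # e.g. 5TB of unique data at an average 64KB block size:
    #   (5 * 2^40 / 65536) entries * 320 bytes  ~= 25GB of RAM for the DDT alone

    # If the DDT won't stay resident in ARC, an L2ARC device keeps lookups off the pool disks:
    zpool add tank cache c1t5d0     # cache device name is a placeholder

If zdb reports an in-core size bigger than the ARC can realistically hold alongside everything else, that would explain why a big rm or cp starts hammering the pool with DDT reads and starving the NFS and CIFS clients.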