CentOS – Gluster + ZFS, deadlock during benchmarking: zfs_iput_taskq at 100% CPU

centos glusterfs nfs storage zfs

First some background:
I work at a company that runs a PHP web application. We have a storage backend mounted over NFS on several webservers. The problem today is that when one webserver writes a file over NFS, the file sometimes does not appear on the other mounted clients until a few minutes later. The backend is also not redundant, so we cannot perform any "invisible" maintenance.

I've been looking at migrating to a GlusterFS solution (two or three replicated bricks/machines for redundancy). Using XFS as the storage filesystem "behind" Gluster works very well performance-wise, and Gluster does not seem to have the sync problem mentioned above.

However, I would like to use ZFS as the backend filesystem, for these reasons:

  • Cheap compression (currently storing 1.5TB uncompressed)
  • Very easy to expand the storage volume "live" (one command, compared
    to the LVM mess)
  • Snapshotting, bit-rot protection and all the other ZFS glory.
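To illustrate the "live expansion" point, growing a pool really is a single command. A minimal sketch, assuming a pool named storage and example device paths (neither is from the question):

```shell
# Add another mirrored vdev to an existing pool named "storage".
# Pool name and device paths are examples only.
zpool add storage mirror /dev/sdc /dev/sdd

# The extra capacity is available immediately; no unmount or resize step.
zpool list storage
```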

In my demo setup of the solution I have three servers running replicated Gluster, each with a ZFS backend pool on a separate disk. I'm using CentOS 6.5 with ZFS on Linux (0.6.2) and GlusterFS 3.4. I have also tried Ubuntu 13.10. Everything runs in VMware ESX.

To test this setup I mount the volume over Gluster and run BlogBench (http://www.pureftpd.org/project/blogbench) to simulate load. The issue is that toward the end of the test, the ZFS storage seems to get stuck in a deadlock: all three machines have "zfs_iput_taskq" running at 90-100% CPU, and the test freezes. Aborting the test does not clear the deadlock; the only way out seems to be a hard reboot.
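For reference, the reproduction amounts to something like the following. The volume name and mount point are examples, not taken from the question, and the exact BlogBench flags may differ by version:

```shell
# Mount the replicated Gluster volume (volume/mount names are examples).
mount -t glusterfs server1:/gv0 /mnt/gluster

# Point BlogBench at the mounted volume to generate mixed read/write load.
blogbench -d /mnt/gluster
```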

I have tried:

  • Disabling atime
  • Disabling the I/O scheduler (noop)
  • Different compression settings, and no compression
  • Running BlogBench directly on ZFS (works fine)
  • Running BlogBench on Gluster with an XFS backend (works fine)

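For anyone reproducing this, the atime and scheduler changes above look roughly like this (the pool name and block device are examples):

```shell
# Disable access-time updates on the ZFS dataset ("storage" is an example name).
zfs set atime=off storage

# Switch the backing disk to the noop elevator ("sdb" is an example device).
echo noop > /sys/block/sdb/queue/scheduler
```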
Ideas? Should I just drop ZFS and go with something else? Alternatives?

Regards Oscar

Best Answer

ZFS on Linux needs a bit of basic tuning in order to operate well under load. There's a bit of a struggle between the ZFS ARC and the Linux virtual memory subsystem.

For your CentOS systems, try the following:

Create an /etc/modprobe.d/zfs.conf configuration file. It is read when the module is loaded at boot.

Add something like:

options zfs zfs_arc_max=40000000000
options zfs zfs_vdev_max_pending=24

Here zfs_arc_max is roughly 40% of your RAM in bytes (Edit: try zfs_arc_max=1200000000). The compiled-in default for zfs_vdev_max_pending is 8 or 10, depending on version. The value should be high (48) for SSDs or low-latency drives, and perhaps 12-24 for SAS. Otherwise, leave it at the default.
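After a reboot (or module reload) you can confirm the values took effect by reading them back from sysfs, and zfs_arc_max can even be changed at runtime:

```shell
# Read back the live module parameters:
cat /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_vdev_max_pending

# zfs_arc_max can also be adjusted without a reboot:
echo 1200000000 > /sys/module/zfs/parameters/zfs_arc_max
```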

You'll also want to set some floor values in /etc/sysctl.conf:

vm.swappiness = 10
vm.min_free_kbytes = 512000
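These take effect without a reboot once reloaded:

```shell
# Apply the new sysctl settings immediately:
sysctl -p /etc/sysctl.conf

# Confirm the running values:
sysctl vm.swappiness vm.min_free_kbytes
```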

Finally, with CentOS, you may want to install tuned and tuned-utils and set your profile to virtual-guest with tuned-adm profile virtual-guest.
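On CentOS 6 that works out to something like:

```shell
# Install tuned and its utilities, enable the service, and select the profile:
yum install -y tuned tuned-utils
service tuned start
chkconfig tuned on
tuned-adm profile virtual-guest

# Verify which profile is active:
tuned-adm active
```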

Try these and see if the problem persists.

Edit:

Run zfs set xattr=sa storage. Here's why. You may have to wipe the volumes and start again (I'd definitely recommend doing so).
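In full, with a verification step (the dataset name "storage" is taken from the command above). Note that xattr=sa only affects newly written files, which is why recreating the data is recommended:

```shell
# Store extended attributes as system attributes in the dnode
# instead of in hidden xattr directories (far fewer IOPS per file).
zfs set xattr=sa storage

# Verify the property:
zfs get xattr storage
```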