Does ZFS really stripe across every vdev, even in very large zpools?

iops zfs

I have read that ZFS stripes the data in a zpool across all top-level vdevs, assuming all vdevs are added at the beginning of the pool's life. Everything I have read seems to regard this as a good thing. But it seems to me that, for deployments with many disks, this is not going to lead to good overall performance from all those disks in a multi-user (or even just multi-process) environment.

Suppose for example I have 96 disks, which I use to create 12 vdevs of 8 disks each, all of which I add to my zpool. Then I set it loose on users and they fill it up with all manner of craziness. Some files are tens of gigabytes, others are small user application configuration files etc.
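
For concreteness, that layout could be created with something like the following; the device names are placeholders and raidz2 is just one possible choice of vdev type:

    # Sketch: the first two of twelve 8-disk raidz2 vdevs; the remaining
    # ten would follow the same pattern. Device names are placeholders.
    zpool create tank \
        raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh \
        raidz2 /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp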

Later, user A wants to copy some multi-gigabyte files. She starts an rsync or somesuch, and experiences blazing performance from the underlying sequential reads off the 12 striped vdevs. But then user B fires up another application that also requests fairly large chunks of data at a time. Now drive heads are constantly getting pulled off user A's rsync to deal with user B, and although each application is individually relatively sequential, the 96 disks are all involved in both users' requests, and see seek patterns and performance more consistent with random I/O.

In this configuration of 12 vdevs of 8 disks each, every vdev still has 8 disks' worth of performance, so I'd expect sequential I/O to be very good even without additional striping across other vdevs. Wouldn't it be better for ZFS to put many gigabytes on one vdev before moving on to another one? (In my experiments I'm getting stripes of around 500k.) That way, user A's reads would only have a 1/12 chance of hitting the same disks as user B's reads, and both users would get performance consistent with sequential I/O most of the time.
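
For what it's worth, one way to watch how the load spreads across the top-level vdevs while such a copy is running is per-vdev iostat (the pool name is just a placeholder):

    # report per-vdev bandwidth and IOPS every 5 seconds
    zpool iostat -v tank 5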

Is there a way to get good performance from ZFS in this configuration/workload?

Best Answer

ZFS always stripes across all vdevs, although how far a file spreads depends on how many blocks it needs - small files will often fit into a single block, and thus land on a single vdev, unless they belong to a dataset configured with copies=2 or copies=3.
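
As an illustration, both of those knobs are ordinary dataset properties (the dataset name below is just an example):

    # default recordsize is 128K, so anything smaller ends up as one block on one vdev
    zfs get recordsize,copies tank/home

    # keep two copies of each block; ZFS places them on different vdevs when it can
    zfs set copies=2 tank/home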

No, you can't change that behavior or split the load without creating separate pools.
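
If you want that kind of isolation between workloads, the only way to get it is several smaller pools instead of one big one, for example (pool names and devices are placeholders):

    # two independent pools: I/O against tank1 never touches tank2's disks
    zpool create tank1 raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
    zpool create tank2 raidz2 /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp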

In order to improve performance over such a striped setup, ZFS includes its own I/O scheduler in the ZIO component (which is why the deadline or noop schedulers are recommended on Linux).
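
As an example, on Linux the per-disk scheduler can be checked and set by hand (sdX below is a placeholder; on newer multi-queue kernels the no-op scheduler is called none instead of noop):

    # show the scheduler currently used for one of the pool's member disks
    cat /sys/block/sdX/queue/scheduler

    # hand ordering decisions to ZFS's own ZIO scheduler
    echo noop > /sys/block/sdX/queue/scheduler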

Another layer that helps with such workloads is the ARC, which among other things includes a prefetch cache. You can extend the ARC with an L2ARC on separate fast devices; the equivalent for synchronous writes is a SLOG (dedicated ZIL devices).
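
Both are attached with zpool add; the device names below are placeholders for fast SSD/NVMe devices:

    # L2ARC: extend the read cache onto a fast device
    zpool add tank cache /dev/nvme0n1

    # SLOG: dedicated (here mirrored) ZIL devices to absorb synchronous writes
    zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1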