With 20 disks you have a lot of options. I'm assuming you already have drives for the OS, so the 20 disks would be dedicated data drives. In my Sun Fire x4540 (48 drives), I've allocated 20 drives in a mirrored setup and 24 in a striped raidz1 config (6 disks per raidz and 4 striped vdevs). Two disks are for the OS and the remainder are spares.
Which controller are you using? You may want to refer to: ZFS SAS/SATA controller recommendations
Don't use hardware RAID if you can avoid it. ZFS thrives when drives are presented to the OS as raw disks.
Your raidz1 performance increases with the number of raidz1 groups striped together. With 20 disks, you could use 4 raidz1 groups of 5 disks each, or 5 groups of 4 disks. Performance on the latter will be better because it has more vdevs. Fault tolerance in either setup is one failed disk per group (e.g., potentially 4 or 5 disks could fail under the right conditions).
The read performance of a raidz1 or raidz2 group is roughly equivalent to that of a single disk. With the above setup, your theoretical maximum read speed would therefore be equivalent to that of 4 or 5 disks (one disk's worth per raidz1 vdev).
Going with the mirrored setup would maximize speed, and you could sustain one failed disk per mirrored pair (e.g., 10 disks could possibly fail if they're the right ones), but you will run into the bandwidth limitations of your controller at that point. You may not need that type of speed, so I'd suggest a combination of raidz1 groups and stripes instead.
Either way, you should consider a hot-spare arrangement: perhaps 18 disks in a mirrored arrangement with 2 hot-spares, or 3 striped 6-disk raidz1 groups with 2 hot-spares...
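Hot-spares can be added to an existing pool at any time. A minimal sketch, with hypothetical device names:

    # Attach two hot-spares; ZFS pulls one in automatically when a
    # member disk fails (device names are placeholders).
    zpool add vol1 spare c5t5d0 c5t6d0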
When I built my first ZFS setup, I used this note from Sun to help understand RAID level performance...
http://blogs.oracle.com/relling/entry/zfs_raid_recommendations_space_performance
Examples with 20 disks:
20-disk mirrored pairs.
  pool: vol1
 state: ONLINE
 scrub: scrub completed after 3h16m with 0 errors on Fri Nov 26 09:45:54 2010
config:

        NAME        STATE     READ WRITE CKSUM
        vol1        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c9t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
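For reference, a pool like this is built by listing mirrored pairs at creation time. A sketch using the device names from the output above:

    # Stripe of 10 two-way mirrors (RAID 10 equivalent); each "mirror"
    # keyword starts a new mirrored vdev.
    zpool create vol1 \
        mirror c4t1d0 c5t1d0 \
        mirror c6t1d0 c7t1d0 \
        mirror c8t1d0 c9t1d0 \
        mirror c4t2d0 c5t2d0 \
        mirror c6t2d0 c7t2d0 \
        mirror c8t2d0 c9t2d0 \
        mirror c4t3d0 c5t3d0 \
        mirror c6t3d0 c7t3d0 \
        mirror c8t3d0 c9t3d0 \
        mirror c4t4d0 c5t4d0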
20-disk striped raidz1 consisting of 4 stripes of 5-disk raidz1 vdevs.
  pool: vol1
 state: ONLINE
 scrub: scrub completed after 14h38m with 0 errors on Fri Nov 26 21:07:53 2010
config:

        NAME        STATE     READ WRITE CKSUM
        vol1        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
            c9t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
            c9t6d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
            c9t7d0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
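And the equivalent creation command for the striped raidz1 layout, again using the device names from the output above:

    # Four 5-disk raidz1 groups striped into one pool; each "raidz1"
    # keyword starts a new group.
    zpool create vol1 \
        raidz1 c6t4d0 c7t4d0 c8t4d0 c9t4d0 c4t5d0 \
        raidz1 c6t5d0 c7t5d0 c8t5d0 c9t5d0 c4t6d0 \
        raidz1 c6t6d0 c7t6d0 c8t6d0 c9t6d0 c4t7d0 \
        raidz1 c6t7d0 c7t7d0 c8t7d0 c9t7d0 c6t0d0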
Edit:
Or if you want two pools of storage, you could break your 20 disks into two groups:
- 10 disks in mirrored pairs (5 per controller), and
- 3 stripes of 3-disk raidz1 groups, and
- 1 global spare...
That gives you both types of storage, good redundancy, a spare drive, and you can test the performance of each pool back-to-back.
One of the servers that I administer runs the type of configuration that you describe. It has six 1TB hard drives with a LUKS-encrypted RAIDZ pool on them. I also have two 3TB hard drives in a LUKS-encrypted ZFS mirror that are swapped out every week to be taken off-site. The server has been using this configuration for about three years, and I've never had a problem with it.
If you need ZFS with encryption on Linux, then I recommend this setup. I'm using ZFS-Fuse, not ZFS on Linux, but I believe that has no bearing on the result, other than that ZFS on Linux will probably perform better than the setup I am using.
In this setup, redundant data is encrypted several times because LUKS is not "aware" of Z-RAID. In a LUKS-on-mdadm solution, data is encrypted once and merely written to the disks multiple times.
Keep in mind that LUKS isn't aware of RAID. It only knows that it's sitting on top of a block device. If you use mdadm to create a RAID device and then run cryptsetup luksFormat on it, it is mdadm that replicates the encrypted data to the underlying storage devices, not LUKS.
Question 2.8 of the LUKS FAQ addresses whether encryption should be on top of RAID or the other way around. It provides the following diagram.
Filesystem      <- top
    |
Encryption
    |
RAID
    |
Raw partitions
    |
Raw disks       <- bottom
Because ZFS combines the RAID and filesystem functionality, your solution will need to look like the following.
RAID-Z and ZFS Filesystem   <- top
    |
Encryption
    |
Raw partitions (optional)
    |
Raw disks                   <- bottom
I've listed the raw partitions as optional because ZFS expects to use raw block storage rather than a partition. While you could create your zpool on partitions, it's not recommended: it adds a useless level of management, and it has to be taken into account when calculating the offset for partition block alignment.
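Concretely, the stack is built bottom-up: LUKS containers on the raw disks, then the pool on the mapper devices. A minimal sketch; the disk names, key file, and pool name "tank" are all assumptions to adapt:

    # Create and open a LUKS container on each whole disk
    # (hypothetical disks sda..sdc; luksFormat prompts for confirmation).
    for d in sda sdb sdc; do
        cryptsetup luksFormat /dev/$d /root/zfs.key
        cryptsetup luksOpen --key-file /root/zfs.key /dev/$d crypt_$d
    done

    # Build the raidz pool on the decrypted mapper devices, not the raw disks.
    zpool create tank raidz \
        /dev/mapper/crypt_sda /dev/mapper/crypt_sdb /dev/mapper/crypt_sdc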
Wouldn't it significantly impede write performance? [...] My CPU supports Intel AES-NI.
There shouldn't be a performance problem as long as you choose an encryption method that's supported by your AES-NI driver. If you have cryptsetup 1.6.0 or newer, you can run cryptsetup benchmark to see which algorithm will provide the best performance.
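For example, something like the following; the aes-xts-plain64 cipher and 512-bit key size are placeholder choices, not a recommendation:

    # Compare cipher throughput on this machine (cryptsetup >= 1.6.0).
    cryptsetup benchmark

    # Format with whichever cipher the benchmark favors.
    cryptsetup luksFormat --cipher aes-xts-plain64 --key-size 512 /dev/sdX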
This question on recommended options for LUKS may also be of value.
Given that you have hardware encryption support, you are more likely to face performance issues due to partition misalignment.
ZFS on Linux has added the ashift property to the zpool command to allow you to specify the sector size of your hard drives. According to the linked FAQ, ashift=12 tells it that you are using drives with a 4K sector size.
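In practice it's an option at pool creation time; the pool and device names here follow the sketch above:

    # Force 4K (2^12-byte) alignment for every vdev in the pool.
    zpool create -o ashift=12 tank raidz \
        /dev/mapper/crypt_sda /dev/mapper/crypt_sdb /dev/mapper/crypt_sdc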
The LUKS FAQ states that a LUKS partition has an alignment of 1 MB. Questions 6.12 and 6.13 discuss this in detail and also provide advice on how to make the LUKS partition header larger. However, I'm not sure it's possible to make it large enough to ensure that your ZFS filesystem will be created on a 4K boundary. I'd be interested in hearing how this works out for you if this is a problem you need to solve. Since you are using 2TB drives, you might not face this problem.
Will ZFS be aware of disk failures when operating on device-mapper LUKS containers as opposed to physical devices?
ZFS will be aware of disk failures to the extent that its reads and writes start failing. ZFS requires block storage and doesn't know or care about the specifics of that storage and where it comes from. It only keeps track of any read, write, or checksum errors that it encounters. It's up to you to monitor the health of the underlying storage devices.
The ZFS documentation has a section on troubleshooting which is worth reading. The section on replacing or repairing a damaged device describes what you might encounter during a failure scenario and how you might resolve it. You'd do the same thing here that you would for devices that don't have ZFS. Check the syslog for messages from your SCSI driver, HBA or HD controller, and/or SMART monitoring software and then act accordingly.
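As a sketch of what that looks like in practice (pool and device names are hypothetical):

    # Print only pools that have problems.
    zpool status -x

    # Replace a failed member with a fresh LUKS container.
    zpool replace tank /dev/mapper/crypt_sdb /dev/mapper/crypt_sdf

    # Inspect the physical drive underneath with SMART tooling.
    smartctl -a /dev/sdb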
How about deduplication and other ZFS features?
All of the ZFS features will work the same regardless of whether the underlying block storage is encrypted or not.
Summary
- ZFS on LUKS-encrypted devices works well.
- If you have hardware encryption, you won't see a performance hit as long as you use an encryption method that's supported by your hardware. Use cryptsetup benchmark to see what will work best.
- Think of ZFS as RAID and filesystem combined into a single entity. See the ASCII diagram above for where it fits into the storage stack.
- You'll need to unlock each LUKS-encrypted block device that the ZFS filesystem uses (see the crypttab sketch after this list).
- Monitor the health of the storage hardware the same way you do now.
- Be mindful of the filesystem's block alignment if you are using drives with 4K sectors. You may need to experiment with cryptsetup luksFormat options or other settings to get the alignment you need for acceptable speed.
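For the unlocking step, one approach is /etc/crypttab, so each container is opened at boot before the pool is imported. A sketch with placeholder UUIDs and names; substitute those of your devices:

    # /etc/crypttab — one line per LUKS container backing the pool.
    # <mapper name>  <device>        <key file>       <options>
    crypt_sda    UUID=aaaa-0001    /root/zfs.key    luks
    crypt_sdb    UUID=aaaa-0002    /root/zfs.key    luks
    crypt_sdc    UUID=aaaa-0003    /root/zfs.key    luks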
February 2020 Update
It's been six years since I wrote this answer. ZFS on Linux v0.8.0 supports native encryption, which you should consider if you don't have a specific need for LUKS.
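A minimal sketch of the native route; the pool name and property values are assumptions, so check the ZFS on Linux documentation for the full set:

    # Native ZFS encryption (ZoL >= 0.8.0): no LUKS layer needed.
    zpool create \
        -O encryption=aes-256-gcm \
        -O keyformat=passphrase \
        -O keylocation=prompt \
        tank raidz /dev/sda /dev/sdb /dev/sdc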
Best Answer
Synchronous write mode ensures that writes end up in a persistent location immediately. With asynchronous writes, data is cached in RAM and the write call returns right away; the filesystem then schedules the actual writes to their final location (the hard disks).
In ZFS's case, the point of the ZIL / SLOG is to act as fast interim persistent storage that makes synchronous mode viable: the writing client is assured that its writes are final as soon as they land in the log. Without it, the filesystem would need to write the blocks directly to the hard disks, which makes synchronous mode slow.
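For reference, a dedicated SLOG is attached to an existing pool like this; the pool name and NVMe device are hypothetical:

    # Add a fast device as a separate intent log (SLOG) for sync writes.
    zpool add tank log /dev/nvme0n1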
In your case, if you want to guarantee full-speed writing of all 40 GB of data, then you would need to increase your RAM to cover the size of the file.
However, since the filesystem starts writing to the hard disks immediately, you don't need 40 GB of memory to get full speed for your writes. For example, by the time the client has written 20 GB of data, 10 GB could be in the RAM cache and the remaining 10 GB already on the hard disks.
So, you need to do some benchmarking to see how much RAM you need in order to get full-speed writes.