With 20 disks you have a lot of options. I'm assuming you already have drives for the OS, so the 20 disks would be dedicated data drives. In my Sun Fire x4540 (48 drives), I've allocated 20 drives in a mirrored setup and 24 in a striped raidz1 config (6 disks per raidz and 4 striped vdevs). Two disks are for the OS and the remainder are spares.
Which controller are you using? You may want to refer to: ZFS SAS/SATA controller recommendations
Don't use hardware RAID if you can avoid it. ZFS thrives when drives are presented to the OS as raw disks.
Your raidz1 performance increases with the number of raidz1 groups striped together. With 20 disks, you could use 4 raidz1 groups of 5 disks each, or 5 groups of 4 disks. Performance on the latter will be better because it has more vdevs. Fault tolerance in either setup is one failed disk per group (e.g., potentially 4 or 5 disks could fail under the right conditions).
The read performance of a raidz1 or raidz2 group is roughly equivalent to that of a single disk. With the above setup, your theoretical maximum read speed would therefore be equivalent to that of 4 or 5 disks (one disk's worth per raidz1 vdev).
Going with the mirrored setup would maximize speed, and you could sustain one failed disk per mirrored pair (e.g., 10 disks could possibly fail if they're the right ones), but you will run into the bandwidth limitations of your controller at that point. You may not need that type of speed, so I'd suggest a combination of raidz1 groups and stripes instead.
Either way, you should consider a hot-spare arrangement: perhaps 18 disks in a mirrored arrangement with 2 hot-spares, or 3 striped 6-disk raidz1 groups with 2 hot-spares...
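Hot-spares can be added to an existing pool at any time. A minimal sketch, with hypothetical device names:

    # Attach two hot-spares; ZFS pulls one in automatically when a
    # member disk fails (device names are placeholders).
    zpool add vol1 spare c5t5d0 c5t6d0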
When I built my first ZFS setup, I used this note from Sun to help understand RAID level performance...
http://blogs.oracle.com/relling/entry/zfs_raid_recommendations_space_performance
Examples with 20 disks:
20-disk mirrored pairs.
  pool: vol1
 state: ONLINE
 scrub: scrub completed after 3h16m with 0 errors on Fri Nov 26 09:45:54 2010
config:

        NAME        STATE     READ WRITE CKSUM
        vol1        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c9t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
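For reference, a pool like this is built by listing mirrored pairs at creation time. A sketch using the device names from the output above:

    # Stripe of 10 two-way mirrors (RAID 10 equivalent); each "mirror"
    # keyword starts a new mirrored vdev.
    zpool create vol1 \
        mirror c4t1d0 c5t1d0 \
        mirror c6t1d0 c7t1d0 \
        mirror c8t1d0 c9t1d0 \
        mirror c4t2d0 c5t2d0 \
        mirror c6t2d0 c7t2d0 \
        mirror c8t2d0 c9t2d0 \
        mirror c4t3d0 c5t3d0 \
        mirror c6t3d0 c7t3d0 \
        mirror c8t3d0 c9t3d0 \
        mirror c4t4d0 c5t4d0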
20-disk striped raidz1 consisting of 4 stripes of 5-disk raidz1 vdevs.
  pool: vol1
 state: ONLINE
 scrub: scrub completed after 14h38m with 0 errors on Fri Nov 26 21:07:53 2010
config:

        NAME        STATE     READ WRITE CKSUM
        vol1        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
            c9t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
            c9t6d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
            c9t7d0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
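And the equivalent creation command for the striped raidz1 layout, again using the device names from the output above:

    # Four 5-disk raidz1 groups striped into one pool; each "raidz1"
    # keyword starts a new group.
    zpool create vol1 \
        raidz1 c6t4d0 c7t4d0 c8t4d0 c9t4d0 c4t5d0 \
        raidz1 c6t5d0 c7t5d0 c8t5d0 c9t5d0 c4t6d0 \
        raidz1 c6t6d0 c7t6d0 c8t6d0 c9t6d0 c4t7d0 \
        raidz1 c6t7d0 c7t7d0 c8t7d0 c9t7d0 c6t0d0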
Edit:
Or if you want two pools of storage, you could break your 20 disks into two groups:
- 10 disks in mirrored pairs (5 per controller), and
- 3 stripes of 3-disk raidz1 groups, and
- 1 global spare...
That gives you both types of storage, good redundancy, a spare drive, and you can test the performance of each pool back-to-back.
One of the servers that I administer runs the type of configuration that you describe. It has six 1TB hard drives with a LUKS-encrypted RAIDZ pool on them. I also have two 3TB hard drives in a LUKS-encrypted ZFS mirror that are swapped out every week to be taken off-site. The server has been using this configuration for about three years, and I've never had a problem with it.
If you need ZFS with encryption on Linux, then I recommend this setup. I'm using ZFS-Fuse, not ZFS on Linux, but I believe that has no bearing on the result, other than that ZFS on Linux will probably perform better than the setup I am using.
In this setup, redundant data is encrypted several times because LUKS is not "aware" of Z-RAID. In a LUKS-on-mdadm solution, data is encrypted once and merely written to the disks multiple times.
Keep in mind that LUKS isn't aware of RAID. It only knows that it's sitting on top of a block device. If you use mdadm to create a RAID device and then run cryptsetup luksFormat on it, it is mdadm that replicates the encrypted data to the underlying storage devices, not LUKS.
Question 2.8 of the LUKS FAQ addresses whether encryption should be on top of RAID or the other way around. It provides the following diagram.
Filesystem      <- top
    |
Encryption
    |
RAID
    |
Raw partitions
    |
Raw disks       <- bottom
Because ZFS combines the RAID and filesystem functionality, your solution will need to look like the following.
RAID-Z and ZFS Filesystem   <- top
    |
Encryption
    |
Raw partitions (optional)
    |
Raw disks                   <- bottom
I've listed the raw partitions as optional because ZFS expects to use raw block storage rather than a partition. While you could create your zpool on partitions, it's not recommended: it adds a useless level of management, and it has to be taken into account when calculating the offset for partition block alignment.
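Concretely, the stack is built bottom-up: LUKS containers on the raw disks, then the pool on the mapper devices. A minimal sketch; the disk names, key file, and pool name "tank" are all assumptions to adapt:

    # Create and open a LUKS container on each whole disk
    # (hypothetical disks sda..sdc; luksFormat prompts for confirmation).
    for d in sda sdb sdc; do
        cryptsetup luksFormat /dev/$d /root/zfs.key
        cryptsetup luksOpen --key-file /root/zfs.key /dev/$d crypt_$d
    done

    # Build the raidz pool on the decrypted mapper devices, not the raw disks.
    zpool create tank raidz \
        /dev/mapper/crypt_sda /dev/mapper/crypt_sdb /dev/mapper/crypt_sdc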
Wouldn't it significantly impede write performance? [...] My CPU supports Intel AES-NI.
There shouldn't be a performance problem as long as you choose an encryption method that's supported by your AES-NI driver. If you have cryptsetup 1.6.0 or newer, you can run cryptsetup benchmark to see which algorithm will provide the best performance.
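For example, something like the following; the aes-xts-plain64 cipher and 512-bit key size are placeholder choices, not a recommendation:

    # Compare cipher throughput on this machine (cryptsetup >= 1.6.0).
    cryptsetup benchmark

    # Format with whichever cipher the benchmark favors.
    cryptsetup luksFormat --cipher aes-xts-plain64 --key-size 512 /dev/sdX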
This question on recommended options for LUKS may also be of value.
Given that you have hardware encryption support, you are more likely to face performance issues due to partition misalignment.
ZFS on Linux has added the ashift property to the zpool command to allow you to specify the sector size of your hard drives. According to the linked FAQ, ashift=12 tells it that you are using drives with a 4K sector size.
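In practice it's an option at pool creation time; the pool and device names here follow the sketch above:

    # Force 4K (2^12-byte) alignment for every vdev in the pool.
    zpool create -o ashift=12 tank raidz \
        /dev/mapper/crypt_sda /dev/mapper/crypt_sdb /dev/mapper/crypt_sdc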
The LUKS FAQ states that a LUKS partition has an alignment of 1 MB. Questions 6.12 and 6.13 discuss this in detail and also provide advice on how to make the LUKS partition header larger. However, I'm not sure it's possible to make it large enough to ensure that your ZFS filesystem will be created on a 4K boundary. I'd be interested in hearing how this works out for you if this is a problem you need to solve. Since you are using 2TB drives, you might not face this problem.
Will ZFS be aware of disk failures when operating on device-mapper LUKS containers as opposed to physical devices?
ZFS will be aware of disk failures to the extent that its reads and writes start failing. ZFS requires block storage and doesn't know or care about the specifics of that storage and where it comes from. It only keeps track of any read, write, or checksum errors that it encounters. It's up to you to monitor the health of the underlying storage devices.
The ZFS documentation has a section on troubleshooting which is worth reading. The section on replacing or repairing a damaged device describes what you might encounter during a failure scenario and how you might resolve it. You'd do the same thing here that you would for devices that don't have ZFS. Check the syslog for messages from your SCSI driver, HBA or HD controller, and/or SMART monitoring software and then act accordingly.
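As a sketch of what that looks like in practice (pool and device names are hypothetical):

    # Print only pools that have problems.
    zpool status -x

    # Replace a failed member with a fresh LUKS container.
    zpool replace tank /dev/mapper/crypt_sdb /dev/mapper/crypt_sdf

    # Inspect the physical drive underneath with SMART tooling.
    smartctl -a /dev/sdb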
How about deduplication and other ZFS features?
All of the ZFS features will work the same regardless of whether the underlying block storage is encrypted or not.
Summary
- ZFS on LUKS-encrypted devices works well.
- If you have hardware encryption, you won't see a performance hit as long as you use an encryption method that's supported by your hardware. Use cryptsetup benchmark to see what will work best.
- Think of ZFS as RAID and filesystem combined into a single entity. See the ASCII diagram above for where it fits into the storage stack.
- You'll need to unlock each LUKS-encrypted block device that the ZFS filesystem uses (see the crypttab sketch after this list).
- Monitor the health of the storage hardware the same way you do now.
- Be mindful of the filesystem's block alignment if you are using drives with 4K sectors. You may need to experiment with cryptsetup luksFormat options or other settings to get the alignment you need for acceptable speed.
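For the unlocking step, one approach is /etc/crypttab, so each container is opened at boot before the pool is imported. A sketch with placeholder UUIDs and names; substitute those of your devices:

    # /etc/crypttab — one line per LUKS container backing the pool.
    # <mapper name>  <device>        <key file>       <options>
    crypt_sda    UUID=aaaa-0001    /root/zfs.key    luks
    crypt_sdb    UUID=aaaa-0002    /root/zfs.key    luks
    crypt_sdc    UUID=aaaa-0003    /root/zfs.key    luks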
February 2020 Update
It's been six years since I wrote this answer. ZFS on Linux v0.8.0 supports native encryption, which you should consider if you don't have a specific need for LUKS.
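A minimal sketch of the native route; the pool name and property values are assumptions, so check the ZFS on Linux documentation for the full set:

    # Native ZFS encryption (ZoL >= 0.8.0): no LUKS layer needed.
    zpool create \
        -O encryption=aes-256-gcm \
        -O keyformat=passphrase \
        -O keylocation=prompt \
        tank raidz /dev/sda /dev/sdb /dev/sdc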
Best Answer
Synchronous write mode ensures that writes end up in a persistent location immediately. With asynchronous writes, data is cached in RAM and the write call returns right away; the filesystem then schedules the actual writes to their final location (the hard disks).
In ZFS's case, the point of the ZIL / SLOG is to act as fast interim persistent storage that makes synchronous mode viable: the writing client is assured that its writes are final as soon as they land in the log. Without it, the filesystem would need to write the blocks directly to the hard disks, which makes synchronous mode slow.
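For reference, a dedicated SLOG is attached to an existing pool like this; the pool name and NVMe device are hypothetical:

    # Add a fast device as a separate intent log (SLOG) for sync writes.
    zpool add tank log /dev/nvme0n1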
In your case, if you want to guarantee full-speed writing of all 40 GB of data, then you would need to increase your RAM to cover the size of the file.
However, since the filesystem starts writing to the hard disks immediately, you don't need 40 GB of memory to get full speed for your writes. For example, by the time the client has written 20 GB of data, 10 GB could be in the RAM cache and the remaining 10 GB already on the hard disks.
So, you need to do some benchmarking to see how much RAM you need in order to get full-speed writes.