Before we go into specifics, consider your use case. Are you storing photos, MP3s and DVD rips? If so, you might not care whether you permanently lose a single block from the array. On the other hand, if it's important data, losing even one block could be a disaster.
The claim that RAIDZ-1 is "not good enough for real-world failures" comes down to this: by the time a disk fails and reconstruction begins, you are likely to have a latent media error on one of the surviving disks. The same logic applies to RAID5.
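To put rough numbers on that (the disk size and error rate here are illustrative assumptions, not measurements): consumer drives are commonly specced at one unrecoverable read error (URE) per 10^14 bits, and rebuilding a four-disk single-parity array of 4 TB drives means reading all three survivors end to end.

    URE spec      : 1 per 1e14 bits read (typical consumer drive spec)
    Data re-read  : 3 surviving disks x 4 TB = 12 TB ≈ 0.96e14 bits
    Expected UREs : 0.96e14 / 1e14 ≈ 1 per rebuild

Real drives often beat their spec sheet, but the trend is clear: the bigger the disks, the closer a single-parity rebuild gets to an expected failure.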
ZFS mitigates this failure mode to some extent. If a RAID5 array can't be reconstructed, you are pretty much out of luck: copy your (remaining) data off and rebuild from scratch. ZFS, on the other hand, will reconstruct everything except the bad chunk and let the administrator "clear" the errors. You'll lose a file, or a portion of one, but you won't lose the entire array. And ZFS's block checksums mean you will be reliably told that there's an error. Without them, I believe it's possible (although unlikely) that multiple errors will result in a rebuild apparently succeeding while giving you back bad data.
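As a sketch of what that looks like in practice (the pool name tank is just an example):

    zpool status -v tank    # -v lists the files affected by permanent errors
    zpool clear tank        # reset the error counters once you've restored or deleted them

You restore the damaged files from backup (or accept the loss), clear the errors, and the rest of the pool carries on.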
Since ZFS is a "Rampant Layering Violation," it also knows which areas of the disks hold no data, and can skip them during the rebuild. So if your array is half empty, you're roughly half as likely to hit a rebuild error.
You can reduce the likelihood of these rebuild errors on any RAID level by doing regular scrubs of your array ("zpool scrub" on ZFS, a check action on Linux md). Similar commands/processes exist for other RAID implementations; LSI/Dell PERC RAID cards, for example, call this "patrol read." These read every sector, which may help the disk drives find failing sectors and reassign them before the errors become permanent. If an error is permanent, the RAID layer (ZFS/md/RAID card/whatever) can rebuild the data from parity.
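For example, assuming a pool named tank and an md array at md0 (run these from cron, e.g. monthly):

    zpool scrub tank                             # ZFS: read and verify every allocated block
    zpool status tank                            # shows scrub progress and any errors found
    echo check > /sys/block/md0/md/sync_action   # Linux md: start a consistency check
    cat /sys/block/md0/md/mismatch_cnt           # md: mismatches found by the last check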
Even if you use RAIDZ2 or RAID6, regular scrubs are important.
One final note: RAID of any sort is not a substitute for backups. It won't protect you against accidental deletion, ransomware, etc. That said, regular ZFS snapshots can be part of a backup strategy.
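A minimal sketch of that, with hypothetical pool, dataset and host names; snapshots give you the restore points, and send/recv replicates them to a second machine, which is what turns them from local undo into a backup:

    zfs snapshot tank/data@2017-04-01    # instant, cheap point-in-time copy
    zfs send tank/data@2017-04-01 | ssh backuphost zfs recv backup/data

Later snapshots can be sent incrementally with zfs send -i, so only the changed blocks cross the wire.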
How do companies manage large file servers (e.g. 17 TB) and their related backups on a very tight budget?
It depends on the performance and budget constraints. For example, Backblaze (a cloud backup company) has a very tight budget per TB, but doesn't need top performance or response time for the data. Another company might need performance on the cheap, so it reduces the amount of data it has to store (through deduplication, pruning of old backup data, or simply dropping data the business doesn't actually need).
Is it OK to use ZFS (or BTRFS) on a single virtual disk just for its copy-on-write nature, to eliminate the need for fsck? (i.e. not for its RAID, snapshot, etc. features)
I would not use BTRFS on any non-development system where data safety is the first principle. I would use ZFS, because I have not found a cheaper and safer alternative: filesystems with similar features from IBM, NetApp, etc. are more expensive, and other free filesystems are either not mature enough (HAMMER2, BTRFS) or lack essential features (ext2/3/4, ReiserFS, etc.).
Your specific answers: I would prefer plan 3, but plan 2 would also work.
Contrary to much FUD floating around the web, ZFS does not care about the underlying storage, whether it be physical, virtual or mixed. Of course, it can only be as good/fast/safe as that storage, and can only reason about what it is given. This means your performance will not be as good as on native disks, your troubleshooting involves both layers, and both layers can have problems. If you know this and provision for these downsides, I see it as a viable alternative.
You still have comfort features like send/recv, snapshots, CoW, checksums and block-level deduplication. What you sacrifice is mostly performance and possibly safety (if your SAN is just one single disk, for example). You should also tune the ZFS sector size (ashift) to your underlying storage when creating the pool. You can set a different record size per file system afterwards, but the pool's ashift cannot be changed without destroying and recreating it.
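For illustration (on OpenZFS-based systems; the device and dataset names are assumptions):

    zpool create -o ashift=12 tank da0   # ashift=12 means 2^12 = 4K sectors; fixed per vdev at creation
    zfs set recordsize=16K tank/db       # per-filesystem record size, adjustable any time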
But I would first thoroughly evaluate whether rewriting those scripts and integrations is really as time-consuming as you imagine. Also, even appliances like FreeNAS (or napp-it, to name an alternative) usually support user scripts or user-definable plugins or modules, which survive updates and work well with the appliance. (The new FreeNAS 10, aka Corral, replaced its plugin architecture with Docker; if you are familiar with Docker, that might be an alternative.)
Best Answer
I use both, but all my important data resides on NexentaStor. FreeNAS is pretty reliable, but it's not nearly as robust as NexentaStor, and it's just not designed for production environments the way NexentaStor is.