FreeBSD – ZFS over iSCSI high-availability solution

Tags: freebsd, iscsi, storage, zfs

I am considering a ZFS/iSCSI-based architecture for an HA, scale-out, shared-nothing database platform. It would run on wimpy nodes built from plain PC hardware, all running FreeBSD 9.

Will it work? What are possible drawbacks?

Architecture

  1. Storage nodes have direct attached cheap SATA/SAS drives. Each disk is exported as a separate iSCSI LUN. Note that no RAID (neither HW nor SW), partitioning, volume management or anything like that is involved at this layer. Just 1 LUN per physical disk.

  2. Database nodes run ZFS. A ZFS mirrored vdev is created from iSCSI LUNs from 3 different storage nodes. A ZFS pool is created on top of the vdev, and within that a filesystem which in turn backs a database.

  3. When a disk or a storage node fails, the respective ZFS vdev will continue to operate in degraded mode (but still have 2 mirrored disks). A different (new) disk is assigned to the vdev to replace the failed disk or storage node. ZFS resilvering takes place. A failed storage node or disk is always completely recycled should it become available again.

  4. When a database node fails, the LUNs previously used by that node are free. A new database node is booted, which recreates the ZFS vdev/pool from the LUNs the failed database node left behind. There is no need for database-level replication for high-availability reasons.
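To make steps 2 to 4 concrete, here is a minimal sketch of the `zpool` invocations they would probably map to, wrapped in small Python helpers so the argument lists are explicit. The pool name `dbpool` and the device names `da1` to `da4` are hypothetical placeholders for the iSCSI-backed disks. It also assumes that "recreating" the pool in step 4 means importing the existing pool with `zpool import -f`, since the failed node never exported it.

```python
# Sketch only: pool and device names below are made up for illustration.

def create_mirror_cmd(pool, devices):
    """argv for creating a pool backed by one N-way mirror vdev (step 2)."""
    if len(devices) < 2:
        raise ValueError("a mirror needs at least two devices")
    return ["zpool", "create", pool, "mirror", *devices]


def replace_disk_cmd(pool, failed, replacement):
    """argv for swapping out a failed device; ZFS then resilvers (step 3)."""
    return ["zpool", "replace", pool, failed, replacement]


def import_pool_cmd(pool):
    """argv for taking over a pool that was never cleanly exported (step 4)."""
    return ["zpool", "import", "-f", pool]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe anywhere.
    for cmd in (
        create_mirror_cmd("dbpool", ["da1", "da2", "da3"]),
        replace_disk_cmd("dbpool", "da2", "da4"),
        import_pool_cmd("dbpool"),
    ):
        print(" ".join(cmd))
```

The helpers only build argument lists; on a real database node they could be handed to `subprocess.run` once the iSCSI initiator has attached the LUNs.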

Possible Issues

  • How can degradation of the vdev be detected? By polling every 5 s? Does ZFS offer any notification mechanism?

  • Is it even possible to recreate a new pool from existing LUNs making up a vdev? Any traps?
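The polling idea from the first bullet could look roughly like the sketch below. It assumes the `zpool` binary is on the PATH and relies on `zpool list -H -o name,health`, which prints one line per pool with the name and health state. The 5-second interval is the one floated above, and the function names are my own.

```python
import subprocess
import time


def unhealthy_pools(zpool_list_output):
    """Parse `zpool list -H -o name,health` output.

    Returns a dict {pool_name: state} for every pool that is not ONLINE.
    """
    bad = {}
    for line in zpool_list_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, health = fields
        if health != "ONLINE":
            bad[name] = health
    return bad


def poll(interval=5.0):
    """Check pool health forever, reporting anything that is not ONLINE."""
    while True:
        result = subprocess.run(
            ["zpool", "list", "-H", "-o", "name,health"],
            capture_output=True, text=True, check=True,
        )
        for name, state in unhealthy_pools(result.stdout).items():
            print(f"pool {name} is {state}")
        time.sleep(interval)

# poll()  # uncomment on a host that actually has ZFS pools
```

Keeping the parser separate from the loop means it can be checked against canned output without a live pool. The `print` would be replaced by whatever triggers the disk replacement.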

Best Answer

It's not a direct answer to your question, but a more traditional architecture for this sort of thing would be to use HAST and CARP to take care of the storage redundancy.


A basic outline (see the linked documentation for better details):

Machine A ("Master")

  • Configure the HAST daemon & create an appropriate resource for each pool-member device.
  • Create your ZFS mirrored device as you would on any single system, using the HAST devices.

Machine B ("Slave")

  • Configure the HAST daemon similarly to what you did on Master, but bring it up as a secondary/slave node.
    (HAST will mirror all the data from the Master to the Slave for you)
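As a rough illustration of the HAST resource mentioned for both machines, the sketch below renders a `hast.conf` stanza in the layout the FreeBSD Handbook uses. The host names `hosta` and `hostb`, the addresses, the resource name `disk0` and the device `/dev/ada1` are all placeholders, not values from any real setup.

```python
def hast_resource(name, local_dev, peers):
    """Render one hast.conf resource stanza.

    peers maps each node's hostname to the address of the *other* node,
    because each `on <host>` block names the remote side it replicates to.
    """
    lines = [f"resource {name} {{"]
    for host, remote in peers.items():
        lines += [
            f"\ton {host} {{",
            f"\t\tlocal {local_dev}",
            f"\t\tremote {remote}",
            "\t}",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical pair: hosta is 10.0.0.1, hostb is 10.0.0.2.
    print(hast_resource(
        "disk0",
        "/dev/ada1",
        {"hosta": "10.0.0.2", "hostb": "10.0.0.1"},
    ))
```

With the same file in `/etc/hast.conf` on both machines, the usual sequence is `hastctl create disk0` on each node, then starting `hastd`, then `hastctl role primary disk0` on the Master and `hastctl role secondary disk0` on the Slave. The device then appears as `/dev/hast/disk0` on the Master, where the ZFS mirror is built.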

Both Machines


The big caveat here is that HAST only works on a Master/Slave level, so you need pairs of machines for each LUN/set of LUNs you want to export.

Another thing to be aware of is that your storage architecture won't be as flexible as it would be with the design you proposed:
With HAST you're limited to the number of disks you can put in a pair of machines.
With the iSCSI mesh-like structure you proposed, you can theoretically add more machines exporting more LUNs and grow as much as you'd like (up to the limit of your network).

That tradeoff in flexibility buys you a tested, proven, documented solution that any FreeBSD admin will understand out of the box (or be able to read the handbook and figure out) -- to me it's a worthwhile trade-off :-)