The idea behind optimizing the stripe size is to choose it so that, in your typical workload, most read requests can be fulfilled by (a multiple of) a single read, evenly divided over all data disks. For a three-disk RAID-5 set, the number of data disks is two.
For example, let's assume your typical workload makes I/O read requests that are 128kB on average. You would then want 64kB chunks for a three-disk RAID-5 set. Calculate it like this:
avg request size / number of data disks = chunk size
128kB / 2 = 64kB
This is the chunk size of your RAID set; we haven't touched the filesystem yet.
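As a rough sketch (not taken from your actual setup), this is how such a set could be created with mdadm; the array and device names are placeholders, and --chunk takes its value in kB:

```
# Hypothetical three-disk RAID-5 set with a 64kB chunk size
mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=64 \
    /dev/sda1 /dev/sdb1 /dev/sdc1
```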
The next step is to make sure the filesystem is aligned with the RAID set's characteristics, so we want the filesystem to be aware of the RAID set's chunk size. It can then evenly distribute its superblocks over the three disks.
For this, we need to tell mke2fs what the size of a chunk is or, more precisely, how many filesystem blocks fit in a chunk. This is called the 'stride' of the filesystem:
chunk size / size of filesystem block = stride size
64kB / 4kB = 16
You can then call mke2fs with the -E stride=16 option.
The page mentioned earlier also talks about the -E stripe-width option for mke2fs, but I have never used that myself, nor does the manpage of my version of mke2fs mention it. If we wanted to use it, though, the value would be 32: the stripe width is calculated by multiplying the stride by the number of data disks (two, in your case).
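Putting the two numbers together, a possible mkfs invocation could look like the following; /dev/md0 and the 4kB block size are assumptions, and the stripe_width extended option may be spelled differently (or be missing) depending on your e2fsprogs version:

```
# 4kB blocks, stride = 64kB / 4kB = 16, stripe width = 16 * 2 data disks = 32
mke2fs -t ext4 -b 4096 -E stride=16,stripe_width=32 /dev/md0
```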
Now to the core of this matter: what is the optimal chunk size? As I described above, you need the average size of an I/O read request. You can get this value from the appropriate column in the output of iostat or sar. You will need to do that on a system with a workload comparable to the one you are configuring, over a prolonged period.
Make sure that you know what kind of unit the value uses: sectors, kilobytes, bytes or blocks.
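For example, with the sysstat tools something like the following gives you the average request size; note that in classic iostat output the avgrq-sz column is in 512-byte sectors, while newer sysstat versions report sizes in kB instead, so check your man page:

```
# Extended device statistics, sampled every 60 seconds
iostat -x 60
# avgrq-sz = 256 sectors * 512 bytes = 128kB average request size
```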
I'd really recommend you take a look at ZFS, but to get decent performance, you're going to need to pick up a dedicated device as a ZFS Intent Log (ZIL). Basically this is a small device (a few GB) that can write extremely fast (20-100K IOPS), which lets ZFS immediately confirm that writes have been synced to storage, but wait up to 30 seconds to actually commit the writes to the hard disks in your pool. In the event of a crash/outage, any uncommitted transactions in the ZIL are replayed upon mount. As a result, in addition to a UPS you may want a drive with an internal power supply/super-capacitor so that any pending IOs make it to permanent storage in the event of a power loss. If you opt against a dedicated ZIL device, writes can have high latency, leading to all sorts of problems. Assuming you're not interested in Sun's 18GB write-optimized SSD "Logzilla" at ~$8200, some cheaper alternatives exist (an example of attaching a log device follows the list):
- DDRDrive X1 - 4GB DDR2 + 4GB SLC Flash in a PCIe x1 card designed explicitly for ZIL use. Writes go to RAM; in the event of power loss, it syncs RAM to NAND in <60sec powered by a supercapacitor. (50k-300k IOPS; $2000 Direct, $1500 for .edu)
- Intel X25-E 32GB 2.5inch SSD (SLC, but no super cap, 3300 write IOPS); $390 @ Amazon
- OCZ Vertex 2 Pro 40GB 2.5inch SSD (supercap, but MLC, 20k-50k write IOPS); $435 @ Amazon.
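For reference, attaching a dedicated log device to a pool is a one-liner; the pool name tank and the device names below are assumptions:

```
# Add a single SSD as the dedicated ZIL (SLOG) device
zpool add tank log c1t2d0

# Or mirror two devices so a failed log device can't lose in-flight writes
zpool add tank log mirror c1t2d0 c1t3d0
```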
Once you've got OpenSolaris/Nexenta + ZFS set up, there are quite a few ways to move blocks between your OpenSolaris and ESX boxen; what's right for you heavily depends on your existing infrastructure (L3 switches, Fibre cards) and your priorities (redundancy, latency, speed, cost). But since you don't need specialized licenses to unlock iSCSI/FC/NFS functionality, you can evaluate anything you've got hardware for and pick your favorite (a couple of example commands follow the list below):
- iSCSI Targets (CPU overhead; no TOE support in OpenSolaris)
- Fibre Channel Targets (Fibre Cards ain't cheap)
- NFS (VMWare + NFS can be finicky, limited to 32 mounts)
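As a sketch only, exporting storage from that era of OpenSolaris looks roughly like this; the dataset names are assumptions, and depending on your build the iSCSI side may require a COMSTAR setup (itadm/stmfadm) instead of the legacy shareiscsi property:

```
# NFS: export a filesystem dataset to the ESX hosts
zfs create tank/vmstore
zfs set sharenfs=on tank/vmstore

# iSCSI: carve out a zvol to present as a LUN
zfs create -V 200G tank/esx-lun
zfs set shareiscsi=on tank/esx-lun
```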
If you can't spend $500 for evaluation, test with the ZIL enabled and disabled to see whether the ZIL is a bottleneck. (It probably is.) Don't do this in production. Don't mess with ZFS deduplication just yet unless you also have lots of RAM and an SSD for L2ARC. It's definitely nice once you get it set up, but you should definitely try some NFS tuning before playing with dedup. Once you get it saturating 1-2 Gb links, there are growth opportunities in 8Gb FC, 10GigE and InfiniBand, but each requires a significant investment even for evaluation.
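A rough way to run that comparison; sync=disabled is the newer per-dataset property and zil_disable the older global tunable, so use whichever your release supports, and don't leave either in place afterwards:

```
# Newer builds: turn off synchronous semantics for the test dataset
zfs set sync=disabled tank/vmstore
# ... run the benchmark, then restore ...
zfs set sync=standard tank/vmstore

# Older builds: global tunable in /etc/system (requires a reboot)
# set zfs:zil_disable = 1
```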
Best Answer
The following mount options should be suitable:
Also, I think it always makes sense to use "journal_checksum", but on modern systems it will be used by default.
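As an illustration only (the options shown here are my assumptions, not an exact list), an ext4 entry using journal_checksum on a RAID-backed volume could look like this:

```
# /etc/fstab - hypothetical ext4 mount on the RAID volume
/dev/md0   /srv/data   ext4   defaults,noatime,stripe=32,journal_checksum   0   2
```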