ESX shut down VMs when some but not all storage paths failed

fibre-channel virtual-machines vmware-esx

I've been on hold now for an hour waiting for VMware support and am betting serverfault can beat them to the answer!

I am running ESX 4.0 and 4.1 on 6 HP blades, using Fibre Channel LUN storage. We did some FC network maintenance over the weekend and took down 2 of the 4 paths the ESX hosts have to the storage array (EMC Clariion). When this happened, all 6 ESX hosts shut down all of their VMs.

I saw messages like this in the events:

Path redundancy to storage device naa.600.... degraded. Path vmhba0:.... down. 2 remaining active paths Affected datastores: ....

This was expected. Then, 3 minutes later:

Guest OS shutdown for vm1
(this was initiated by vpxuser)

vm1 is powered off (user "User")

Why would it do this if there were still good paths? I don't see any setting like this anywhere. Thanks!

Best Answer

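As we figured out in the comments, this looked like the HA isolation response, and that is what it turned out to be. When a host stops receiving HA heartbeats on its management network and cannot reach its isolation address, it considers itself isolated and applies the cluster's configured isolation response to its VMs. The degraded storage paths were not what shut the VMs down.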

To add a bit more value to the answer: to avoid this kind of mishap, I recommend giving HA a second network path by configuring an additional service console (ESX) or management port (ESXi) whose path is completely separate from your main network stack (vSwitch, pNICs, physical switch, UPS, power circuit).
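
To illustrate why a second, independent management path helps, here is a minimal sketch of the isolation decision in Python. It is a simplified model, not VMware's implementation or API: the `ManagementPath` class, the function names, the interface names `vswif0`/`vswif1`, and the "shut down" response value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ManagementPath:
    """One management network path of a host (hypothetical model)."""
    name: str
    heartbeats_received: bool          # HA heartbeats from other hosts arrive here
    isolation_address_reachable: bool  # the isolation address answers on this path


def host_is_isolated(paths: List[ManagementPath]) -> bool:
    """The host counts as isolated only when every management path is dead."""
    return not any(
        p.heartbeats_received or p.isolation_address_reachable for p in paths
    )


def isolation_action(paths: List[ManagementPath],
                     isolation_response: str = "shut down") -> str:
    """Return what happens to the VMs: the configured response, or 'none'."""
    return isolation_response if host_is_isolated(paths) else "none"


# Single management path that goes down during maintenance (assumed scenario).
single = [ManagementPath("vswif0", False, False)]

# Same failure, but a second path on separate hardware is still healthy.
redundant = single + [ManagementPath("vswif1", True, True)]

print(isolation_action(single))
print(isolation_action(redundant))
```

With only one path, its failure is enough for the host to declare itself isolated and act on its VMs. With an independent second path, the same failure leaves the host connected, so no isolation response fires.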