It was the backplane. Both drives of the RAID1 and one drive of the RAID5 were inaccessible. Incredibly, the VMware hypervisor continued to run for three days from memory with no access to its host disk, keeping the VMs it managed alive.
At step 3 above, we diagnosed the hardware problem and replaced the RAID controller, cables, and backplane. After restart, we re-initialized the RAID by instructing the controller to query the drives for their configurations. Both arrays were degraded, and both were repaired successfully.
At step 4, it was not necessary to reinstall ESX, although at bootup it refused to register the VMs. We had to dig up some buried management settings to instruct the kernel to resignature. (Search the VMware docs for "resignature.")
I believe our fallback plan would have worked: the VMware Converter images of the VMs that had been running "orphaned" were tested and ran fine with no data loss. I highly recommend taking a VMware Converter image of any VM that gets into this state, after shutting down as many services as possible and getting the VM into as read-only a state as possible. Loading a vmdk, either elsewhere or on the original host, as a repair is usually going to be WAY faster than rebuilding a server from the ground up with backups.
The IC doesn't know whether you've got those disks shared and in use by another ESXi host; it's very common to have a SAN on the backend with multiple hosts accessing the same storage device. In that case there's no way to know which hosts are accessing which machines. The scenario you describe only makes sense if you've got a single host, which is not the typical setup for many of VMware's corporate customers.
Using the RCLI or shell, you could iterate through all registered machines and then compare that list to what's on disk. If you've got disks shared between hosts, however, things become a lot more complicated, and you'd need to iterate through the devices attached to each machine as well.
Update: Right, now it's more of a nuts n bolts scripting/programming question ;)
Starting with the RCLI documentation, I'd probably do something like this: use vmware-cmd -l to list all registered machines on the host. Then use vifs to download the config files, grep through those looking for mentions of virtual disks (.vmdk), and store all of those in a file.
Part two would be writing a script to do a recursive directory listing, again using vifs, then running grep on that to include only .vmdk and .vmx files. Now you've got two lists; pipe them through sort and then diff the results to find out which .vmx files are not registered on the machine and which .vmdk files are not in use by any active VM. And then you have your candidates for deletion :)
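The two-list comparison at the end can be sketched as a small shell script. This is a sketch only: the real inputs would come from vmware-cmd -l plus grep over the downloaded .vmx files (registered disks) and from a recursive vifs listing (disks on the datastore), but those RCLI calls can't run here, so both lists are stubbed with sample data. It also uses comm -13 on the sorted lists, which yields the same set difference the diff step is after.

```shell
#!/bin/sh
# Stub: disks referenced by registered VMs (would come from vmware-cmd + grep).
cat > registered.txt <<'EOF'
[datastore1] vm1/vm1.vmdk
[datastore1] vm2/vm2.vmdk
EOF

# Stub: .vmdk files actually present on the datastore (would come from vifs).
cat > ondisk.txt <<'EOF'
[datastore1] vm1/vm1.vmdk
[datastore1] vm2/vm2.vmdk
[datastore1] old-vm/old-vm.vmdk
EOF

# Sort both lists in place, then print lines unique to the on-disk listing:
# those are the orphaned disks, i.e. the candidates for deletion.
sort registered.txt -o registered.txt
sort ondisk.txt -o ondisk.txt
comm -13 registered.txt ondisk.txt
```

With the stubbed data above, the script prints the one orphaned disk, [datastore1] old-vm/old-vm.vmdk. Swap the stubs for the real RCLI output files and the same comparison applies.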
Best Answer
You may have problems running the image locally; you'll want to use VMware Converter to transfer the image into a format you can use (in VMware Server, etc.).
Otherwise, use vifs in the remote cli: http://www.vmware.com/pdf/vi3_35/esx_3/r35u2/vi3_35_25_u2_rcli.pdf