Unresponsive ESXi 5.5 server

cluster · vmware-esxi · vmware-vsphere

I have a host that is part of a 4 host cluster in HA.

Sometime yesterday I noticed the host stopped responding: in the vSphere console it shows up greyed out as (not responding) and all VMs on it show as (inaccessible). The VMs themselves are still running normally; I can remote desktop to them and everything is up. There are critical servers on this machine.

I have tried right-clicking the host and choosing "Connect", but after a few hours it simply fails. I cannot move the VMs off it; all actions are greyed out. On the host, pressing F2 gives me the login prompt, but after entering my credentials nothing happens. ALT+F1 doesn't let me do anything, as the ESXi Shell is not enabled. SSH is not enabled either. On the ALT+F11 console I can see that hostd has crashed, which is probably the problem. I have called VMware (I have full support), but after a very short call they said there's nothing to do but forcefully shut down the host.

I would rather not do that; I would like to restart hostd, but I can't seem to get any access. I tried PowerCLI, but the connection to the host times out. Connecting the vSphere Client directly to the host also times out. Pinging the host works, so there is at least network connectivity.
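For reference, this is what restarting the management agents on ESXi 5.x normally looks like from the ESXi Shell or over SSH (it won't help in my situation since neither is enabled, but it may help anyone reading this who still has shell access):

```shell
# Restart hostd (the management agent that crashed here):
/etc/init.d/hostd restart

# vpxa (the vCenter agent) is commonly restarted alongside it:
/etc/init.d/vpxa restart

# Or restart all management agents at once:
# services.sh restart
```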

Does anyone know any other way to get a shell?

Thanks.

More info: running ESXi 5.5.0 build 1331820 on a Dell PowerEdge R720 with a Dell PERC H710.

I checked the DRAC and the local volume is healthy. It's actually only a RAID 1; all VMs are on a SAN. The VMware ESXi welcome page works, but if I click "Browse datastores in this host's inventory" it never loads. The MOB also seems to be working properly: hostip/mob/?moid=ServiceInstance&doPath=content
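These are the external checks I ran from another machine ("hostip" is a placeholder for the host's address); the welcome page and MOB are both served through the host's management stack, so their responding at all is what surprised me:

```shell
# Basic reachability:
ping -c 3 hostip

# Welcome page (served via the host's reverse HTTP proxy):
curl -k -I https://hostip/

# Managed Object Browser; prompts for the root password:
curl -k -u root "https://hostip/mob/?moid=ServiceInstance&doPath=content"
```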

On the ALT+F11 console:
2014-09-11T7:15:02.329Z cpu12:57750311)hostd detected to be non-reponsive

The same line appears 11 times, with different timestamps and CPU numbers.

Best Answer

This sounds like a local storage issue to me. I worked in an environment with hundreds of ESXi hosts that ran on local RAID storage. Unfortunately, the local storage controllers in that hardware were unstable: a toxic mix of bad LSI firmware revisions, defective backplanes, and Supermicro hardware.

The behavior you're describing is indicative of a local storage issue. Your running VMs are in RAM and the network stack is unaffected, but the ability to manage the host is compromised. Your login doesn't work because the host can't read from local disk, and the same goes for any other command that requires disk access.

Your best option here is to schedule an orderly shutdown of the VMs (from within the guest operating systems). From there, manually fail the host (power off, reboot, etc.). Let it remain in maintenance mode or outside of the cluster selection, then power your VMs on and allow them to run elsewhere in the vSphere cluster.
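Since hostd is down you can't use the normal vSphere tooling for this, so the shutdowns have to happen inside each guest. A minimal sketch for Linux guests over SSH (the VM hostnames here are placeholders; Windows guests would instead get `shutdown /s /t 0` via RDP or remote PowerShell):

```shell
# Cleanly shut down each guest OS before forcefully failing the host.
# app01/db01/web01 are hypothetical guest hostnames.
for vm in app01 db01 web01; do
    ssh root@"$vm" 'shutdown -h now'
done
```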

If you're interested in debugging the host's issues further, check the Dell DRAC for information about the storage array's status and controller logs. That will point you in the right direction.