ESXi host losing connection to vCenter

Tags: hardware, networking, vmware-esxi, vmware-vcenter

I'm having a very odd issue with a single ESXi host.

I have 2 identical hosts: Core i3, 6 NICs, 16 GB RAM. 4 of the NICs are used for management, vMotion, and the VM network, all on different VLANs. They all go to an HP ProCurve 24-port gigabit switch in a static trunk. The other two NICs are for iSCSI.

There are 2 standard vSwitches (VSS): one with the 4 NICs, and a second with just the 2 NICs carrying iSCSI traffic.

Configuration on both hosts is identical, and the hardware is identical. Both hosts are running at about 30% utilization for both CPU and memory. They are running ESXi 5.1.

What is happening is that all of a sudden host 2 will drop out of vCenter (vCenter is hosted on a physical machine). No error, it just loses connection.

If I try to ping the host from vCenter I cannot. If I try to ping from my workstation I can most of the time, and I can SSH into it. If I "test management network" from the DCUI, it can ping the gateway and the DNS servers. If I restart the management network I still cannot get to it from vCenter.

If I run services.sh restart, it all completes with no error but doesn't help; the host is still not able to register with vCenter or be pinged by vCenter.

The only thing so far that remedies this is to completely restart the host. I did a log export but I'm not really sure what to look for at this point. What logs should I be looking at? The only other piece of information I can add is that this seems to happen at the same time of day, early in the morning. There is nothing running at this time: no backup jobs, etc.

Best Answer

Whenever I see these issues on whitebox hardware, I check the drivers (and firmware) of the critical components involved (NIC, storage) and then suggest updating to the newest revision of the ESXi distribution using the VMware Patch Portal or Update Manager.

Lab or no lab, you're running an old build: ESXi build 1065491 versus the current ESXi build 1483097.

Go ahead and run the updates as a first step. See the related question "Are VMware ESXi 5 patches cumulative?"

Following that, I would dig into the affected host's logs to see what's happening near the vCenter disconnection time. Check /var/log/hostd.log and /var/log/vmkernel.log.
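Since the drops reportedly happen at about the same time each morning, it can help to cut the exported logs down to a window around that time. Here is a minimal Python sketch for doing that on a workstation. It assumes each log entry starts with an ISO 8601 timestamp (e.g. 2014-01-15T03:05:00.123Z), which is how these logs usually look on 5.x; the function name and the 10-minute default are just illustrative choices.

```python
import re
import sys
from datetime import datetime, timedelta

# Assumption: entries begin with an ISO 8601 timestamp such as
# 2014-01-15T03:05:00.123Z. Lines without one (continuations) are skipped.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def lines_near(lines, center, minutes=10):
    """Yield lines whose leading timestamp is within +/- `minutes` of `center`."""
    window = timedelta(minutes=minutes)
    for line in lines:
        match = TS.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if abs(stamp - center) <= window:
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python lognear.py hostd.log 2014-01-15T03:00:00
    center = datetime.strptime(sys.argv[2], "%Y-%m-%dT%H:%M:%S")
    with open(sys.argv[1], errors="replace") as handle:
        for hit in lines_near(handle, center):
            print(hit)
```

Run it against both hostd.log and vmkernel.log with the same center time and compare what each shows right before the disconnect. Keep in mind the log timestamps are normally UTC, so convert your local "early morning" time first.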

If you're certain that there aren't any firewalling, DNS or other networking issues, this is your best bet to understand what's happening.
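To rule out the basic networking side from the vCenter machine itself, a quick check of name resolution and TCP reachability can be scripted. This is only a rough sketch: the hostname esxi02.example.local is a placeholder, and ports 443 and 902 are assumed here because they are the ones commonly documented for vCenter-to-host traffic. Adjust both for your environment.

```python
import socket


def resolves(name):
    """Return True if `name` resolves to at least one address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    host = "esxi02.example.local"  # placeholder name, not from the question
    print("DNS resolves:", resolves(host))
    for port in (443, 902):  # assumed management ports
        print("TCP", port, "reachable:", tcp_reachable(host, port))
```

If this succeeds from a workstation but fails from the vCenter server while the host is in the disconnected state, that points at the path between those two machines (VLAN, trunk, or which uplink the management traffic is landing on) rather than at the host's services.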

If all else fails, this is ESXi, and you have shared storage. Spending time troubleshooting a build like this isn't always useful, especially if the other host is performing well. Copy your settings off via PowerCLI, rebuild and restore the host.
