ESXi Server Health Monitoring

monitoring, vmware-esxi, vmware-vsphere

As VMware has stated, now is the time! I have started reading up on and planning our upgrade from vSphere ESX 4.0 to vSphere ESXi 4.1. While I know vSphere 5 should be out sometime this fall, I am pretty sure this initial planning will apply to that version as well. One of my major concerns is being able to effectively monitor the health status of our hosts. My question has two parts: 1) Should my current setup still work? 2) What are some other suggestions?

My current setup to monitor the health status of our servers, and alert on failures, is a combination of iDRAC6 alerting and WUG (WhatsUp Gold) catching SNMP traps. The iDRAC6 can send email through the SMTP server if something physical on the server degrades or fails, except for storage events. The servers are also configured to send SNMP traps to WUG, which does monitor storage events and acts as a secondary notification for other events. To set this up I edit the snmpd.conf files via the service console, which of course is going away. It looks like the new method, if I continue with this approach, is detailed in this VMware KB. Is anyone using SNMP traps to monitor their hardware, and has anyone done the setup described there?
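As far as I can tell, the remote replacement for editing snmpd.conf is the vSphere CLI's `vicfg-snmp` command. Below is a rough Python sketch that only builds the command lines I would expect to run per host. It assumes the `-c` (community), `-t` (trap target), `-E` (enable) and `-T` (test trap) options behave as I understand them; the host names, the community string and the `snmp_commands` helper itself are placeholders of mine, not anything from the KB.

```python
def snmp_commands(host, trap_target, community="public", port=162):
    """Return the vicfg-snmp invocations (as argv lists) for one ESXi host.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    base = ["vicfg-snmp", "--server", host]
    # Trap targets are written as target@port/community.
    target = f"{trap_target}@{port}/{community}"
    return [
        base + ["-c", community],  # set the community string
        base + ["-t", target],     # add the trap destination
        base + ["-E"],             # enable the SNMP agent
        base + ["-T"],             # send a test trap to the target
    ]


if __name__ == "__main__":
    # Placeholder host and trap receiver names.
    for host in ["esxi01.example.com"]:
        for argv in snmp_commands(host, "wug.example.com"):
            print(" ".join(argv))
```

Printing the commands first, rather than executing them, makes it easy to review them before touching a production host; each command would still prompt for credentials when run.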

The second part of my question is: could there be a better way to monitor the health status of my hosts? I know there are other methods, so, without being argumentative, which ones might work better? I have been looking at CIM, but I am not sure what sits on the other end and interprets what CIM reports as wrong. What methods is everyone else using to get this data?

Best Answer

I use the data coming out of the (i)DRAC, combined with the data that ESXi harvests via CIM, with vCenter configured to alert on faults coming out of the CIM monitoring.

I'm a little unclear on what you're saying about the trustworthiness of the CIM data, but I personally trust it a heck of a lot more than I would trust the SNMP traps being fed to WhatsUp. CIM will catch and throw alerts on something as minor as low voltage on the BIOS battery, as long as your hardware is well supported (as recent Dell equipment is), and vCenter is pretty flexible about choosing what, where, and how often you throw alerts on those events.
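To give a feel for what "alerting on CIM faults" boils down to, here is a minimal Python sketch of the filtering step. It is not how vCenter implements its alarms. The `Sensor` class and the sample readings are made-up stand-ins for the hardware health data; the only thing borrowed from vSphere is the green/yellow/red/unknown health states it reports for sensors.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    """Hypothetical stand-in for one hardware health sensor reading."""
    name: str
    sensor_type: str
    health: str  # "green", "yellow", "red", or "unknown"


# Higher number means more urgent; unrecognized states count as "unknown".
SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


def unhealthy(sensors, threshold="yellow"):
    """Return sensors at or above the threshold, most severe first."""
    floor = SEVERITY[threshold]
    bad = [s for s in sensors if SEVERITY.get(s.health, 1) >= floor]
    return sorted(bad, key=lambda s: SEVERITY.get(s.health, 1), reverse=True)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    readings = [
        Sensor("System Board Ambient Temp", "temperature", "green"),
        Sensor("BIOS Battery", "voltage", "yellow"),
        Sensor("Power Supply 2", "power", "red"),
    ]
    for s in unhealthy(readings):
        print(f"{s.health.upper():7} {s.sensor_type:12} {s.name}")
```

With the default threshold, a low BIOS battery (yellow) still gets reported, which is the kind of minor fault I mentioned above; raising the threshold to "red" would limit alerts to outright failures.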
