How to handle a (VMware ESXi) server crash

vmware-esxi

I have a dedicated server (Core 2 Duo E4600, 2GB DDR2, LSI RAID 1 with 250GB SATA storage) running VMware ESXi 3i (3.5.0) and 3 VMs (1x Ubuntu 9.04, 1x Ubuntu 9.10, 1x Windows 2003 Web Edition).

This afternoon it suddenly stopped responding. VMware Infrastructure Client couldn't connect, Remote Desktop couldn't connect, SSH couldn't connect. I tried from different internet connections, without success. After a few minutes I decided to do a remote power cycle, and that got everything up and running again.

Now I'm wondering: What is the right way to analyze or debug this kind of server crash?

The ESXi event log started from scratch after the reboot, so there is nothing there. The virtual machines (Linux syslog, Windows event logs) don't report anything unusual, and the machine is only under moderate load overall.

What are the places to look? Can I enable more logging somewhere so I can investigate possible future crashes?

Best Answer

When rebooting after a crash, ESX usually creates a vmkernel-zdump file in the /root home directory. This is a compressed file that contains a core image and a chunk of the /var/log/vmkernel log file. The first thing to do is extract the log file from this dump file:

[root] vmkdump -l vmkernel-zdump-101409.14.18.1
created file vmkernel-log.1

and look at the last few lines to see if you can get any hints from the last log entries or the stack trace.
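
As a rough sketch of that last step, the helper below prints the tail of the extracted log and then lists any of those lines that match a few failure-related keywords. The function name show_crash_hints, the default of 50 lines, and the keyword list are my own assumptions for illustration, not anything VMware prescribes; adjust them to whatever your log actually contains. It only uses tail and grep, so it can also be run on another machine after copying vmkernel-log.1 off the host.

```shell
# Hypothetical helper: show the end of an extracted vmkernel log and
# point out lines that look crash-related.
# $1 = log file (defaults to the name from the vmkdump output above)
# $2 = number of trailing lines to inspect (default 50, an arbitrary choice)
show_crash_hints() {
    log="${1:-vmkernel-log.1}"
    lines="${2:-50}"

    # Bail out early if the log file is missing or unreadable.
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }

    # The last entries written before the crash.
    tail -n "$lines" "$log"

    echo "--- lines matching failure keywords ---"
    # The keyword list is an assumption; grep returns non-zero when nothing
    # matches, so fall back to a "(none)" marker instead of failing.
    tail -n "$lines" "$log" \
        | grep -n -i -E 'panic|exception|backtrace|nmi|machine check' \
        || echo "(none)"
}
```

Calling show_crash_hints vmkernel-log.1 100 would inspect the last 100 lines. The numbers printed in the second part are positions within that tail, not within the whole file.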
