ESXi server reverted to installation defaults on reboot! How did this happen?

vmware-esxi

Our shop is primarily ESX 4.1, which I and others are very familiar with, but we also have a few test ESXi 4.1 servers running the free version of ESXi, originally installed using the 60-day evaluation version, but now using a "free ESXi" license key from our VMware management account.

All of these servers are Dell R610 with 32GB RAM, single X5450 CPU, and RAID1 136GB local disk. ESXi is installed to the local disk, with the remainder configured as a VMFS volume. No shared storage is being used.

As of Friday at 18:00, all of the servers were running properly.

As of Saturday at 15:30, one of the servers appeared to have been re-installed.

Two of these servers are located in our office, where the weekend admin staff performed a power-off test this Saturday. This test consisted of literally throwing the breaker for the entire building. None of the servers in question are attached to a UPS, though they do have write cache and batteries on their RAID controllers.

When the machines booted after the test, one of them lost the free license key (reverting to an expired evaluation license), and the other reverted to initial installation settings (DHCP, no password, empty inventory), and the evaluation license had reset, giving another 60-day evaluation period from Saturday at 15:30.

The first of these servers was fixed by simply re-entering the free license key through VIclient. All inventory and settings were in the same state they were in on Friday.

The second of these servers was in exactly the state expected after a new installation or re-install; namely, all settings had reverted to defaults, and no log or configuration files dated prior to the power-off test exist. Logging in through the unsupported service console shows that even the folders in the root directory were dated from after the power-off test.

However, the VMFS volume contents were intact, just as if somebody had performed a "repair" installation from CD-ROM.

This server was repaired by following our standard checklist for repair installations: configuring the network, adjusting server settings, and re-adding machines to inventory from the datastore browser.

Question: is there anything besides a manual repair install that will reset an ESXi server to its original installation defaults, and set all service console folders, configuration files, and log files to the date and time that the server booted?

Yes, I am aware that in a diskless installation, this is pretty much what happens on every boot; however, this is not a diskless installation, but rather is installed and boots from the local disk.

However, I am not familiar enough with ESXi to know if this is also normal for an installation on disk.

Tests: Since both servers are configured identically, we used the first server to try to find out what happened to the second.

  1. I did another power-off test of that server only, to see if it also reverted to defaults when it booted. It did not; it retained all settings and booted normally, twice. (Unfortunately, we did not check to see if the folder, configuration, and log files were reset to the boot time.)

  2. I did a repair install to verify that the configuration and log file dates would all be updated to the time of the re-install, and that all configurations would be reset to defaults as happened with the second server. It did; after a repair install the first machine was in exactly the same state as the second had been, and the evaluation license was similarly reset to a new 60-day period.
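The date check that was skipped in test 1 could be scripted rather than done by eye. The following is only a minimal sketch in Python: the function name `entries_older_than` is hypothetical, the cutoff is assumed to be the boot time expressed as a POSIX timestamp, and it assumes a Python interpreter is available wherever the directory tree can be read.

```python
import os


def entries_older_than(root, cutoff):
    """Return paths under root whose modification time is before cutoff.

    cutoff is a POSIX timestamp (seconds since the epoch), for example
    the time at which the host last booted. The root directory itself
    is not included in the result.
    """
    older = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are dated themselves, not their targets
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if mtime < cutoff:
                older.append(path)
    return sorted(older)
```

An empty result for a configuration or log directory would match what the second server showed after the power-off test; any entries that predate the boot would show that at least some files survived it.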

Follow-up question: Assuming this happened without user intervention, why did it happen and how can we prevent it from happening again?

Real question: should I believe the weekend admins who said they didn't do anything to the machine? None of them are certified on our VMware systems, but they know enough to be dangerous if they got it into their heads to try to fix a problem.

Please tell me that I'm wrong in thinking the weekend admins are hiding something they did wrong, and that this can happen for reasons other than manual intervention.

Best Answer

When dealing with people who do not know what they are doing, it is helpful to remember Hanlon's Razor: "Never attribute to malice that which is adequately explained by stupidity."

If you have set up an unattended PXE-based install or some other automatic installation facility for ESXi (e.g. with an inserted CDROM or USB stick), your power cycle might have triggered this for some reason.
