VM stuck at boot – delete CTK file even if snapshots are present

vmware-esxi vmware-vcenter vmware-vsphere

I am tasked with recovering a VMware 6.5 cluster that, after an unexpected power failure, has a VM (the most important one…) stuck at boot.

From the vmware.log file, the problem seems to be related to a corrupted CTK file and, as I read in this VMware KB, it should be sufficient to remove the affected CTK file (OK, not quite that simple, but simple enough…).

However, the affected VM has some active snapshots and, as I read in another (older) KB, this procedure should not be attempted if snapshots are present.

What is the right procedure to unstick the VM and let the boot process complete?

Best Answer

In this case, the solution was the simplest, yet strangest, possible: simply wait overnight. After some hours, both VMs became "unstuck" and booted correctly.

Regarding the change-tracking file (CTK) question, I simulated the problem on a spare VMware hypervisor and, after reading VMware's own documentation (quite light on details...), I think the key point is that you can delete the CTK files even if the virtual machine has active snapshots, but doing so can corrupt any subsequent CTK-aware backups. So, in such cases, you also need to disable CTK at both the VM and disk level, consolidate any snapshots, take a full backup, re-enable CTK (again, at both the VM and disk level) and re-enable incremental backups.
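To make the "VM and disk level" part concrete, here is a minimal Python sketch that flips the change-tracking flags in the text of a .vmx file. It assumes the usual `ctkEnabled` (VM level) and `scsiX:Y.ctkEnabled` (disk level) keys; the `set_ctk` helper and the sample contents are made up for illustration. I would only apply something like this with the VM powered off and a copy of the original .vmx kept aside.

```python
import re

# Matches the VM-level key (ctkEnabled) and per-disk keys such as
# scsi0:0.ctkEnabled. Key names are assumed from typical .vmx files.
_CTK_LINE = re.compile(
    r'^\s*((?:[A-Za-z]+\d+:\d+\.)?ctkEnabled)\s*=\s*"([^"]*)"\s*$',
    re.IGNORECASE,
)


def set_ctk(vmx_text: str, enabled: bool) -> str:
    """Return a copy of vmx_text with every ctkEnabled entry set to TRUE/FALSE."""
    value = "TRUE" if enabled else "FALSE"
    out = []
    for line in vmx_text.splitlines():
        m = _CTK_LINE.match(line)
        if m:
            # Rewrite only the value; the key (and its disk prefix) is kept.
            out.append(f'{m.group(1)} = "{value}"')
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical .vmx fragment, for illustration only.
sample = (
    'displayName = "demo"\n'
    'ctkEnabled = "TRUE"\n'
    'scsi0:0.fileName = "demo-000001.vmdk"\n'
    'scsi0:0.ctkEnabled = "TRUE"\n'
)

print(set_ctk(sample, enabled=False))
```

The same helper with `enabled=True` covers the re-enable step after the snapshots are consolidated and the full backup is done.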

Disabling CTK seems to affect the last CTK file only (note: a CTK file exists for each VMDK flat and delta file, so each snapshot creates a new CTK file), and this seems to be the reason VMware recommends having no snapshots when enabling or disabling changed block tracking (CBT). From here:

Note: Ensure that there are no snapshots on the virtual machine before enabling change tracking. If you create snapshots before enabling CBT, the QueryChangedDiskAreas API might not return any error or the data returned by QueryChangedDiskAreas might be incorrect.
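To see the "one CTK file per flat/delta file" layout mentioned above, a small Python sketch can pair each VMDK descriptor in a VM directory listing with its CTK companion. It assumes the commonly seen `<disk>-ctk.vmdk` naming, and the `ctk_files_by_disk` helper and file names are hypothetical; check against your own datastore before relying on it.

```python
import re


def ctk_files_by_disk(filenames):
    """Map each VMDK descriptor name to its CTK companion (or None if absent)."""
    names = set(filenames)
    result = {}
    for name in sorted(names):
        if not name.endswith(".vmdk"):
            continue
        # Skip data extents and the CTK files themselves; keep descriptors only.
        if re.search(r"-(flat|delta|sesparse|ctk)\.vmdk$", name):
            continue
        ctk = name[: -len(".vmdk")] + "-ctk.vmdk"
        result[name] = ctk if ctk in names else None
    return result


# Hypothetical listing: one base disk plus one snapshot delta.
listing = [
    "demo.vmx",
    "demo.vmdk",
    "demo-flat.vmdk",
    "demo-ctk.vmdk",
    "demo-000001.vmdk",
    "demo-000001-delta.vmdk",
    "demo-000001-ctk.vmdk",
]

for disk, ctk in ctk_files_by_disk(listing).items():
    print(disk, "->", ctk)
```

A disk that maps to `None` has no CTK file next to it, which is what I would expect to see after CTK has been disabled and the files removed.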