Fixing INACCESSIBLE_BOOT_DEVICE on Hyper-V 2012 R2 Virtual Machines

Tags: boot, bsod, hyper-v, hyper-v-server-2012-r2, windows-server-2012-r2

I have a Hyper-V 2012 R2 cluster: four Dell PowerEdge R620 servers connected to a Dell PowerVault MD3600F storage array via Fibre Channel. It's all pretty straightforward: all servers run WS2012R2, the cluster was freshly built a couple of months ago, all drivers and firmware are up to date, and Windows is patched to the latest available updates (even those released two days ago). There is also an SCVMM 2012 R2 server managing the whole thing, but that doesn't seem to matter for the problem at hand.

Several VMs run on this cluster; some are generation 1 VMs running Windows Server 2008 R2, while most are generation 2 VMs running Windows Server 2012 R2. The latter also include the latest available updates: they were deployed from a template which was built soon after the cluster and is periodically updated when Microsoft releases new patches.

Everything works pretty well, but sometimes, with no discernible reason or cause, a VM will fail to boot, crashing with the dreaded INACCESSIBLE_BOOT_DEVICE error code. This only happens upon booting (or rebooting); no VM has ever crashed while running.

Whenever this happens, there is no way to make the faulty VM boot again. It first happened two weeks ago with a VM which was not running any production workload yet (it was freshly deployed); we were in quite a hurry to get it working, so we simply scrapped it and deployed a new one, but the root cause of the problem was never found.

Then it happened again two days ago, when we rebooted several VMs after patching them: three of them didn't come back up, while the others booted without any problem.

The faulty VMs are unable to boot even in safe mode. However, when booting into the Windows Recovery Environment (from the system itself, i.e. from the local virtual disk, not from a Windows DVD, which means the virtual disk can indeed be accessed), everything seems to be OK: the boot manager correctly lists the system to be booted (the output of bcdedit /enum all /v is actually identical to that of a working VM), all volumes are accessible, and even chkdsk shows no errors at all. The only anomaly is that bootrec /scanos and bootrec /rebuildbcd say they are unable to find any Windows installation, although the C: volume is there and perfectly readable.
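For reference, this is roughly the set of checks I'm running from the recovery console on a faulty VM (a sketch; the drive letter is just how the system volume shows up in my case, and the comments are my annotations):

    # Dump the full BCD store; on the broken VMs this matches a healthy one
    bcdedit /enum all /v

    # File system check on the offline system volume; reports no errors
    chkdsk C: /f

    # Both of these claim no Windows installation can be found,
    # even though C:\Windows is present and perfectly readable
    bootrec /scanos
    bootrec /rebuildbcd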

This only happened (at least so far) with WS2012R2 generation 2 VMs, thus I'm assuming it's caused by some problem in the EFI emulation and/or the EFI bootloader; however, this is only an assumption on my part.

The reason I mention updates is that I'm aware this has happened before, with KB2919355 being responsible. Also, Microsoft recently released another mega-update, KB3000850, which was applied to the hosts, the virtual machines, and the WS2012R2 template.
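For what it's worth, a quick way to check whether those rollups are actually present on a given host or guest (a minimal PowerShell check, using the KB numbers mentioned above):

    # List the update rollups in question, if installed on this machine
    Get-HotFix -Id KB2919355, KB3000850 -ErrorAction SilentlyContinue |
        Select-Object HotFixID, InstalledOn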

(Coincidentally, the day after this update was released, Microsoft experienced a worldwide crash of the whole Azure cloud platform, which bears a striking resemblance to what's happening to our cluster; but I'm just throwing guesses around here).

I've already opened a support case with Microsoft, but I'm also posting here in case someone can help; of course, if Microsoft provides a solution, I'll post it as soon as the VMs are back online.

Best Answer

We escalated the problem to Microsoft Premier Support and got a kernel debugging specialist working on it. He discovered that something had uninstalled all the Hyper-V drivers from the guest VMs, rendering them completely unable to boot. He managed to get one of them to boot by manually injecting the drivers into the VM's file system and Registry, and we were able to recover some critical data (it was a Certification Authority); however, that VM was now in a completely unsupported state, so we decided to rebuild it. We also rebuilt all the other VMs, which had no critical data on them.
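I don't have the engineer's exact procedure, but the general shape of the repair was something like the following sketch, run from a host with the VM shut down. The VHDX path is just an example, the vmbus/storvsc service names are the usual boot-critical Hyper-V guest services, and the actual fix also involved copying the driver files themselves back into the guest, which isn't shown here:

    # Mount the faulty VM's system disk on a host (the VM must be off);
    # path is an example, suppose the guest's Windows volume appears as G:
    Mount-VHD -Path 'C:\ClusterStorage\Volume1\FaultyVM\Disk0.vhdx'

    # Load the guest's offline SYSTEM hive to inspect the driver services
    reg load HKLM\GuestSystem G:\Windows\System32\config\SYSTEM

    # The boot-critical Hyper-V services (e.g. vmbus, storvsc) must exist
    # and be boot-start (Start = 0); on our broken VMs they had been removed
    reg query HKLM\GuestSystem\ControlSet001\Services\vmbus /v Start
    reg query HKLM\GuestSystem\ControlSet001\Services\storvsc /v Start

    # ...recreate the missing service keys and copy the .sys files back...

    reg unload HKLM\GuestSystem
    Dismount-VHD -Path 'C:\ClusterStorage\Volume1\FaultyVM\Disk0.vhdx'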

As for what actually caused the driver uninstallation, the case is still open and the cause has not been found yet. The problem was latent in the template we used, because sooner or later it affected every VM that had been deployed from that template. We built another template, which didn't show the same issue, so we are running fine now... but we still don't know what caused the problem in the first place.


Update:

After a while, we FINALLY found out what happened (I just forgot to update this answer before).

It looks like someone or something forcibly updated the Hyper-V Integration Services in the base template, which already had them, being based on the exact same OS release as the hosts. This created a latent issue in the guest system: those drivers were marked as duplicate and/or superseded, and thus in need of removal; but the removal would only trigger after a variable time interval, when Windows ran some periodic automated cleanup process. This eventually led to the complete uninstallation of all Hyper-V drivers on each VM instantiated from that template, rendering it completely unable to boot.
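If you want to spot this kind of mismatch before it bites, the host-side Hyper-V cmdlets expose what each guest reports (a simple check; the VM name below is just a placeholder):

    # From a Hyper-V host: the Integration Services version and state each guest reports
    Get-VM | Select-Object Name, Generation, IntegrationServicesVersion, IntegrationServicesState

    # Per-component detail for a single guest (placeholder name)
    Get-VMIntegrationService -VMName 'SomeGen2VM'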

As for who or what performed this update (which can't be done by inserting the Integration Services setup disk and running its setup, because the installer correctly detects the drivers are already installed and exits), we still have no clue. Either someone who should have known better did it manually using PowerShell or DISM, or SCVMM was the culprit.
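For completeness, forcing the Integration Services onto an offline image that already has current drivers would look roughly like this (an illustration only: the paths are examples and the cab name should be verified against the support\amd64 folder of the host's vmguest.iso, since we never confirmed this is exactly what was run):

    # Mount the template's virtual disk offline
    dism /Mount-Image /ImageFile:C:\Templates\WS2012R2.vhdx /Index:1 /MountDir:C:\Mount

    # Force-inject the Integration Services package from the host's vmguest.iso
    # (cab name varies by release; check the ISO's support\amd64 folder)
    dism /Image:C:\Mount /Add-Package /PackagePath:D:\support\amd64\Windows6.x-HyperVIntegrationServices-x64.cab

    dism /Unmount-Image /MountDir:C:\Mount /Commit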