CRITICAL STRUCTURE CORRUPTION on Windows Server 2012 R2

server-crashes windows-server-2012-r2 xen

I have a Windows Server 2012 R2 virtual machine with all updates installed. Additional software includes Microsoft SQL Server 2014 (it was 2012 on a previous VM). The web hosting company includes xenpci.sys (EJBPV XenPCI Driver (Checked Build), James Harper) and Plesk as part of their default installation on all VMs.

Periodically, the OS either hangs, blue screens, or reboots. I do get mini dumps, though not all the time. The usual problem is:

Error: CRITICAL_STRUCTURE_CORRUPTION

The specific top level file, obviously not the cause, varies: win32k.sys, ntoskrnl.exe, xenpci.sys (the Xen driver, though only showed up a couple of times), and ndis.sys.

The OSR (Open System Resources) analyzer was not of much help. The WhoCrashed analyzer was a bit more helpful.

It stated:

17 crash dumps have been found and analyzed. Only 10 are included in this report. A third party driver has been identified to be causing system crashes on your computer. It is strongly suggested that you check for updates for these drivers on their company websites. Click on the links below to search with Google for updates for these drivers:

xenpci.sys (EJBPV XenPCI Driver (Checked Build), James Harper)

I tried to push the web hosting company to research the topic, but they came up empty-handed. I am not convinced that the Xen drivers are at fault. I presume WhoCrashed picked up on xenpci.sys merely because it was the last driver on the stack a couple of times and it is third-party, which makes it look guilty. I did not write WhoCrashed, so it is hard to comment further.

My question is how to troubleshoot the problem.

The web hosting company has already given me two new virtual machines over the last couple of years. The problem follows me to each new VM. I installed SQL Server myself, but the OS and Plesk came by default, as did the mail server software. The web hosting company also told me that no other clients have made similar complaints. They ran disk tests multiple times, and disk health is good.

I did not check the registry's health, but the problem goes across installations and happens pretty routinely, so I would have to discount that. I am on my third or fourth VM now.

Again, I mention Xen because WhoCrashed mentioned it, but I am not convinced that it is the cause, and other clients use the same driver. The system has adequate memory and storage, so that is not a problem.

UPDATE:
Here are some answers from the web hosting company to my query.

In usual scenario, performance of the VM will get degraded once you uninstall the drivers.
There might be some synchronization issues with the Hardware Node.

Am I using a checked or release build?

You are using a test-signed build, the same ones from developer's site.

How can I tell? The Xen PCI properties dialog in Device Manager did not say one way or the other.
Is the entry in Device Manager the sole location? I checked in Programs and Features and saw nothing listed.

You can check the version under Add or Remove programs.
Refer to the snapshot attached.

How and where can I find the latest version on their site?

Developer's site is not working – http://www.meadowcourt.org/downloads/
You can download the latest signed releases from here –
http://wiki.univention.de/index.php?title=Installing-signed-GPLPV-drivers

How can I tell which Xen version 0.11.0.373 belongs to (Xen 4.6? 3.0? x.y?)

We are using Xen 3.4.4, you can't see it from your VM. It can only be viewed from hardware node.

Update 2:
The hosting company installed two James Harper software packages:

GPL PV Drivers for Windows
EJB PV Drivers for Windows

Best Answer

xenpci.sys (EJBPV XenPCI Driver (Checked Build), James Harper)

(Checked Build) is a huge red flag. You absolutely should not be using "checked" builds of anything in production. If your hosting company loaded this driver for you, then they absolutely made a mistake.

Checked builds include extraneous symbols and extra error checking that aid developers. They are not production builds.
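If you want to check the binary yourself, one rough approach is to read the version resource embedded in the .sys file and see whether its debug flag is set. The sketch below is only a heuristic: it scans the raw file for the `VS_FIXEDFILEINFO` signature rather than walking the resource table properly, and the function names are my own. Keep in mind that the flag is set at the driver author's discretion, so a clear flag does not prove a release build; the "(Checked Build)" text in the file description is the stronger hint here.

```python
import struct

VS_FFI_SIGNATURE = b"\xbd\x04\xef\xfe"  # 0xFEEF04BD stored little-endian
VS_FF_DEBUG = 0x00000001                # "file contains debugging information"


def read_fixed_file_info(data: bytes):
    """Return (file_version, is_debug) from the first VS_FIXEDFILEINFO
    signature found in data, or None if no signature is present.

    Heuristic only: a proper tool would locate the version resource
    through the PE resource directory instead of a byte search.
    """
    pos = data.find(VS_FFI_SIGNATURE)
    if pos < 0 or pos + 32 > len(data):
        return None
    # First eight DWORDs: signature, struct version, file version (MS, LS),
    # product version (MS, LS), flags mask, flags.
    _sig, _struc, ver_ms, ver_ls, _pms, _pls, mask, flags = struct.unpack_from(
        "<8I", data, pos
    )
    version = f"{ver_ms >> 16}.{ver_ms & 0xFFFF}.{ver_ls >> 16}.{ver_ls & 0xFFFF}"
    is_debug = bool(flags & mask & VS_FF_DEBUG)
    return version, is_debug


def check_driver(path):
    """Read a driver file from disk and inspect its version resource."""
    with open(path, "rb") as f:
        return read_fixed_file_info(f.read())
```

You would call it as `check_driver(r"C:\Windows\System32\drivers\xenpci.sys")`; that path is the usual driver location, but it is an assumption about this particular VM.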

To elaborate: whatever error is stopping the machine probably still occurs in the release (non-checked) build of the driver, but there it likely causes only a non-fatal side effect such as a memory leak. In the checked build, the more stringent error checking stops the entire OS. That's the point of checked builds: to accentuate errors and shove them in the developers' faces before they ship the code to customers.

It also doesn't really matter if other VMs have that same exact driver loaded (the checked build) and don't seem to be crashing. Some component specific to your VM is invoking a behavior or state that triggers the bug in that driver. Drivers and applications interact in all sorts of ways. Maybe two machines have the same buggy driver loaded, but only one of them has SQL installed, and because it has SQL installed it locks memory pages in a way the other server doesn't, which causes the third-party driver bug to rear its ugly head. (Just an example.)

There's really nowhere else to place the blame here. You cannot run checked builds of drivers in production and expect to have a good time. They're only for development and testing purposes.

Lastly, the only other place to go from here would be to collect a full memory dump and run it through WinDbg. You can spend six hours of intense debugging, unwinding stacks, tracing threads, following IRPs to their completion ports... or you can just get rid of that checked build driver. :)
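If you do go the debugger route, a first pass over a dump can be scripted rather than done by hand. The sketch below assumes the Debugging Tools for Windows are installed and that `kd.exe` is on the PATH; the function names and the choice of debugger commands are mine. It opens a dump with `-z`, runs `!analyze -v` and `lmvm xenpci` (to show that module's version details), then quits.

```python
import subprocess


def kd_analyze_command(dump_path, kd_exe="kd.exe"):
    """Build a kd command line that analyzes one crash dump and exits.

    kd_exe is an assumption: adjust it to wherever the Debugging Tools
    for Windows are installed on your machine.
    """
    debugger_commands = "!analyze -v; lmvm xenpci; q"
    return [kd_exe, "-z", dump_path, "-c", debugger_commands]


def run_analysis(dump_path, kd_exe="kd.exe"):
    """Run the analysis and return the debugger's text output."""
    result = subprocess.run(
        kd_analyze_command(dump_path, kd_exe),
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Without a symbol path configured the output will be of limited use, so treat this as a starting point, not a full analysis.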

You might also try running the driver through Driver Verifier. In a test environment. Where checked builds should stay. ;)
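For the Driver Verifier route, the built-in `verifier` command-line tool can be driven from a script. A minimal sketch, assuming an elevated prompt on a test machine and using helper names I made up; the new settings only take effect after a reboot, and `verifier /reset` followed by another reboot turns them off again.

```python
import subprocess


def verifier_commands(driver="xenpci.sys"):
    """Return the verifier.exe invocations for one driver.

    The default driver name comes from the crash report above.
    """
    return {
        # Enable the standard set of checks for this driver only.
        "enable": ["verifier", "/standard", "/driver", driver],
        # Show which settings are currently configured.
        "query": ["verifier", "/querysettings"],
        # Clear all Driver Verifier settings.
        "reset": ["verifier", "/reset"],
    }


def run_verifier(action, driver="xenpci.sys"):
    """Run one of the actions above; requires administrator rights."""
    return subprocess.run(verifier_commands(driver)[action], check=False)
```

For example, `run_verifier("enable")` followed by a reboot puts xenpci.sys under verification on the test VM.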
