Disk IO cutting out on Windows Server 2008 R2 running in VMware

hard drive, vmware-esx, vmware-esxi, vmware-vsphere, windows-server-2008-r2

We've been trying to fix a problem with our server for about a week now. Only in the past few days have we really found out what the issue is.

Some information up front: We run a VM with a third-party web host. I know that they use VMware, but I do not know which version. They updated our VMware Tools to the latest version last night. The server OS is Windows Server 2008 R2 and should have the most recent Windows updates installed. We use it as a web server, so it runs IIS 7.5 with ColdFusion 9.0.1 on top of that. ColdFusion 9 should be using the latest version of JDK 6 (I think there are/were compatibility issues with Java 7).

What we are seeing are short periods, anywhere from 30 seconds to 2.5 minutes, where the server basically comes to a "halt". It doesn't really lock up, but CPU usage drops to almost 0% and no web requests get handled.

Using the Windows Performance Monitor, we have discovered that when this happens, disk IO appears to drop off completely. Attached are images of graphs pulled from the Performance Monitor.

The first graph shows when this occurs. Notice that the Disk Idle % (green line) drops to 0. I assume this means the disk is busy the entire time, i.e. requests are outstanding throughout the interval. The CPU drops to near 0%, with occasional spikes. The purple line is the Disk Queue Length, which I assume shows how many disk IO operations are pending on the system. This is typically very low, around 1 or 2 and often 0. When this phenomenon occurs, it increases dramatically (which makes sense if there is something wrong with disk access).

The second graph shows when things come back up. The CPU is pegged as it starts to chomp through the backlog of web requests and other queued work, but the disk stats go back to "normal".
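For anyone wanting to pin down exactly when these stalls happen, the pattern in the graphs (near-zero CPU, near-zero disk idle time, elevated queue length) could be flagged automatically from exported counter samples. The following is a minimal Python sketch under stated assumptions: the `Sample` type, the function names, and every threshold (5% CPU, 5% idle, queue length of 5, 30-second minimum duration) are hypothetical values chosen for illustration, not anything taken from Performance Monitor itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One row of exported counter data (field names are assumptions)."""
    t: float              # seconds since the start of the capture
    cpu_pct: float        # CPU usage, percent
    disk_idle_pct: float  # Disk Idle %, percent
    queue_len: float      # Disk Queue Length


def is_stalled(s: Sample,
               cpu_max: float = 5.0,
               idle_max: float = 5.0,
               queue_min: float = 5.0) -> bool:
    # Stall signature: CPU almost idle, disk never idle, requests piling up.
    # All three thresholds are illustrative guesses and need tuning.
    return (s.cpu_pct <= cpu_max
            and s.disk_idle_pct <= idle_max
            and s.queue_len >= queue_min)


def find_stalls(samples: List[Sample],
                min_duration: float = 30.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs of stalled samples
    that last at least min_duration seconds."""
    stalls = []
    start = None
    last = None
    for s in samples:
        if is_stalled(s):
            if start is None:
                start = s.t
            last = s.t
        else:
            if start is not None and last - start >= min_duration:
                stalls.append((start, last))
            start = None
    # Handle a stall that is still in progress at the end of the capture.
    if start is not None and last - start >= min_duration:
        stalls.append((start, last))
    return stalls
```

Requiring all three conditions at once is a deliberate choice: high queue length with high CPU would look like ordinary load, whereas the combination above matches processes sitting blocked on IO.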

Not every time, but when an outage is very long (a few minutes), we also see warnings recorded in the Windows System event log. The Source is "LSI_SCSI" and the Event ID is "129", with a general message of "Reset to device, \Device\RaidPort0, was issued."
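To check whether those Event ID 129 warnings really line up with the outages, the event timestamps can be compared against the outage windows. Here is a small Python sketch of that comparison; the function name, the `slack` parameter, and the idea of feeding it timestamps copied out of the event log by hand are all assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]


def events_in_windows(event_times: List[datetime],
                      windows: List[Window],
                      slack: timedelta = timedelta(0)) -> List[Tuple[datetime, Window]]:
    """Pair each event with the first outage window that contains it.

    slack widens every window on both sides, since a reset may be
    logged slightly before or after the outage is noticed.
    """
    matched = []
    for ev in event_times:
        for start, end in windows:
            if start - slack <= ev <= end + slack:
                matched.append((ev, (start, end)))
                break  # one window per event is enough
    return matched
```

Events that match no window are simply left out of the result, so comparing the length of the output with the length of `event_times` shows how many resets happened outside a known outage.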

When this first started happening, we thought it was something in our code, but after seeing all of this, we feel the problem lies either with the OS or with the VM/VMware. I don't think it is load related; if it were, I would expect to see both high disk usage AND high CPU usage. The fact that the CPU is low leads me to believe that processes are simply blocked waiting on IO requests to return. We are working with our hosting provider as I write this to figure this out, but I thought I would try here for ideas. Thanks in advance for any help!

First Graph

Second Graph

Best Answer

Your provider may be running your VM on an overloaded host. If someone else's VM ties up the disk, you're just going to be stuck until their VM frees it up. There's nothing you can do about it except make a lot of noise.