How much contention is too much in VMware

virtualization vmware-vsphere

For a while now I've been trying to figure out why quite a few of our business-critical systems are getting reports of "slowness" ranging from mild to extreme. I've recently turned my eye to the VMware environment where all the servers in question are hosted.

I recently downloaded and installed the trial of the Veeam VMware management pack for SCOM 2012, but I'm having a hard time believing (and so is my boss) the numbers it is reporting. To convince my boss that the numbers are accurate, I started looking into the VMware client itself to verify the results.

I've looked at this VMware KB article, specifically at the definition of Co-Stop, which is:

Amount of time a MP virtual machine was ready to run, but incurred
delay due to co-vCPU scheduling contention

Which I am translating to

The guest OS needs time from the host but has to wait for resources to become available and therefore can be considered "unresponsive"

Does this translation seem correct?

If so, here is where I have a hard time believing what I am seeing: the host that contains the majority of the "slow" VMs is currently showing a CPU Co-stop average of 127,835.94 milliseconds!

Does this mean that on average the VMs on this host have to wait 2+ minutes for CPU time???

This host has two 4-core CPUs and runs one 8-vCPU guest and fourteen 4-vCPU guests.

Best Answer

I can describe some of the experiences I've had in this area...

I don't believe that VMware does an adequate job of educating customers (or administrators) about best practices, nor does it update former best practices as its products evolve. This question is an example of how a core concept like vCPU allocation isn't fully understood. The best approach is to start small, with a single vCPU, until you determine that the VM requires more.

For the OP, the ESXi host server has two quad-core CPUs, yielding 8 physical cores.

The virtual machine layout being described is 15 total guests; 1 x 8 vCPU and 14 x 4 vCPU systems. That's way too overcommitted, especially with the existence of a single guest with 8 vCPUs. It makes no sense. If you need a VM that big, you likely need a bigger server.
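To put a number on "overcommitted", here is a quick sketch of the arithmetic using the layout from the question. It assumes every guest is powered on and counts physical cores only (no hyperthreading); the variable names are just for illustration.

```python
# Host and guest layout as described in the question.
physical_cores = 2 * 4          # two quad-core CPUs
guests = [(1, 8), (14, 4)]      # (number of guests, vCPUs per guest)

# Total vCPUs allocated across all guests.
total_vcpus = sum(count * vcpus for count, vcpus in guests)

# vCPU-to-physical-core overcommit ratio.
ratio = total_vcpus / physical_cores

print(f"{total_vcpus} vCPUs on {physical_cores} cores = {ratio:.0f}:1")
```

That works out to 64 vCPUs on 8 cores, or 8:1, and the 8-vCPU guest alone is as wide as the entire host.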

Please try to right-size your virtual machines. I'm pretty certain most of them can live with 2 vCPUs. Adding virtual CPUs does not make things run faster, so if that was the remedy for a performance problem, it's the wrong approach to take.

In most environments, RAM is the most constrained resource. But CPU can be a problem if there's too much contention. You have evidence of this. RAM can also be an issue if too much is allocated to individual VMs.

It's possible to monitor this. The metric you're looking for is "CPU Ready %". You can access this from the vSphere client by selecting a VM and going to Performance > Overview > CPU Graph.

  • Under 5% CPU Ready - You're fine.
  • 5-10% CPU Ready - Keep a close eye on activity.
  • Over 10% CPU Ready - Not good.
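The vSphere charts report ready time (and co-stop) as a summation in milliseconds per sample interval, not as a percentage. A rough sketch of the conversion follows, based on my understanding of VMware's documented formula: summation divided by the interval length in milliseconds, times 100. The interval table reflects the standard chart intervals as I know them, the function names are made up for this example, and the thresholds are simply the three bands listed above.

```python
# Sample interval (in seconds) for each vSphere chart view.
# Assumption: these are the default chart intervals.
CHART_INTERVAL_S = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def summation_to_percent(summation_ms, interval_s):
    """Convert a summation value (ms) into a percentage of the sample interval."""
    return summation_ms * 100 / (interval_s * 1000)


def classify(ready_pct):
    """Map a CPU Ready percentage onto the three bands listed above."""
    if ready_pct < 5:
        return "fine"
    if ready_pct <= 10:
        return "keep a close eye on it"
    return "not good"


# Hypothetical sample: 1,500 ms of ready time read from the real-time chart.
pct = summation_to_percent(1500, CHART_INTERVAL_S["realtime"])
print(f"{pct:.1f}% -> {classify(pct)}")
```

A summation figure is meaningless until you know which interval it covers and whether it was added up across several vCPUs or VMs. The question doesn't say which interval the 127,835.94 ms co-stop figure was taken over, so it can't be read directly as "two minutes of waiting".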

Note the yellow line in the graph below.

Would you mind checking this on your problem virtual machines and reporting back?