VMware HA memory overcommit

vmware-esxi vmware-vsphere

I'm trying to set up VMware HA for the cluster, and I'm having trouble understanding how resource monitoring works. We overcommit memory as a general practice, so our provisioned memory is always ~1.5x higher on individual VMs.

So I created a cluster with 2 Hosts in it, and one was ~90% full in terms of memory (and I mean used memory, as provisioned was ~140%). Second host ran no VMs. Tried powering on one VM – and I got an error, saying that would make it impossible to tolerate one host failure.

Reading more, I found that when this happens and you disable the policy to prevent power on, VMware will not guarantee failover for all hosts.

But does it mean that it will just not try if it thinks that there are not enough resources?
Or does it mean that something bad might happen because memory usage will go over all available, and it'll have to start swapping?
How does it make such decisions?

Best Answer

This is normal behavior. From vCenter Server right click on Cluster > Edit Settings > VMware HA and check "Disable: Power on VMs that violate availability constraints." That would fix your issue.

Basically, in a 2-node HA cluster, when one node is down the VMs have nowhere to fail over to if the second node also fails. Disabling that check is normal for 2-node clusters.

If you have 3 or more nodes in a cluster, then you can keep that option enabled.

Admission Control and Slots

The "slot" mechanic is intended to have a generic way to break the cluster into an estimate of VM-sized chunks, then make sure that with the loss of the configured number of hosts, there will still be slots available for every VM running.

Calculating the slot size

The first thing that happens is the slot size calculation. What it's looking for is reserved resources, or more specifically the biggest VMs in your cluster in terms of reserved resources. The reason for this is to make sure that each resource slot in the cluster can provide enough resources to satisfy the reservations on that biggest VM, so that it isn't degraded below its reserved minimum in the event of an HA failover.

For CPU, the highest reservation on a VM is found and used as the slot size. If there is no reservation, the minimum is used, which is configured in the cluster's das.vmCpuMinMHz setting; the default is 256 MHz in 4.1, but dropped to 32 MHz in 5.0.

For memory, the memory reservation plus the memory overhead for the VM is used - so, with no reservations, your slot size will be the larger number between the memory overhead number of your highest VM or the cluster's das.vmMemoryMinMB setting, if it's configured (the documentation currently says the default is 256MB; that's not true, the default is 0 in 4.1+).

Those numbers combine to give you the slot size for the cluster.

Assuming you have no reservations, your slot size will likely be the minimums - 256 MHz for CPU, and the memory slot size will be the memory overhead size of the VM with the most overhead.
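To make the calculation above concrete, here is a minimal Python sketch of it. It is only a model of the description in this answer, not VMware's code: the `VM` class, its field names and the `slot_size` function are made up for illustration, and treating a VM with no CPU reservation as if it reserved `das.vmCpuMinMHz` is my reading of how the minimum is applied.

```python
from dataclasses import dataclass

@dataclass
class VM:
    # Hypothetical record for illustration; not a VMware API object.
    name: str
    cpu_reservation_mhz: int = 0
    mem_reservation_mb: int = 0
    mem_overhead_mb: int = 0

def slot_size(vms, vm_cpu_min_mhz=256, vm_mem_min_mb=0):
    """Return (cpu_mhz, mem_mb) for the cluster slot.

    vm_cpu_min_mhz models das.vmCpuMinMHz (256 is the 4.1 default),
    vm_mem_min_mb models das.vmMemoryMinMB (0 is the 4.1+ default).
    """
    # CPU: largest reservation; a VM without one counts as the minimum.
    cpu = max((vm.cpu_reservation_mhz or vm_cpu_min_mhz for vm in vms),
              default=vm_cpu_min_mhz)
    # Memory: largest (reservation + overhead), never below the minimum.
    mem = max((vm.mem_reservation_mb + vm.mem_overhead_mb for vm in vms),
              default=vm_mem_min_mb)
    return cpu, max(mem, vm_mem_min_mb)

# No reservations anywhere: the slot falls back to the minimums.
no_reservations = [VM("web01", mem_overhead_mb=90),
                   VM("db01", mem_overhead_mb=140)]
small_slot = slot_size(no_reservations)

# One VM with reservations drives the slot size for the whole cluster.
with_reservation = no_reservations + [
    VM("big01", cpu_reservation_mhz=2000,
       mem_reservation_mb=4096, mem_overhead_mb=200)]
big_slot = slot_size(with_reservation)
```

With the example numbers (which are invented), the first cluster gets a 256 MHz / 140 MB slot, and adding the single reserved VM pushes it to 2000 MHz / 4296 MB.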

At the opposite end of the spectrum, if you had one VM with a massive CPU reservation but no memory reservation, and another VM with a massive memory reservation but no CPU reservation, your slot size would be calculated from the large number on each of those reservations - each slot would be very large in both resources, which would be very limiting. You would be blocked by admission control from powering on VMs long before you reached a concerning level of resource usage.

To combat that particular problem, you can manually set an upper limit on the slot size for each resource:

das.slotCpuInMHz sets the maximum size on the CPU part of the slot calculation
das.slotMemInMB sets the maximum for memory

If you use those, the VMs over those numbers will be assigned multiple slots, so that they will still be guaranteed to have their reservation worth of resources after a failover.
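As a rough illustration of the multi-slot behavior, the sketch below counts how many slots a VM would occupy once the slot size is capped. The function name is hypothetical, and rounding up on each resource and taking the larger count is my assumption about how "multiple slots" is worked out; the answer itself only says that oversized VMs get more than one.

```python
import math

def slots_needed(cpu_mhz, mem_mb, slot_cpu_mhz, slot_mem_mb):
    """Slots one VM occupies, given its reserved CPU and its reserved
    memory plus overhead, under capped slot sizes.

    slot_cpu_mhz / slot_mem_mb model das.slotCpuInMHz / das.slotMemInMB
    and must be positive.
    """
    cpu_slots = math.ceil(cpu_mhz / slot_cpu_mhz)
    mem_slots = math.ceil(mem_mb / slot_mem_mb)
    # Every powered-on VM takes at least one slot.
    return max(1, cpu_slots, mem_slots)

# A VM reserving ~10 GB of memory against a 4096 MB slot cap.
big_vm_slots = slots_needed(cpu_mhz=500, mem_mb=10000,
                            slot_cpu_mhz=1000, slot_mem_mb=4096)

# A VM with no reservations still occupies one slot.
small_vm_slots = slots_needed(cpu_mhz=0, mem_mb=0,
                              slot_cpu_mhz=1000, slot_mem_mb=4096)
```

In this made-up example the big VM takes 3 slots instead of forcing every slot in the cluster up to 10 GB.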

Counting Slots and Enforcing Limits

Once your slot size has been determined, the number of slots on each host in the cluster is counted.

The lower resource determines their limit - so if a host can fit 100x the CPU slot size in terms of CPU resource, but can only fit 30x the memory limit, the host has 30 slots.

That number is added up for each host in the cluster. That's when the configured admission control limit kicks in. If you've configured the cluster to tolerate 1 host failure, then it drops the host with the most slots from its calculation; if 2 host failures are set, the two hosts with the highest slot counts are dropped. It assumes that you'll lose your biggest hosts.

Once that's done, the slot counts on the remaining hosts in the cluster are added up - and that's the number of slots that you're allowed to have VMs running in before it'll block you from powering up a VM.
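The counting and enforcement steps can be sketched like this. Again, this only models what's described above; the function names and the example host sizes are invented, and the real implementation surely has more to it.

```python
def host_slots(host_cpu_mhz, host_mem_mb, slot_cpu_mhz, slot_mem_mb):
    """Whole slots that fit on one host; the scarcer resource sets the limit.

    Slot sizes must be positive.
    """
    return min(host_cpu_mhz // slot_cpu_mhz, host_mem_mb // slot_mem_mb)

def usable_slots(per_host_slots, host_failures_to_tolerate):
    """Slots left after assuming the hosts with the most slots are lost."""
    ordered = sorted(per_host_slots)
    keep = max(len(ordered) - host_failures_to_tolerate, 0)
    return sum(ordered[:keep])

def can_power_on(used_slots, needed_slots, per_host_slots,
                 host_failures_to_tolerate):
    """Admission control check: does the VM still fit after a failover?"""
    limit = usable_slots(per_host_slots, host_failures_to_tolerate)
    return used_slots + needed_slots <= limit

# The host from the example above: 100 CPU slots but only 30 memory slots.
one_host = host_slots(host_cpu_mhz=25600, host_mem_mb=4200,
                      slot_cpu_mhz=256, slot_mem_mb=140)

# Two identical hosts, tolerating one failure: only one host's slots count.
two_node_limit = usable_slots([one_host, one_host], 1)
blocked = not can_power_on(28, 3, [one_host, one_host], 1)
```

Note what this means for a 2-host cluster like the one in the question: with 1 host failure tolerated, you only ever get one host's worth of slots, no matter how empty the second host is.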

The "Advanced Runtime Info" link in the cluster's summary tab will tell you what your slots are set to.

Runtime Info

(Ignore the vCPU count; that's no longer used for slot calculation)

Does that work?

I know what you're thinking.

"Wait a second, my VMs have a couple gig of RAM apiece! How in the heck are they supposed to run in a 'slot' of some arbitrary tiny size like 256MB?"

They're supposed to run at the minimum level specified by their resource reservation; if they have no reservation, then they're not necessarily supposed to run well.

If you have an HA failure when you already have fairly heavy resource usage, then you can run into serious resource contention when extra VMs are added onto the load.

If CPU resources are what's in contention, that just means there's an effective reduction of available CPU time for all running VMs. This can have a fairly severe impact in some cases - the virtual symmetric multiprocessing used on machines with multiple vCPUs can really start to suffer when there's contention for CPU time.

If memory is the resource that's in contention... well, that's when it gets interesting.

ESX(i) has a number of techniques to deal with memory contention. There's a great in depth document on them here, but to summarize, the hypervisor takes these approaches:

Transparent Page Sharing
- Looks for duplicate memory pages among the guest VMs. If there are multiple running the same OS, there are good chances that there are duplicates. The extra copies are thrown away and all attempts to read the memory are pointed to a single copy.
- This runs all the time, on a schedule as determined by the Mem.ShareScanTime setting - once per hour is the default.
Ballooning
- An agent running as part of the VMware Tools in the guest VM gets notification from the hypervisor that it wants some memory back. The agent running in the VM starts trying to take memory back from the guest by asking the guest OS for memory, inflating a "balloon" of empty memory usage by the agent. This forces the guest OS to take steps to free memory, which can include clearing out filesystem cache from memory or swapping memory pages to its own swap space. Once a memory page has been successfully grabbed into the balloon, it informs the host that the memory can be reclaimed.
- This is the first action that the host will take when it runs into memory contention.
Memory Compression
- Pages of VM memory are compressed, but still stored in main memory. This means there's still going to be a penalty to access that memory page, but it's still faster than pulling from swap.
- This is used as the last 'good' option before swapping.
Hypervisor Swap
- You've probably noticed that when a VM is started, a swap file equal to the VM's memory size is created on the datastore. This reservation effectively makes it so that, if absolutely necessary, the VM's entire main memory could reside on disk. If that happened, performance would obviously be terrible, which is why this option is the last resort. Entire memory pages are moved to the swap file on the datastore; when accessed, they need to be retrieved from disk.

It's no accident that the minimum slot size is related to the memory overhead of the VMs. That's because the overhead number is really the only memory that's absolutely necessary to run the VM - though "run" may be too strong of a word, if a VM's entire memory ends up in swap.

So, admission control isn't looking to make sure all your VMs are running well after an HA failover - it's looking to make sure that all of your VMs can run.

So, What Should I Do?

Admission control tries to enforce a minimum level of service in the event of an HA failover. But that only works when you've actually defined a minimum level of service; a lot of environments don't want or need reservations.

If you're going to use admission control, I'd recommend investigating your slot sizes and nudging them toward values that make sense to you; don't start creating reservations if you don't need them just to influence admission control.

If your slot size is being set at or near the minimums due to a lack of any reservations in the cluster, then nudge it toward being more of a "normal" VM's size for your environment. Set something like this in your cluster's advanced settings:

das.vmCpuMinMHz = 500
das.vmMemoryMinMB = 2048

If your slot size is being set too high due to a small number of high-reservation VMs, then push it down appropriately.

das.slotCpuInMHz = 1000
das.slotMemInMB = 4096

Make sure the values that admission control comes up with for slot size make sense for your environment - you definitely don't want to be running half your VMs from swap space because admission control thinks you don't care about their level of service!
