This will basically depend on your virtual machines and their memory usage. ESXi employs a number of techniques that allow it to overcommit memory for guests:
Memory compression: memory pages that have been inactive for a while are compressed, then uncompressed and served upon request instead of being swapped to disk or ballooned. The compression cache has a configurable upper limit, set to 10% of the guest's assigned memory by default, and according to this VMware white paper you can roughly estimate a 6% performance decrease when the compression cache is used in real-world scenarios.
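As a rough illustration of the mechanism (not VMware's actual implementation; the class, constants, and zlib choice are all mine), a compression cache can be modeled as a size-capped store of compressed pages that is consulted before falling back to swap:

```python
import zlib

PAGE_SIZE = 4096          # typical page size in bytes
CACHE_FRACTION = 0.10     # default cap: 10% of the guest's assigned memory

class CompressionCache:
    """Toy model of a per-guest page compression cache."""
    def __init__(self, guest_mem_bytes):
        self.capacity = int(guest_mem_bytes * CACHE_FRACTION)
        self.used = 0
        self.pages = {}   # page number -> compressed bytes

    def compress(self, page_no, data):
        """Try to stash an inactive page; fall back (e.g. to swap) if full."""
        blob = zlib.compress(data)
        if self.used + len(blob) > self.capacity:
            return False  # cache full: the hypervisor would swap instead
        self.pages[page_no] = blob
        self.used += len(blob)
        return True

    def fault_in(self, page_no):
        """Serve a guest access by decompressing, much cheaper than disk I/O."""
        blob = self.pages.pop(page_no)
        self.used -= len(blob)
        return zlib.decompress(blob)
```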
Transparent page sharing: virtual memory pages of different guests that are found to carry identical content are mapped to the same physical memory page. This is an asynchronous operation that regularly frees duplicate memory pages.
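Conceptually this works like content-addressable deduplication. The sketch below (hashing alone is a simplification of how ESXi actually matches and write-protects pages) maps identical page contents of different guests to one backing copy:

```python
import hashlib

def share_pages(guest_pages):
    """guest_pages: {guest_id: [page_bytes, ...]}.

    Returns one physical copy per unique page content plus a per-guest
    mapping into it, mimicking transparent page sharing.
    """
    backing = {}    # content hash -> single physical copy
    mapping = {}    # (guest_id, page_index) -> content hash
    for guest, pages in guest_pages.items():
        for i, page in enumerate(pages):
            digest = hashlib.sha256(page).hexdigest()
            backing.setdefault(digest, page)   # first copy wins
            mapping[(guest, i)] = digest       # later duplicates just reference it
    return backing, mapping

# Two guests with an identical zero page end up sharing one physical copy:
backing, _ = share_pages({"vm1": [b"\x00" * 4096], "vm2": [b"\x00" * 4096]})
assert len(backing) == 1
```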
Ballooning: a kernel-level driver in the guest, supplied with the VMware Tools, claims memory in the guest's nonpaged memory pool and marks it as "free" for the hypervisor. This way, the memory is effectively "stolen" from the guest temporarily, inducing guest-level swapping should the guest really need it.
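The balloon's effect on both sides can be sketched as follows; the class and numbers are illustrative, not the real vmmemctl interface:

```python
class Guest:
    """Toy guest: inflating the balloon pins guest memory so the
    hypervisor can hand the underlying physical pages to other VMs."""
    def __init__(self, assigned_mb):
        self.assigned_mb = assigned_mb
        self.balloon_mb = 0

    def inflate_balloon(self, target_mb):
        # The driver allocates nonpaged memory inside the guest...
        self.balloon_mb = min(target_mb, self.assigned_mb)
        # ...and the guest OS, seeing less free memory, may start swapping.
        return self.balloon_mb  # MB the hypervisor can now reclaim

    def usable_mb(self):
        return self.assigned_mb - self.balloon_mb

vm = Guest(assigned_mb=4096)
reclaimed = vm.inflate_balloon(target_mb=512)
print(f"hypervisor reclaimed {reclaimed} MB, guest sees {vm.usable_mb()} MB usable")
```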
Hypervisor swapping: if everything else fails and more memory is needed, ESXi swaps guest memory pages to disk. The location of the swap file is configurable; by default it is placed in the same directory as the guest's configuration files.
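Swapping is the last resort because every fault costs a disk round trip. A minimal sketch of the idea (the file name and layout are made up, not the real .vswp format):

```python
import os, tempfile

class HypervisorSwap:
    """Toy per-VM swap file, one slot per 4 KiB page."""
    PAGE = 4096

    def __init__(self, vm_dir):
        # By default the swap file lives next to the VM's config files.
        self.f = open(os.path.join(vm_dir, "toy.vswp"), "w+b")
        self.slots = {}  # page number -> file offset

    def swap_out(self, page_no, data):
        offset = len(self.slots) * self.PAGE
        self.f.seek(offset)
        self.f.write(data.ljust(self.PAGE, b"\x00"))
        self.slots[page_no] = offset

    def swap_in(self, page_no):
        self.f.seek(self.slots[page_no])
        return self.f.read(self.PAGE)

swap = HypervisorSwap(tempfile.mkdtemp())
swap.swap_out(7, b"guest page contents")
assert swap.swap_in(7).startswith(b"guest page contents")
```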
For my typical loads, I have found that page compression and page sharing yield around 10% in memory savings over the memory overhead ESXi incurs, without notable performance degradation. Ballooning will always work as long as it is configured to (you can effectively turn it off by reserving the entire memory amount for the guest), but it is only marginally better than swapping: it helps where your guests would otherwise dynamically claim large amounts of memory for caching, yet if the guests are already memory-starved, it cannot work magic and will incur disk I/O through thrashing, just as hypervisor-level swapping would.
All summed up: if you could overcommit your guests by just about 10% and they would continue to run without in-guest swapping and the accompanying performance degradation, you would likely be fine with your 40% overcommitment. If not, you definitely would not.
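To put rough numbers on that (the host size is invented, and the ~10% saving from above is applied to configured memory for simplicity):

```python
host_mb = 32768                     # physical RAM on the host
configured_mb = int(host_mb * 1.4)  # 40% overcommitment -> 45,875 MB configured

savings_mb = int(configured_mb * 0.10)   # ~10% from compression + sharing
shortfall_mb = configured_mb - host_mb - savings_mb

print(f"configured: {configured_mb} MB, reclaimed cheaply: {savings_mb} MB")
print(f"left to ballooning/swapping if all guests touch their memory: {shortfall_mb} MB")
```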
The output of the memory page of esxtop (just press m after starting esxtop from the SSH console) will give you real-time memory statistics in more detail than the graphs in the vSphere client, so it might be worth looking there:
1:54:52pm up 34 days 8:39, 214 worlds; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM /MB: 32766 total: 1031 vmk, 29568 other, 2166 free
VMKMEM/MB: 32103 managed: 1926 minfree, 13525 rsvd, 18577 ursvd, high state
NUMA /MB: 8123 ( 767), 8157 ( 2425), 8157 ( 186), 7835 ( 128)
PSHARE/MB: 2162 shared, 139 common: 2023 saving
SWAP /MB: 0 curr, 0 rclmtgt: 0.00 r/s, 0.00 w/s
ZIP /MB: 17 zipped, 10 saved
MEMCTL/MB: 295 curr, 292 target, 14289 max
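If you want to track these numbers over time rather than watch the screen, the lines above are regular enough to scrape; this toy parser (the regexes and field names are my own, applied to a captured screen rather than esxtop's batch mode) pulls out the sharing, compression, and swap figures:

```python
import re

SCREEN = """PSHARE/MB: 2162 shared, 139 common: 2023 saving
ZIP /MB: 17 zipped, 10 saved
SWAP /MB: 0 curr, 0 rclmtgt: 0.00 r/s, 0.00 w/s"""

def memory_savings(screen):
    """Extract MB saved by page sharing and compression, plus current swap."""
    pshare = int(re.search(r"PSHARE/MB:.*?(\d+) saving", screen).group(1))
    zipped = int(re.search(r"ZIP\s*/MB:.*?(\d+) saved", screen).group(1))
    swapped = int(re.search(r"SWAP\s*/MB:\s*(\d+) curr", screen).group(1))
    return pshare, zipped, swapped

pshare, zipped, swapped = memory_savings(SCREEN)
print(f"sharing saves {pshare} MB, compression saves {zipped} MB, {swapped} MB swapped")
```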
After some (long) conversations with VMware support, I have come to the following understanding:
The number in "Reserved Capacity" is not a function of the memory configuration for the cluster's VMs. It is the sum of several factors: any memory reservations declared on VMs, a value calculated from the HA admission policy, and an additional amount for memory management overhead. The HA admission control value is directly derived from the admission control policy; in my case, since I had it set to tolerate a single host's failure, the total amount of RAM on one of my hosts was added to the cluster's reserved capacity.
Among other constraints, it appears that HA admission control will not allow the reserved capacity to exceed the RAM in a single host. (Either that or it won't allow the available capacity to drop below the RAM on a single host; I'm still not clear on which of these is really the case, since they're the same thing in my two-host cluster.) This has the net result that practically any amount of memory reservation is incompatible with what would otherwise seem to be natural settings for HA admission policy in a two-host cluster. Since Fault Tolerance forces memory reservations, that makes it similarly incompatible. I was told that if there were more hosts in the cluster, the reserved capacity would be "spread out" across more of them and some degree of memory reservation would be possible.
The net result for me is that I had to change my HA admission policy to reserve a percentage of the available resources (instead of "one host's worth") and calculate that percentage to exclude any memory reservations necessitated by the use of Fault Tolerance.
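One possible reading of that calculation as back-of-the-envelope arithmetic (the cluster sizes and FT reservations here are invented, and this is my interpretation of "excluding" the FT reservations, not a formula from VMware):

```python
hosts_mb = [32768, 32768]     # two-host cluster
ft_reservations_mb = 8192     # RAM pinned by Fault Tolerance VMs

cluster_mb = sum(hosts_mb)
# Reserve roughly one host's worth for failover, less what FT already
# reserves, expressed as the percentage the admission policy asks for:
failover_pct = 100 * (max(hosts_mb) - ft_reservations_mb) / cluster_mb
print(f"set 'percentage of cluster resources reserved' to ~{failover_pct:.0f}%")
```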
Best Answer
The impact is greater than if you were running an older CPU with no support for EPT, but realistically the only way to determine if it's going to affect your workload is to actually profile the workload. Don't take the hypervisor out of the equation, because the whole point is to test the setup that you're running, not some hypothetical benchmark figure. Ignore Memtest86+, ignore bare-metal OSes, just find a virtual machine that's representative of a memory-intensive workload in your environment and beat the crap out of it.
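If you need something to beat it with while you assemble a representative VM, a crude memory thrasher run inside a test guest will at least exercise TLB misses and nested page table walks; this is a stand-in under that assumption, not a substitute for profiling your real workload:

```python
import random, time

def thrash(total_mb=1024, rounds=5, page=4096):
    """Touch random pages across a large buffer to defeat CPU caches and
    stress TLB / nested (EPT) page table walks."""
    buf = bytearray(total_mb * 1024 * 1024)
    pages = len(buf) // page
    t0 = time.perf_counter()
    for _ in range(rounds):
        for _ in range(pages):
            i = random.randrange(pages) * page
            buf[i] = (buf[i] + 1) & 0xFF   # read-modify-write one byte per page
    return (rounds * pages) / (time.perf_counter() - t0)

print(f"{thrash():,.0f} random page touches/sec")
```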
Guy is dead-on: most consolidated workloads are bound by the amount of memory in the system rather than by any other resource. The extra memory will probably help you by decreasing memory contention, letting your memory-intensive VMs keep more of their RAM as cache instead of having it ballooned out under pressure.