(VMware) Will a Linux guest VM swap when RAM usage is high even though Active is not

memory

We are currently having a discussion in my environment. The assumptions are that

“The tools that are designed for a physical infrastructure memory management in Linux such as top, free, vmstat, /proc/meminfo don’t work on a VM in an ESXi hypervisor because they rely on the OS’s ability to monitor the physical RAM directly in a bare metal system. In a virtual infrastructure, the guest OS can't see the physical hardware, only the virtual hardware that has been emulated by the (ESXi) hypervisor. Because an OS can't directly access most of the host server's physical hardware, traditional performance monitoring tools won't function properly in a virtual infrastructure.”

In the VM guest (SLES 11) we’re seeing RAM utilization at around 93%. However in the vSphere/vCops is showing 5072309 kb Active Write of 8388608 (8GB) KB of allocated RAM. Roughly about 60.46 % utilization rate.

The question, is the above assumption correct and if so:

If top inside the guest is inaccurate in the VM and it reaches 100% RAM utilization does the VM guest then swap and if so I presume with the vswp which is located on the SAN and cause a slowdown on the VM Guest?

Best Answer

It depends. You should rely on the in-VM tools like top, vmstat, etc. They are accurate, assuming that your physical resources aren't too overcommitted and you've installed the VMware tools. At the vSphere level, you still have memory ballooning, TPS, compression and swap as fallbacks. The memory management really isn't that bad.

Also understand that vSphere/vCenter metrics and vCops metrics are totally different (vCops was acquired by VMware and uses its own algorithms to measure resource utilization).

Measure utilization at the VM level. Plan cluster-level resources using vCops or vCenter.

Don't overcommit your RAM too much because that WILL skew metrics and have a tangible performance impact.

They type of application also matters. If it's heavy Java-based, some other approaches (partial or full reservations) are necessary to help them run effectively in a vSphere environment.

Related Solutions

AIX swap usage shows ~20%, not even close to maximum load

What are your vmm settings settings ?

Make sure you have lru_file_repage=1.

vmo -o lru_file_repage

The default setting (lru_file_repage=0) has the effect that AIX pages out application pages, even if there are plenty of cached files around where pages could be thrown away. One effect is a growing paging space use.

The setting lru_file_repage=0 is the default up and including AIX 5.3

VSphere education – What are the downsides of configuring VMs with too much RAM

vSphere's memory management is pretty decent, though the terms used often cause a lot of confusion.

In general, memory over-commit should be avoided as it creates exactly this type of problem. However, there are times when it cannot be avoided, so forewarned is forearmed!

What are the downsides of overcommitting and over-configuring resources (specifically RAM) in vSphere environments?

The major downside of over-committing resources is that should you have contention, your hosts would be forced to balloon, swap or intelligently schedule/de-duplicate behind the scenes in order to give each VM the RAM it needs.

For ballooning, vSphere will inflate a "balloon" of RAM within a chosen VM, then give that ballooned RAM to the guest that needs it. This isn't really "bad" - VMs are stealing each other's RAM, so there's no disk swapping going on - but it could lead to mis-fired alerting and skewed metrics if these rely on analysing the VM's RAM usage, as the RAM won't be marked as "ballooned", just that it's "in use" by the OS.

The other feature that vSphere can use is Transparent Page Sharing (TPS) - which is essentially RAM de-duplication. vSphere will periodically scan all allocated RAM, looking for duplicated pages. When found, it will de-duplicate and free up the duplicated pages.

Take a look at vSphere's Memory Management whitepaper (PDF) - specifically "Memory Reclamation in ESXi" (page 8) - if you need a more in-depth explanation.

Assuming that the VMs can run in less RAM, is it fair to say that there's overhead to configuring virtual machines with more RAM than they need?

There's no visible overhead - you can allocate 100GB of RAM on a host with 16 GB (however, that doesn't mean you should, for the reasons above).

Total memory in use by all of your VMs is the "Active" curve shown in your graphs. Of course, you should never rely on just that figure when calculating how much you would like to overcommit, but if you have historical metrics as you have, you can analyse and work it out based on actual usage.

The difference between "Active" and "Consumed" RAM is discussed in this VMWare Community thread.

What is the counter-argument to: "if a VM has 16GB of RAM allocated, but only uses 4GB, what's the problem??"? E.g. do customers need to be educated?

The short answer to this is yes - customers should always be educated in best practices, regardless of the the tools at their disposal.

Customers should be educated to size their VMs according to what they use, rather than what they want. A lot of the time, people will over-specify their VMs just because they might need 16 GB of RAM, even if they're historically bumbling along on 2 GB day after day. As a vSphere administrator, you have the knowledge, metrics and power to challenge them and ask them if they actually need the RAM they've allocated.

That said, if you combine vSphere's memory management with carefully-controlled overcommit limits, you should rarely have an issue in practice, the likelihood of running out of RAM for an extended period of time is relatively remote.

In addition to this, automated vMotion (called Distributed Resource Scheduling by VMware) is essentially a load-balancer for your VMs - if a single VM is becoming a resource hog, DRS should migrate VMs around to make best use of the cluster's resources.

What specific metric should be used to meter RAM usage. Tracking the peaks of "Active" versus time?

Mostly covered above - your main concern should be "Active" RAM usage, though you should carefully define your overcommit thresholds so that if you reach a certain ratio (this is a decent example, though it may be slightly outdated). Typically, I would certainly stay within 120% of total cluster RAM, but it's up to you to decide what ratio you're comfortable with.

A few good articles/discussions on memory over-commit:

Best Answer

Related Solutions

AIX swap usage shows ~20%, not even close to maximum load

VSphere education – What are the downsides of configuring VMs with *too* much RAM

Related Topic

VSphere education – What are the downsides of configuring VMs with too much RAM