**Note: Undeleting for posterity**
Your problem is here:
```
# No alternate memory nodes if the system is not NUMA
# On computenodes use all available cores
cpuset {
    cpuset.mems="0";
    cpuset.cpus="0-47";
}
}
```
With `cpuset.mems="0"` you are only ever using one node of memory. You need to set this to use all memory nodes.
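For example, on a two-node machine the cpuset block would look something like the following (the "0-1" range is an assumption on my part; check how many nodes your host actually has, e.g. with `numactl --hardware`):

```
cpuset {
    # Allow allocations from every memory node, not just node 0
    cpuset.mems="0-1";
    cpuset.cpus="0-47";
}
```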
I also think the explanation below applies, and you will still see the problem unless you know about it, so I am leaving it in for posterity.
This issue basically boils down to the hardware being used. The kernel has a heuristic that determines the value of the zone_reclaim_mode switch, which alters how the kernel handles memory pressure on a NUMA system.
> zone_reclaim_mode:
>
> Zone_reclaim_mode allows someone to set more or less aggressive approaches to
> reclaim memory when a zone runs out of memory. If it is set to zero then no
> zone reclaim occurs. Allocations will be satisfied from other zones / nodes
> in the system.
>
> This is a value ORed together of:
>
> 1 = Zone reclaim on
> 2 = Zone reclaim writes dirty pages out
> 4 = Zone reclaim swaps pages
>
> zone_reclaim_mode is set during bootup to 1 if it is determined that pages
> from remote zones will cause a measurable performance reduction. The
> page allocator will then reclaim easily reusable pages (those page
> cache pages that are currently not used) before allocating off node pages.
>
> It may be beneficial to switch off zone reclaim if the system is
> used for a file server and all of memory should be used for caching files
> from disk. In that case the caching effect is more important than
> data locality.
>
> Allowing zone reclaim to write out pages stops processes that are
> writing large amounts of data from dirtying pages on other nodes. Zone
> reclaim will write out dirty pages if a zone fills up and so effectively
> throttle the process. This may decrease the performance of a single process
> since it cannot use all of system memory to buffer the outgoing writes
> anymore, but it preserves the memory on other nodes so that the performance
> of other processes running on other nodes will not be affected.
>
> Allowing regular swap effectively restricts allocations to the local
> node unless explicitly overridden by memory policies or cpuset
> configurations.
To give you some idea of what is going on: memory is broken up into zones, which is particularly relevant on NUMA systems, where RAM is tied to specific CPUs. On these hosts memory locality can be an important factor in performance. If, for example, memory banks 1 and 2 are assigned to physical CPU 0, then CPU 1 can access them, but at the cost of locking that RAM from CPU 0, which causes a performance degradation.
On Linux, the zoning reflects the NUMA layout of the physical machine; on your hardware each zone is 16 GB in size.
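You can confirm the layout with the `numactl` tools (assuming the numactl package is installed):

```
# Show the NUMA nodes, which CPUs belong to each, and per-node memory sizes
numactl --hardware

# Show per-node allocation statistics, including off-node allocations
numastat
```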
What is happening at the moment, with zone reclaim on, is that the kernel opts to reclaim (write dirty pages to disk, evict file cache, swap out memory) within a full zone (16 GB) rather than permit the process to allocate memory in another zone (which would impact performance on that CPU). This is why you notice swapping after 16 GB.
If you switch this value off, the kernel should stop aggressively reclaiming within a zone and instead allocate from another node.
Try switching off zone_reclaim_mode by running `sysctl -w vm.zone_reclaim_mode=0` on your system and then re-running your test.
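If that fixes the behaviour, you can make the change persistent across reboots; a minimal sketch (the file name under /etc/sysctl.d/ is just an example, and older distributions use /etc/sysctl.conf instead):

```
# Check the current value
sysctl vm.zone_reclaim_mode

# Disable zone reclaim for the running kernel
sysctl -w vm.zone_reclaim_mode=0

# Persist the setting across reboots
echo "vm.zone_reclaim_mode = 0" > /etc/sysctl.d/99-zone-reclaim.conf
```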
Note that long-running, high-memory processes on a configuration like this with zone_reclaim_mode switched off will become increasingly expensive over time.
If you allow lots of disparate processes on many different CPUs, all using lots of memory, to allocate from any node with free pages, you can effectively reduce the performance of the host to something akin to having only one physical CPU.
Configuration is a little different from Ubuntu. You need to add `CGROUP_DAEMON=sets:name` in /etc/sysconfig/libvirtd.
In your case it is:

```
CGROUP_DAEMON=memory:/mynamekvm
```
Restart all relevant services, that is cgconfig, libvirtd, and the guests. Also make sure SELinux is configured properly, or try disabling it, and then restart the services.
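On a RHEL/CentOS style host (an assumption; adjust for your init system) that would look something like:

```
# Re-read the cgroup configuration, then restart libvirt and the guests
service cgconfig restart
service libvirtd restart

# If SELinux is suspected, switch to permissive mode temporarily for testing
setenforce 0
```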
**Best Answer**
Please try the following example, assuming you wish to limit the resources for UID 1000 and cap its CPU shares.
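A minimal sketch using cgconfig/cgrules (the group name `limited` and the user name `someuser` for UID 1000 are placeholders of mine; on a systemd-based host you could instead run `systemctl set-property user-1000.slice CPUShares=512`):

```
# /etc/cgconfig.conf -- define a group with reduced CPU shares (default is 1024)
group limited {
    cpu {
        cpu.shares = 512;
    }
}
```

```
# /etc/cgrules.conf -- route all processes of that user into the group
# <user>      <controllers>   <destination>
someuser      cpu             limited
```

Then restart cgconfig and cgred so the rules take effect.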