(1) I see that each of the running processes occupies a very small percentage of memory (%MEM is no more than 0.2%, and mostly just 0.0%), so how is the total memory almost entirely used, as in the fourth line of output ("Mem: 130766620k total, 130161072k used, 605548k free, 919300k buffers")? The sum of the memory percentages over all processes seems unlikely to reach almost 100%, doesn't it?
To see how much memory you are currently using, run free -m. It will provide output like:
             total       used       free     shared    buffers     cached
Mem:          2012       1923         88          0         91        515
-/+ buffers/cache:       1316        695
Swap:         3153        256       2896
The top row's 'used' value (1923) will almost always nearly match the top row's total value (2012), since Linux likes to use any spare memory to cache disk blocks (the 515 in the 'cached' column).
The key figure to look at is the 'used' value in the buffers/cache row (1316). This is how much memory your applications are actually using. For best performance, this number should be less than your total memory (2012). To prevent out-of-memory errors, it needs to be less than the total memory (2012) plus the swap space (3153).
If you wish to quickly see how much memory is effectively free, look at the 'free' value in the buffers/cache row (695). This is the total memory (2012) minus the actual used (1316). (2012 - 1316 = 696 rather than 695; the difference is just rounding, since free -m truncates to whole megabytes.)
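If you want to reproduce that buffers/cache arithmetic yourself, the raw numbers live in /proc/meminfo. A minimal sketch using the classic field names (newer kernels also expose MemAvailable, which is the better figure where it exists):

    # Approximate free's "-/+ buffers/cache" row from /proc/meminfo (values in kB).
    awk '/^MemTotal:/ {total=$2}
         /^MemFree:/  {free=$2}
         /^Buffers:/  {buf=$2}
         /^Cached:/   {cached=$2}
         END {
           used = total - free - buf - cached  # memory actually held by applications
           printf "apps used: %d MB, effectively free: %d MB\n", used/1024, (free+buf+cached)/1024
         }' /proc/meminfo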
(2) How should I understand the load average on the first line ("load average: 14.04, 14.02, 14.00")?
This article on load average uses a nice traffic analogy and is the best one I've found so far: Understanding Linux CPU Load - when should you be worried? In your case, as people have pointed out:
On a multi-processor system, the load is relative to the number of processor cores available. The "100% utilization" mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.
So, with a load average of 14.00 and 24 cores, your server is far from being overloaded.
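If you want that as a single number, divide the 1-minute load average by the core count; anything well below 1.0 means the CPUs have headroom. A quick sketch (nproc is in GNU coreutils; on older systems use grep -c ^processor /proc/cpuinfo instead):

    # Normalized load: 1-minute load average divided by the number of cores.
    read load1 _ < /proc/loadavg
    echo "scale=2; $load1 / $(nproc)" | bc

For your numbers this gives 14.04 / 24 = 0.58, i.e. well under the 1.0 mark.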
Hm, complex problem :-/. One point:
I think it might well be the case that our firewall (as stated earlier, only capable of handling 100 Mbit) could be the real reason for the high load average on the server, and thus even adding more servers wouldn't help us much.
If the firewall throttled data transfers, you would not see high CPU load (%user above); rather you would see higher %iowait (as that includes network I/O). So that seems unlikely (unless the app does some kind of polling).
I think your best course of action is to examine more closely the high %user on the app servers; do some kind of profiling to find out what exactly the app does when it loads the CPU. That should give you a clue.
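As a concrete starting point for that profiling (assuming the sysstat and perf packages are installed, which is not the default everywhere):

    # Per-process CPU usage, three samples of five seconds each.
    pidstat -u 5 3
    # Live, system-wide view of the hottest functions, with call graphs.
    perf top -g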
The firewall might also be worth looking into, as you are getting close to its capacity.
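To see how close you actually are to the firewall's 100 Mbit limit, you can sample the interface byte counters; a rough sketch (the interface name eth0 is an assumption, adjust to yours):

    # Sample eth0 traffic over 10 seconds and print the rate in Mbit/s.
    rx1=$(cat /sys/class/net/eth0/statistics/rx_bytes)
    tx1=$(cat /sys/class/net/eth0/statistics/tx_bytes)
    sleep 10
    rx2=$(cat /sys/class/net/eth0/statistics/rx_bytes)
    tx2=$(cat /sys/class/net/eth0/statistics/tx_bytes)
    echo "rx: $(( (rx2 - rx1) * 8 / 10 / 1000000 )) Mbit/s, tx: $(( (tx2 - tx1) * 8 / 10 / 1000000 )) Mbit/s"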
Best Answer
There are several options; the one I'd start with is atop:
atop is available via the EPEL repo for CentOS/RHEL/Fedora and via the default repos of Debian/Ubuntu.
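For example, installation is just the stock package manager (the package is named atop in both families):

    # CentOS/RHEL (with EPEL enabled) and Fedora:
    yum install atop
    # Debian/Ubuntu:
    apt-get install atop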
You can use atop like a normal real-time top utility, with slightly different behaviour (check out the manpage for keystrokes).
The more interesting part is: once installed, a daemon starts logging data into /var/log/atop, and you can read these files with atop again.
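For instance, to replay a recorded day (the file name pattern is typically atop_YYYYMMDD; the date below is just an example):

    # Open the raw log for a given day and browse it like live atop.
    atop -r /var/log/atop/atop_20130101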
You then have access to all the top-like functions (sorting, looking at memory/CPU/IO usage, etc.), and you can jump 10 minutes forward in time with 't', 10 minutes back with 'T', or to a specific time with 'b'.
Check out the atop manpage; Google also turns up lots of howtos about it.
There might be other solutions, but atop is easy to understand and use, and a good starting point before building more bespoke setups.