Disk IOs per device (IOs/second)
With traditional hard drives this is a very important number. An I/O operation is a single read or write request to the disk. Rotational spindles deliver from a few dozen to perhaps 200 IOPS, depending on the disk's rotational speed and the access pattern.
That is not the whole story: modern operating systems have I/O schedulers that try to merge several I/O requests into one and speed things up that way. RAID controllers and similar hardware also perform some smart reordering of I/O requests.
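As a rough illustration of where the IOPS number comes from, here is a minimal Python sketch that derives IOPS by sampling the per-device read/write completion counters twice. The device name 'sda' and the /proc/diskstats field layout are assumptions taken from the kernel's iostats documentation, not from this article.

```python
#!/usr/bin/python
# Estimate IOPS from two /proc/diskstats snapshots.
# Field layout (per kernel Documentation/admin-guide/iostats.rst):
#   f[2] = device name, f[3] = reads completed, f[7] = writes completed

def parse_diskstats(text, device):
    """Return (reads_completed, writes_completed) for one device, or None."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 8 and f[2] == device:
            return (int(f[3]), int(f[7]))
    return None

def iops(sample1, sample2, interval, device):
    """Completed read+write operations per second between two snapshots."""
    r1, w1 = parse_diskstats(sample1, device)
    r2, w2 = parse_diskstats(sample2, device)
    return ((r2 - r1) + (w2 - w1)) / float(interval)
```

In practice you would read /proc/diskstats twice with time.sleep(interval) in between and feed both snapshots to iops().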
Disk latency per device (Average IO wait)
How long it takes from issuing an I/O request to an individual disk until the data actually comes back. If this hovers around a couple of milliseconds, you are OK; if it is dozens of milliseconds, your disk subsystem is starting to sweat; if it is hundreds of milliseconds or more, you are in big trouble, or at least have a very, very slow system.
IO Service Time
How long the disk subsystem (possibly containing many disks) takes to actually service a request, excluding time spent queueing; together with the queue wait this makes up the total latency.
IOStat (blocks/second read/written)
How many disk blocks were read/written per second. Look at the spikes as well as the average. If the average starts to approach the maximum throughput of your disk subsystem, it is time to plan a performance upgrade; actually, plan it well before that point.
Available entropy (bytes)
Some applications want "true" random data. The kernel gathers that "true" randomness from several sources, such as keyboard and mouse activity, a hardware random number generator found on many motherboards, or even from video/music files (video-entropyd and audio-entropyd can do that).
If your system runs out of entropy, the applications wanting that data stall until more becomes available. I have personally seen this happen with the Cyrus IMAP daemon and its POP3 service: it generated a long random string before each login, and on a busy server that drained the entropy pool very quickly.
One way around the problem is to switch the applications to semi-random data only (/dev/urandom), but that is beyond this topic.
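For what it's worth, here is a minimal sketch of checking the pool yourself. The /proc path is the standard Linux location (note that the kernel reports entropy in bits), and the existence check makes the snippet degrade to None on systems without it:

```python
# Read the kernel's available-entropy counter (in bits) if present.
import os

def available_entropy(path='/proc/sys/kernel/random/entropy_avail'):
    """Return available entropy in bits, or None if the file is missing."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())
```

A monitoring check could alert when this drops near zero for a sustained period.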
VMStat (running/I/O sleep processes)
The procs columns of vmstat show how many processes are runnable (r) and how many are blocked in uninterruptible sleep (b), which usually means they are waiting on I/O. A consistently high b count is a strong hint that processes are stuck waiting for the disk.
Disk throughput per device (bytes/second read/written)
This is simply bytes read/written per second, usually a more human-readable figure than blocks, whose size varies. Block size differs depending on the disks used, the file system (and its settings), and so on: sometimes it is 512 bytes, sometimes 4096, sometimes something else.
inode table usage
With file systems that allocate inodes dynamically (such as XFS), nothing to worry about. With file systems that have static inode tables (such as ext3), everything. If you combine static inodes, a huge file system, and a huge number of directories and small files, you can reach a point where you cannot create any more files on that partition, even though in theory plenty of free space remains. No free inodes == bad.
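A quick way to watch for this is Python's os.statvfs; the path you check and the percentage you alert on are up to you:

```python
# Report inode usage for the filesystem holding `path`.
import os

def inode_usage(path='/'):
    """Return (used, total, percent_used) inodes for the filesystem at path."""
    st = os.statvfs(path)
    if st.f_files == 0:  # filesystems with dynamic inodes may report 0 here
        return (0, 0, 0.0)
    used = st.f_files - st.f_ffree
    return (used, st.f_files, 100.0 * used / st.f_files)
```

Something like `df -i` gives you the same numbers from the shell.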
This happens because RSS is not an authoritative value for how much memory a program is using. It is an authoritative value for how much resident memory is mapped by that program, and there is a difference.
RSS can at best be only used as a hint to how much memory you are utilizing.
The kernel has a lot of tricks up its sleeve to save memory. Processes may share lots of memory, especially processes that fork.
If you have a parent that allocates 100 MB of memory and then spawns a child, both of these processes will share that region. Both the parent and the child will report an RSS of >= 100 MB, because they both map the same region of memory. Technically this is correct: the RSS of the parent process is >= 100 MB, as that is how much memory it has mapped, and the child process also has an RSS >= 100 MB because it too has that much mapped; it just happens that both processes share (mostly) the same mappings.
You can demonstrate this with some simple Python.
#!/usr/bin/python
import os, sys, signal

HOG = 'A' * 104857600  ## 100 MB

try:
    for i in range(100):
        pid = os.fork()
        if pid:
            continue
        else:
            break
    signal.pause()
except KeyboardInterrupt:
    sys.exit(0)
This program creates a 100 MB area of memory and fills it with 'A's. It then spawns 100 children (101 processes in total) and waits for a Ctrl-C.
This is the situation before running it:
$ top -bn1 -u matthew
top - 21:03:04 up 11 min, 1 user, load average: 0.04, 0.08, 0.09
Tasks: 212 total, 1 running, 211 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.7 us, 0.3 sy, 0.0 ni, 98.7 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16124248 total, 1513728 used, 14610520 free, 78268 buffers
KiB Swap: 8069116 total, 0 used, 8069116 free, 578148 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1837 matthew 20 0 767916 5072 3400 S 0.0 0.0 0:00.06 gnome-keyr+
1880 matthew 20 0 13920 608 468 S 0.0 0.0 0:00.00 dbus-launch
1949 matthew 20 0 307180 2804 2312 S 0.0 0.0 0:00.01 gvfsd
2051 matthew 20 0 337684 2908 2436 S 0.0 0.0 0:00.00 at-spi-bus+
2059 matthew 20 0 127260 2920 2360 S 0.0 0.0 0:00.05 at-spi2-re+
2082 matthew 9 -11 486316 7044 4376 S 0.0 0.0 0:00.09 pulseaudio
2121 matthew 20 0 317660 2952 2324 S 0.0 0.0 0:00.00 gvfs-gphot+
2132 matthew 20 0 1440732 105732 30156 S 0.0 0.7 0:09.64 gnome-shell
2145 matthew 20 0 513076 3996 3064 S 0.0 0.0 0:00.00 gsd-printer
2160 matthew 20 0 313300 3488 2940 S 0.0 0.0 0:00.00 ibus-dconf
2172 matthew 20 0 775428 14000 10348 S 0.0 0.1 0:00.05 gnome-shel+
2182 matthew 20 0 319120 7120 5444 S 0.0 0.0 0:00.07 mission-co+
2196 matthew 20 0 232848 2708 2164 S 0.0 0.0 0:00.00 gvfsd-meta+
2206 matthew 20 0 408000 11828 8084 S 0.0 0.1 0:00.06 abrt-applet
2209 matthew 20 0 761072 15120 10680 S 0.0 0.1 0:00.13 nm-applet
2216 matthew 20 0 873088 14956 10600 S 0.0 0.1 0:00.09 evolution-+
2224 matthew 20 0 1357640 29248 14052 S 0.0 0.2 0:00.26 evolution-+
2403 matthew 20 0 295036 6680 3876 S 0.0 0.0 0:00.01 telepathy-+
2475 matthew 20 0 380916 2756 2264 S 0.0 0.0 0:00.00 gvfsd-burn
2486 matthew 20 0 8460 736 608 S 0.0 0.0 0:00.00 gnome-pty-+
2617 matthew 20 0 116412 3068 1596 S 0.0 0.0 0:00.04 bash
2888 matthew 20 0 457196 9868 5164 S 0.0 0.1 0:00.05 telepathy-+
3347 matthew 20 0 123648 1400 1020 R 0.0 0.0 0:00.00 top
Top shows 14610520 KiB of free memory.
Let's run our program:
$ python trick_rss.py & top -bn1 -u matthew
[2] 3465
top - 21:04:54 up 13 min, 1 user, load average: 0.05, 0.07, 0.08
Tasks: 415 total, 1 running, 414 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.7 us, 0.3 sy, 0.0 ni, 98.8 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16124248 total, 1832040 used, 14292208 free, 78320 buffers
KiB Swap: 8069116 total, 0 used, 8069116 free, 578144 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3465 matthew 20 0 227652 106676 1792 S 31.7 0.7 0:00.05 python
2483 matthew 20 0 641568 18736 11656 S 6.3 0.1 0:01.26 gnome-term+
1837 matthew 20 0 767916 5072 3400 S 0.0 0.0 0:00.06 gnome-keyr+
1880 matthew 20 0 13920 608 468 S 0.0 0.0 0:00.00 dbus-launch
1949 matthew 20 0 307180 2804 2312 S 0.0 0.0 0:00.01 gvfsd
2051 matthew 20 0 337684 2908 2436 S 0.0 0.0 0:00.00 at-spi-bus+
2059 matthew 20 0 127260 2920 2360 S 0.0 0.0 0:00.05 at-spi2-re+
2082 matthew 9 -11 486316 7044 4376 S 0.0 0.0 0:00.09 pulseaudio
2121 matthew 20 0 317660 2952 2324 S 0.0 0.0 0:00.00 gvfs-gphot+
2136 matthew 20 0 178692 2588 1788 S 0.0 0.0 0:00.00 dconf-serv+
2145 matthew 20 0 513076 3996 3064 S 0.0 0.0 0:00.00 gsd-printer
2160 matthew 20 0 313300 3488 2940 S 0.0 0.0 0:00.00 ibus-dconf
2172 matthew 20 0 775428 14000 10348 S 0.0 0.1 0:00.05 gnome-shel+
2182 matthew 20 0 319120 7120 5444 S 0.0 0.0 0:00.07 mission-co+
2196 matthew 20 0 232848 2708 2164 S 0.0 0.0 0:00.00 gvfsd-meta+
2206 matthew 20 0 408000 11828 8084 S 0.0 0.1 0:00.06 abrt-applet
2209 matthew 20 0 761072 15120 10680 S 0.0 0.1 0:00.14 nm-applet
2216 matthew 20 0 873088 14956 10600 S 0.0 0.1 0:00.10 evolution-+
2224 matthew 20 0 1357640 29248 14052 S 0.0 0.2 0:00.26 evolution-+
2403 matthew 20 0 295036 6680 3876 S 0.0 0.0 0:00.01 telepathy-+
2475 matthew 20 0 380916 2756 2264 S 0.0 0.0 0:00.00 gvfsd-burn
2487 matthew 20 0 116544 3316 1716 S 0.0 0.0 0:00.09 bash
2804 matthew 20 0 1239196 275576 41432 S 0.0 1.7 0:25.54 firefox
2890 matthew 20 0 436688 15932 7288 S 0.0 0.1 0:00.05 telepathy-+
3360 matthew 20 0 227652 106680 1792 S 0.0 0.7 0:00.05 python
3366 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3368 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3370 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3372 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3374 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3376 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3378 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3380 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3382 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3384 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3386 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3388 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3390 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3392 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3394 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3396 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3398 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3400 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3402 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3404 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3406 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3408 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3410 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3412 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3414 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3416 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3418 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3420 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3422 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3424 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3426 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3428 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3430 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3432 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3434 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3436 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3438 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3440 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3442 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3444 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3446 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3448 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3450 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3452 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3454 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3456 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3458 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3460 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3462 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3464 matthew 20 0 227652 105096 208 S 0.0 0.7 0:00.00 python
3467 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3469 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3471 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3473 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3475 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3477 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3479 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3481 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3483 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3485 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3487 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3489 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3491 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3493 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3495 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3497 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3499 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3501 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3503 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3505 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3507 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3509 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3511 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3513 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3515 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3517 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3519 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3521 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3523 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3525 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3527 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3529 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3531 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3533 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3535 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3537 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3539 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3541 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3543 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3545 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3547 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3549 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3551 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3553 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3555 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3557 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3559 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3561 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3563 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
3565 matthew 20 0 227652 105092 208 S 0.0 0.7 0:00.00 python
I have 14292208 KiB free; about 300 MB of memory has been used up. But if I went by what RSS is telling me, I would have used 10 GB of memory!
Finally, if you take a look at the process mappings, you can see that the virtual memory addresses are identical in each process.
$ pmap -x 3561
...
00007f05da5e8000 102404 102404 102404 rw--- [ anon ]
...
$ pmap -x 3565
...
00007f05da5e8000 102404 102404 102404 rw--- [ anon ]
...
Lazy Copying
This C program demonstrates lazy copying (copy-on-write) in action. All processes map the same region of memory, but the children then overwrite its contents. Behind the scenes the kernel remaps those pages to different locations in physical memory while keeping the same virtual addresses.
Each instance now genuinely takes up its own memory, yet the RSS value remains constant.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <string.h>

int main() {
    int i;
    char c = 65;

    signal(SIGCHLD, SIG_IGN);

    /* Allocate some memory */
    char *hog = malloc(104857600);
    memset(hog, c, 104857600);

    for (i = 1; i < 4; i++) {
        if (fork())
            continue;
        memset(hog, c + i, 104857600);
        break;
    }
    sleep(3);
    printf("Pid %d shows HOG[1048576] saying %c\n", getpid(), hog[1048576]);
    pause();
}
Compile with gcc -o trick_rss trick_rss.c, then run with free -m; ./trick_rss & sleep 5; free -m.
You get the following result:
$ free -m; ./trick_rss & sleep 5; free -m
total used free shared buffers cached
Mem: 15746 2477 13268 0 79 589
-/+ buffers/cache: 1808 13938
Swap: 7879 0 7879
[3] 4422
Pid 4422 shows HOG[1048576] saying A
Pid 4424 shows HOG[1048576] saying B
Pid 4425 shows HOG[1048576] saying C
Pid 4426 shows HOG[1048576] saying D
total used free shared buffers cached
Mem: 15746 2878 12867 0 79 589
-/+ buffers/cache: 2209 13536
Swap: 7879 0 7879
Best Answer
Try this out on a regular basis to see how the process count for a "certain" named process goes up and down. It disregards the PID and just looks at the end of the line, beyond the CPU time.
This works on a RHEL box. You might put it in cron after getting a baseline of what the starting process list looks like.
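As a sketch of that idea (my own guarded Linux-only variant scanning /proc/<pid>/comm, not the original ps-based one-liner, which is not shown here):

```python
# Count running processes by name by scanning /proc/<pid>/comm.
# Returns an empty dict on systems without a Linux-style /proc.
import glob

def count_procs():
    counts = {}
    for path in glob.glob('/proc/[0-9]*/comm'):
        try:
            with open(path) as f:
                name = f.read().strip()
        except (IOError, OSError):  # process may have exited mid-scan
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Run it periodically and diff against your baseline to spot a named process multiplying or dying off.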