This question relates to the definition of specific metrics recorded in /proc/vmstat on RHEL 5.3.
I'm using nmon to monitor a load test which simulates 2500 users carrying out a day's workload in one hour. Recently I've seen poor performance, and I am in the process of diagnosing the cause and excluding various possibilities.
We are running Red Hat Enterprise Linux Server release 5.3 (Tikanga) on VMware ESX. The physical server I'm focussing on runs an Oracle Application Server (this comprises an Apache HTTP server and an OC4J J2EE container).
The nmon charts I am viewing show consistently non-zero values for the pswpin metric. Summarised: min = 4312; max = 245352; avg = 86734. Nmon reports these values in "kBytes per second".
The following metrics are zero throughout the test:
- pswpout
- pgpgin
- pgpgout
I'm confused as to what this combination of metrics means, given my understanding of paging and swapping.
My Question(s):
- Can someone please confirm what these metrics represent?
- Any idea what system activity might cause this kind of VM behaviour?
At the moment I'm trying to exclude virtual memory issues as a cause of poor performance.
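For anyone trying to pin down what I mean by these counters: on 2.6 kernels, pswpin/pswpout count pages moved in from and out to the swap device, while pgpgin/pgpgout count data transferred to and from block devices in general. Below is a minimal sketch of sampling the raw counters directly, which is how I'd cross-check nmon's figures. On a live box you would read /proc/vmstat twice; the snapshot strings here are made-up stand-ins so the logic can be shown self-contained:

```python
# Sketch: compute per-second rates for selected /proc/vmstat counters
# from two snapshots taken INTERVAL seconds apart. The two sample
# strings below are illustrative stand-ins for reading
# open('/proc/vmstat').read() on a real system.

INTERVAL = 10  # seconds between snapshots

def parse_vmstat(text):
    """Turn /proc/vmstat content ('name value' per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        name, value = line.split()
        stats[name] = int(value)
    return stats

def rates(before, after, interval, fields):
    """Per-second delta for each counter. Note pswpin/pswpout are in
    pages, so multiply by the page size in KiB (4 on this platform)
    to compare with nmon's kB/s figures."""
    return {f: (after[f] - before[f]) / float(interval) for f in fields}

snap1 = "pgpgin 1000\npgpgout 2000\npswpin 100\npswpout 0\n"
snap2 = "pgpgin 1400\npgpgout 2600\npswpin 160\npswpout 0\n"

r = rates(parse_vmstat(snap1), parse_vmstat(snap2), INTERVAL,
          ["pgpgin", "pgpgout", "pswpin", "pswpout"])
print(r["pswpin"])      # 6.0 pages/s
print(r["pswpin"] * 4)  # 24.0 kB/s at 4 KiB pages
```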
EDIT: I have found evidence of a large number of fork() calls throughout the test. I suspect the Apache daemons. But could process creation be the cause of these metrics?
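On the fork() theory: the kernel exposes the cumulative fork count since boot as the "processes" line in /proc/stat, so the fork rate over the test window is easy to measure. A hedged sketch (the snapshot strings are stand-ins for two reads of /proc/stat, and the numbers are invented):

```python
# Sketch: estimate forks/sec from the cumulative 'processes' counter
# in /proc/stat. On a live system, read '/proc/stat' twice,
# INTERVAL seconds apart; the strings below are illustrative.

INTERVAL = 60  # seconds between snapshots

def forks_total(stat_text):
    """Extract the cumulative fork count ('processes N') from /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("processes"):
            return int(line.split()[1])
    raise ValueError("no 'processes' line found")

before = "cpu 100 0 50 900\nprocesses 361523\n"
after  = "cpu 160 0 80 1400\nprocesses 364523\n"

fork_rate = (forks_total(after) - forks_total(before)) / float(INTERVAL)
print(fork_rate)  # 50.0 forks/sec
```

Note that fork() on its own would not normally read from swap unless the parent's pages were already swapped out, so a high fork rate alone would not obviously explain pswpin traffic.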
EDIT: I've added a typical sample of the VM output from nmon. Apologies for the poor formatting.
Thanks in advance for any responses.
T0001 -1 22 -1 -1 -1 150 -1 -1 -1 5196046163 -1 0 30 100199751 3060 -1 0 -1 885 -1 -1 -1 46163 -1 -1 18 -1 828189171 -1 -1 3838 -1 -1 -1 -1 -1 165231
03:07:23 Paging and Virtual Memory nr_dirty nr_writeback nr_unstable nr_page_table_pages nr_mapped nr_slab pgpgin pgpgout pswpin pswpout pgfree pgactivate pgdeactivate pgfault pgmajfault pginodesteal slabs_scanned kswapd_steal kswapd_inodesteal pageoutrun allocstall pgrotated pgalloc_high pgalloc_normal pgalloc_dma pgrefill_high pgrefill_normal pgrefill_dma pgsteal_high pgsteal_normal pgsteal_dma pgscan_kswapd_high pgscan_kswapd_normal pgscan_kswapd_dma pgscan_direct_high pgscan_direct_normal pgscan_direct_dma
03:07:33 -1 99 -1 -1 -1 241 0 0 0 77526 0 0 0 824 0 0 0 0 0 0 0 0 77526 0 0 0 0 0 0 0 78216 0 0 0 0 0 0
03:07:43 -1 10 -1 -1 -1 262 0 0 0 21653 0 0 8 500 2 0 0 0 0 0 0 0 21653 0 0 0 0 0 0 0 17675 0 0 0 0 0 0
03:07:53 -1 69 -1 -1 -1 257 0 0 0 115744 0 0 0 724 0 0 0 0 0 0 0 0 115744 0 0 0 0 0 0 0 -79544 0 0 0 0 0 0
03:08:03 -1 69 -1 -1 -1 196 0 0 0 81202 0 0 0 628 0 0 0 0 0 0 0 0 81202 0 0 0 0 0 0 0 -18335 0 0 0 0 0 0
03:08:13 -1 81 -1 -1 -1 205 0 0 0 29051 0 0 0 352 0 0 0 0 0 0 0 0 29051 0 0 0 0 0 0 0 24449 0 0 0 0 0 0
03:08:24 -1 91 -1 -1 -1 131 0 0 0 122795 0 0 0 1172 0 0 0 0 0 0 0 0 122795 0 0 0 0 0 0 0 9640 0 0 0 0 0 0
03:08:34 -1 6 -1 -1 -1 182 0 0 0 74914 0 0 4 372 1 0 0 0 0 0 0 0 74914 0 0 0 0 0 0 0 -24477 0 0 0 0 0 0
03:08:44 -1 38 -1 -1 -1 200 0 0 0 42957 0 0 4 464 1 0 0 0 0 0 0 0 42957 0 0 0 0 0 0 0 42778 0 0 0 0 0 0
03:08:54 -1 6 -1 -1 -1 141 0 0 0 89751 0 0 36 1000 9 0 0 0 0 0 0 0 89751 0 0 0 0 0 0 0 -9665 0 0 0 0 0 0
03:09:04 -1 6 -1 -1 -1 171 0 0 0 74740 0 0 4 516 1 0 0 0 0 0 0 0 74740 0 0 0 0 0 0 0 -24583 0 0 0 0 0 0
03:09:14 -1 10 -1 -1 -1 179 0 0 0 56063 0 0 0 500 0 0 0 0 0 0 0 0 56063 0 0 0 0 0 0 0 56384 0 0 0 0 0 0
03:09:24 -1 6 -1 -1 -1 74 0 0 0 75623 0 0 0 696 0 0 0 0 0 0 0 0 75623 0 0 0 0 0 0 0 -23994 0 0 0 0 0 0
03:09:34 -1 6 -1 -1 -1 137 0 0 0 75466 0 0 8 972 2 0 0 0 0 0 0 0 75466 0 0 0 0 0 0 0 -23837 0 0 0 0 0 0
03:09:44 -1 3 -1 -1 -1 153 0 0 0 72535 0 0 4 460 1 0 0 0 0 0 0 0 -927465 0 0 0 0 0 0 0 -26880 0 0 0 0 0 0
03:09:54 -1 6 -1 -1 -1 170 0 0 0 56775 0 0 0 284 0 0 0 0 0 0 0 0 56775 0 0 0 0 0 0 0 56895 0 0 0 0 0 0
03:10:04 -1 6 -1 -1 -1 166 0 0 0 74756 0 0 0 1116 0 0 0 0 0 0 0 0 74756 0 0 0 0 0 0 0 -24568 0 0 0 0 0 0
03:10:14 -1 6 -1 -1 -1 148 0 0 0 78043 0 0 0 432 0 0 0 0 0 0 0 0 78043 0 0 0 0 0 0 0 -21241 0 0 0 0 0 0
03:10:24 -1 64 -1 -1 -1 189 0 0 0 64057 0 0 0 412 0 0 0 0 0 0 0 0 64057 0 0 0 0 0 0 0 60788 0 0 0 0 0 0
Best Answer
I am 87% sure that each page that increments the pswpin counter should also increment pgpgin. You say it doesn't. Hmmm.
This may be too simplistic a thing to check (sorry!) but... are you 200% sure that the metric you observe is pswpin, not pgpgin? The latter would translate to: the process is reading some files. Another explanation is that the application had been heavily swapped out before the test, leaving the system with a lot of free memory, and during the test you are watching it "come back to life" (constantly swapping itself in as code execution progresses) without reading or writing any files. But why pgpgin isn't increasing along with pswpin in such a scenario is beyond my comprehension.
Maybe your charts are tweaked so that pswpin is subtracted from pgpgin? One point to back this up is that both metrics are typically in pages (in /proc/vmstat), yet you have them converted to kB/s.
EDIT: This might be ESX-related. My wild guess is that it is a side effect of either ballooning or transparent page sharing (TPS). Are you able to analyze via esxtop on the ESX host? Here is another esxtop guide.

EDIT: Your nmon stats seem broken. First of all, there are more column names than actual metrics (i.e. you don't have data for the last column, pgscan_direct_dma). There are a lot of -1 or 0 values for metrics that should be non-zero on a busy system; pgpgin is not the only one missing. Pgsteal and pgrotated are present, but sometimes negative, which should not be possible. So look at /proc/vmstat directly and see what's going on there, and use other tools to confirm nmon's stats.
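As a concrete way to act on that last suggestion: the counters in /proc/vmstat are cumulative and should never decrease between samples, so any negative delta flags a broken sample series. A sketch of that sanity check, assuming the samples have already been parsed into dicts (the values below are invented stand-ins):

```python
# Sketch: sanity-check a series of /proc/vmstat samples. The counters
# are monotonically non-decreasing, so a counter going down between
# consecutive samples (as pgalloc_normal and pgscan_kswapd_normal
# appear to do in the nmon output above) means the collection is broken.

def check_monotonic(samples, field):
    """Return indices of samples where 'field' decreased versus the
    previous sample."""
    bad = []
    for i in range(1, len(samples)):
        if samples[i][field] < samples[i - 1][field]:
            bad.append(i)
    return bad

# Illustrative stand-in samples (real ones come from parsing /proc/vmstat):
samples = [
    {"pgalloc_normal": 100000},
    {"pgalloc_normal": 177526},
    {"pgalloc_normal": 150000},  # decreased -> broken sample
    {"pgalloc_normal": 199179},
]
print(check_monotonic(samples, "pgalloc_normal"))  # [2]
```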