We have set up a server (FreeBSD 11.0-RELEASE-p2) that hosts around 150-200 jails. The server has 24 cores and 192 GB of RAM. top shows no sign of stress, except for the high load. All jails reside on NFS mounts, and each jail mounts its own directory upon creation.
The server does not feel slow in any way; it's rather snappy. The one thing that bothers us is the high load average.
Output from top:
last pid: 71841; load averages: 320.13, 131.33, 79.28 up 27+17:45:03 10:37:48
5325 processes:1 running, 5324 sleeping
CPU: 4.4% user, 0.0% nice, 1.6% system, 0.4% interrupt, 93.6% idle
Mem: 3116M Active, 23G Inact, 23G Wired, 900M Buf, 138G Free
ARC: 10G Total, 2612M MFU, 4553M MRU, 37M Anon, 89M Header, 2742M Other
Swap: 4096M Total, 4096M Free
As you can see, the load is high, yet 138G of memory is free and the CPU is 94% idle.
Output from systat -vmstat
3 users Load 92.59 105 73.97 Feb 1 10:39
Mem usage: 26%Phy 6%Kmem
Mem: KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 21491k 223884 120800k 555864 144668k count
All 22230k 836948 142997k 4351592 pages
Proc: Interrupts
r p d s w Csw Trp Sys Int Sof Flt ioflt 3595 total
104 5k 13k 5848 20k 1362 127 1646 147 cow atkbd0 1
730 zfod 1 ata1 15
1.8%Sys 0.3%Intr 3.0%User 0.0%Nice 94.9%Idle ozfod ohci0 ohci
| | | | | | | | | | %ozfod ehci0 ohci
=>> daefr 107 cpu0:timer
dtbuf 622 prcfr 722 bce0 259
Namei Name-cache Dir-cache 3237762 desvn 2014 totfr 619 bce1 260
Calls hits % hits % 3237760 numvn react pcib7 263
41265 41201 100 2713450 frevn pdwak 21 mps0 264
1290 pdpgs ciss0 265
Disks da0 da1 cd0 pass0 pass1 pass2 intrn 74 cpu13:time
KB/t 13.33 14.76 0.00 0.00 0.00 0.00 24315624 wire 112 cpu4:timer
tps 10 17 0 0 0 0 3192008 act 147 cpu2:timer
MB/s 0.14 0.24 0.00 0.00 0.00 0.00 23921440 inact 54 cpu3:timer
%busy 0 0 0 0 0 0 cache 132 cpu5:timer
144669k free 52 cpu1:timer
921954 68 cpu19:time
99 cpu21:time
54 cpu20:time
59 cpu18:time
59 cpu22:time
82 cpu23:time
67 cpu12:time
68 cpu6:timer
79 cpu14:time
88 cpu15:time
111 cpu16:time
93 cpu17:time
49 cpu8:timer
251 cpu7:timer
102 cpu9:timer
176 cpu10:time
49 cpu11:time
As far as I can tell, nothing looks really strange there either. Sure, there are some interrupts, but from what I found by googling, the amount we get is nothing compared to what people with real interrupt problems see, which is more along the lines of 350,000 interrupts.
iostat -w 1
tty da0 da1 cd0 cpu
tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id
1 571 14.51 11 0.15 14.56 11 0.15 0.00 0 0.00 1 0 1 0 99
0 231 10.29 90 0.90 11.26 102 1.12 0.00 0 0.00 3 0 1 0 95
0 78 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 3 0 1 0 96
0 78 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 7 0 1 0 92
0 79 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 3 0 2 0 95
0 78 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 6 0 2 0 93
0 77 13.63 128 1.71 11.97 123 1.44 0.00 0 0.00 2 0 2 0 96
0 79 36.00 1 0.04 14.86 7 0.10 0.00 0 0.00 2 0 1 0 97
0 78 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 4 0 2 0 94
0 76 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 4 0 2 0 94
0 80 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 2 0 1 0 97
0 75 9.98 117 1.15 18.43 129 2.32 0.00 0 0.00 3 0 1 0 96
0 81 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 4 0 2 0 94
0 78 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 2 0 1 0 96
vmstat -w 1
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 da1 in sy cs us sy id
3 0 0 115G 138G 297 0 2 0 653 373 0 0 224 59 1405 1 1 99
2 0 0 115G 138G 75 0 0 0 2017 1368 118 109 2299 23370 18920 6 2 92
2 0 0 115G 138G 1397 0 2 0 2839 1434 0 0 2665 30985 23294 5 4 91
2 0 0 115G 138G 1113 0 0 0 666 1373 0 0 2222 23078 17157 5 2 93
1 0 0 115G 138G 7 0 0 0 597 1368 0 0 590 18529 10477 2 1 96
1 0 0 115G 138G 0 0 2 0 194 2773 83 81 1269 26734 19190 3 3 94
1 0 0 115G 138G 9 0 0 0 90 1404 0 0 833 18907 11455 2 2 96
2 0 0 115G 138G 13 0 0 0 1309 1374 0 0 3185 25773 20054 3 3 94
1 0 0 115G 138G 1419 0 0 0 2750 1369 0 0 3899 25403 23252 7 4 90
0 0 0 115G 138G 776 0 1 0 164 1368 75 58 837 26261 16368 3 3 94
1 0 0 115G 138G 2336 0 5 0 2562 1367 0 0 1337 23287 13288 3 3 94
0 0 0 115G 138G 560 0 0 0 1193 2785 0 0 608 27176 14512 5 5 90
1 0 0 115G 138G 0 0 2 0 249 1369 0 0 702 18533 10700 1 2 97
1 0 0 115G 138G 3290 0 0 0 2313 1369 91 96 1461 22049 14726 6 3 91
As for NFS, I really don't know how to look for problems there, but here is the output from
nfsstat -c
Client Info:
Rpc Counts:
Getattr Setattr Lookup Readlink Read Write Create Remove
44956931 1020943 93567574 167 23609403 879028 514647 665228
Rename Link Symlink Mkdir Rmdir Readdir RdirPlus Access
36867 1387 1 24655 21955 6118822 0 26166205
Mknod Fsstat Fsinfo PathConf Commit
0 5489407 1 2270 830867
Rpc Info:
TimedOut Invalid X Replies Retries Requests
0 0 0 0 203906224
Cache Info:
Attr Hits Misses Lkup Hits Misses BioR Hits Misses BioW Hits Misses
-719986429 44956925 -1243965171 93531884 66678251 22460288 981123 879028
BioRLHits Misses BioD Hits Misses DirE Hits Misses Accs Hits Misses
144 167 14572148 5721030 5124486 1455 -1123294109 26165764
and from
nfsstat -w 1 -c
GtAttr Lookup Rdlink Read Write Rename Access Rddir
5 0 0 5 0 0 0 2
9 342 0 9 0 0 42 9
12 91 0 21 0 0 21 4
0 2 0 0 0 0 2 0
0 1 0 0 0 0 0 0
0 5 0 0 0 0 2 0
5 124 0 5 0 0 0 2
6 12 0 5 0 0 12 2
4 0 0 5 0 0 0 2
9 0 0 10 0 0 0 4
4 0 0 5 0 0 0 2
50 1 0 14 0 0 0 7
and finally output from
systat -ifstat
/0 /1 /2 /3 /4 /5 /6 /7 /8 /9 /10
Load Average <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 29.6
Interface Traffic Peak Total
lo0 in 34.285 KB/s 291.936 KB/s 69.263 GB
out 34.285 KB/s 291.936 KB/s 69.263 GB
bce1 in 792.808 KB/s 5.382 MB/s 707.266 GB
out 56.828 KB/s 238.912 KB/s 91.154 GB
bce0 in 21.711 KB/s 21.711 KB/s 17.338 GB
out 13.799 KB/s 287.402 KB/s 64.000 GB
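Going back to the Rpc Info section of the nfsstat -c output: one rough way to read those counters is to express timeouts and retries as a fraction of all requests. The following is only a minimal sketch in Python; the function name rpc_health is made up for illustration, and the numbers are copied by hand from the output above.

```python
def rpc_health(timed_out, retries, requests):
    """Return (timeout_ratio, retry_ratio) for NFS client RPC counters."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return timed_out / requests, retries / requests


# Values copied from the "Rpc Info" section of the nfsstat -c output.
timeout_ratio, retry_ratio = rpc_health(timed_out=0, retries=0, requests=203906224)
print(f"timeouts: {timeout_ratio:.6%}, retries: {retry_ratio:.6%}")
```

With zero timeouts and zero retries both ratios come out as zero, so at least these counters do not point at lost or retransmitted RPCs.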
As requested dmesg:
[larsemil@prison01 ~]$ dmesg
Limiting open port RST response from 213 to 200 packets/sec
Limiting open port RST response from 2636 to 200 packets/sec
pid 22548 (php-fpm), uid 10000: exited on signal 11
pid 26938 (wkhtmltopdf), uid 10000: exited on signal 6 (core dumped)
[zone: pf states] PF states limit reached
Limiting icmp ping response from 9592 to 200 packets/sec
Limiting icmp ping response from 611 to 200 packets/sec
Limiting icmp ping response from 1792 to 200 packets/sec
Limiting icmp ping response from 2650 to 200 packets/sec
Limiting icmp ping response from 316 to 200 packets/sec
Limiting icmp ping response from 1758 to 200 packets/sec
Limiting icmp ping response from 2478 to 200 packets/sec
Limiting icmp ping response from 578 to 200 packets/sec
Limiting icmp ping response from 2028 to 200 packets/sec
Limiting icmp ping response from 3175 to 200 packets/sec
Limiting icmp ping response from 245 to 200 packets/sec
Limiting icmp ping response from 536 to 200 packets/sec
Limiting icmp ping response from 229 to 200 packets/sec
Limiting icmp ping response from 546 to 200 packets/sec
Limiting icmp ping response from 2239 to 200 packets/sec
Limiting icmp ping response from 3414 to 200 packets/sec
Limiting icmp ping response from 3033 to 200 packets/sec
Limiting icmp ping response from 1018 to 200 packets/sec
Limiting icmp ping response from 270 to 200 packets/sec
pid 34239 (php-fpm), uid 10000: exited on signal 11
pid 68427 (php-fpm), uid 10000: exited on signal 11
Any ideas are welcome!
Best Answer
Can you post dmesg output and any log messages from /var/log/messages?
What I see is that you have a 192 GB RAM machine that is trying to do everything in 3 GB of RAM... it is probably swapping furiously.
Mem: 3116M Active, 23G Inact, 23G Wired, 900M Buf, 138G Free
ARC: 10G Total, 2612M MFU, 4553M MRU, 37M Anon, 89M Header, 2742M Other
Free RAM is bad. You need to use the RAM in the machine. Please post the output of sysctl vfs.zfs.arc_max. Check here for ZFS tuning for the ARC.
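To put numbers on how the RAM is split, the Mem: line from top can be broken down into shares. This is just a sketch in Python, not a FreeBSD tool: parse_top_mem is a hypothetical helper, and leaving Buf out of the total is an assumption (that it overlaps the other categories rather than adding to them).

```python
import re

UNITS = {"K": 1 / 1024, "M": 1, "G": 1024}  # multipliers to convert to MiB


def parse_top_mem(line):
    """Parse a FreeBSD top 'Mem:' line into {category: size in MiB}."""
    fields = re.findall(r"(\d+)([KMG])\s+(\w+)", line)
    return {name: int(value) * UNITS[unit] for value, unit, name in fields}


mem = parse_top_mem("Mem: 3116M Active, 23G Inact, 23G Wired, 900M Buf, 138G Free")

# Assumption: Buf overlaps other categories, so it is excluded from the total.
total = sum(size for name, size in mem.items() if name != "Buf")

for name in ("Active", "Inact", "Wired", "Free"):
    print(f"{name:>6}: {mem[name] / total:6.1%}")
```

On the figures posted in the question, roughly three quarters of the memory ends up in the Free column.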
Jails themselves do basically nothing. Processes in the jails will show up in top if they are running - looks like not much is going on.
FreeBSD top is different, yes: the LA (load average) should be read relative to the number of cores (24). Your LA is high, but this is only because something cannot get the memory it needs.
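Reading the LA relative to the core count amounts to a simple division. A small Python sketch of that, using the header line from the top output in the question; load_per_core is an illustrative name, not an existing utility.

```python
import re


def load_per_core(top_header, cores):
    """Return the 1-, 5- and 15-minute load averages divided by the core count."""
    match = re.search(
        r"load averages:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header
    )
    if match is None:
        raise ValueError("no load averages found in header")
    return tuple(float(value) / cores for value in match.groups())


header = "last pid: 71841; load averages: 320.13, 131.33, 79.28 up 27+17:45:03 10:37:48"
one, five, fifteen = load_per_core(header, cores=24)
print(f"per-core load: {one:.2f} (1 min), {five:.2f} (5 min), {fifteen:.2f} (15 min)")
```

Even divided by 24, the 1-minute figure stays well above 1 per core, which is what makes it stand out next to a CPU that is about 94% idle.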