FreeBSD shows high load, cannot find bottleneck

We have set up a server (11.0-RELEASE-p2) that hosts around 150-200 jails. The server has 24 cores and 192 GB of RAM. top shows no sign of stress, except for the high load. All jails reside on NFS mounts, and each jail mounts its own directory upon creation.
The server does not feel slow in any way; it's rather snappy. The one thing that bothers us is the high load we get.

Output from top:

last pid: 71841;  load averages: 320.13, 131.33, 79.28 up 27+17:45:03  10:37:48
5325 processes:1 running, 5324 sleeping
CPU:  4.4% user,  0.0% nice,  1.6% system,  0.4% interrupt, 93.6% idle
Mem: 3116M Active, 23G Inact, 23G Wired, 900M Buf, 138G Free
ARC: 10G Total, 2612M MFU, 4553M MRU, 37M Anon, 89M Header, 2742M Other
Swap: 4096M Total, 4096M Free

As you can see, the load is high, memory has 138G free and the CPU is 94% idle.

Output from systat -vmstat

  3 users    Load 92.59   105 73.97                  Feb  1 10:39
   Mem usage:  26%Phy  6%Kmem
Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  21491k  223884  120800k   555864 144668k  count
All  22230k  836948  142997k  4351592          pages
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt  3595 total
104          5k       13k 5848  20k 1362  127 1646    147 cow         atkbd0 1
                                                      730 zfod      1 ata1 15
 1.8%Sys   0.3%Intr  3.0%User  0.0%Nice 94.9%Idle         ozfod       ohci0 ohci
|    |    |    |    |    |    |    |    |    |           %ozfod       ehci0 ohci
=>>                                                       daefr   107 cpu0:timer
                                           dtbuf      622 prcfr   722 bce0 259
Namei     Name-cache   Dir-cache   3237762 desvn     2014 totfr   619 bce1 260
   Calls    hits   %    hits   %   3237760 numvn          react       pcib7 263
   41265   41201 100               2713450 frevn          pdwak    21 mps0 264
                                                     1290 pdpgs       ciss0 265
Disks   da0   da1   cd0 pass0 pass1 pass2                 intrn    74 cpu13:time
KB/t  13.33 14.76  0.00  0.00  0.00  0.00        24315624 wire    112 cpu4:timer
tps      10    17     0     0     0     0         3192008 act     147 cpu2:timer
MB/s   0.14  0.24  0.00  0.00  0.00  0.00        23921440 inact    54 cpu3:timer
%busy     0     0     0     0     0     0                 cache   132 cpu5:timer
                                                  144669k free     52 cpu1:timer
                                                   921954          68 cpu19:time
                                                                   99 cpu21:time
                                                                   54 cpu20:time
                                                                   59 cpu18:time
                                                                   59 cpu22:time
                                                                   82 cpu23:time
                                                                   67 cpu12:time
                                                                   68 cpu6:timer
                                                                   79 cpu14:time
                                                                   88 cpu15:time
                                                                  111 cpu16:time
                                                                   93 cpu17:time
                                                                   49 cpu8:timer
                                                                  251 cpu7:timer
                                                                  102 cpu9:timer
                                                                  176 cpu10:time
                                                                   49 cpu11:time

As far as I can tell, nothing looks really strange there either. Sure, there are some interrupts, but from what I found by googling, the number we see is nothing compared to what people report when they actually have interrupt problems, which is more along the lines of 350,000 interrupts.

iostat -w 1

      tty             da0              da1              cd0             cpu
 tin  tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   1   571 14.51  11  0.15  14.56  11  0.15   0.00   0  0.00   1  0  1  0 99
   0   231 10.29  90  0.90  11.26 102  1.12   0.00   0  0.00   3  0  1  0 95
   0    78  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   3  0  1  0 96
   0    78  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   7  0  1  0 92
   0    79  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   3  0  2  0 95
   0    78  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   6  0  2  0 93
   0    77 13.63 128  1.71  11.97 123  1.44   0.00   0  0.00   2  0  2  0 96
   0    79 36.00   1  0.04  14.86   7  0.10   0.00   0  0.00   2  0  1  0 97
   0    78  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   4  0  2  0 94
   0    76  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   4  0  2  0 94
   0    80  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   2  0  1  0 97
   0    75  9.98 117  1.15  18.43 129  2.32   0.00   0  0.00   3  0  1  0 96
   0    81  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   4  0  2  0 94
   0    78  0.00   0  0.00   0.00   0  0.00   0.00   0  0.00   2  0  1  0 96

vmstat -w 1

procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr da0 da1   in    sy    cs us sy id
3 0 0 115G  138G   297   0   2   0   653  373   0   0  224    59  1405  1  1 99
2 0 0 115G  138G    75   0   0   0  2017 1368 118 109 2299 23370 18920  6  2 92
2 0 0 115G  138G  1397   0   2   0  2839 1434   0   0 2665 30985 23294  5  4 91
2 0 0 115G  138G  1113   0   0   0   666 1373   0   0 2222 23078 17157  5  2 93
1 0 0 115G  138G     7   0   0   0   597 1368   0   0  590 18529 10477  2  1 96
1 0 0 115G  138G     0   0   2   0   194 2773  83  81 1269 26734 19190  3  3 94
1 0 0 115G  138G     9   0   0   0    90 1404   0   0  833 18907 11455  2  2 96
2 0 0 115G  138G    13   0   0   0  1309 1374   0   0 3185 25773 20054  3  3 94
1 0 0 115G  138G  1419   0   0   0  2750 1369   0   0 3899 25403 23252  7  4 90
0 0 0 115G  138G   776   0   1   0   164 1368  75  58  837 26261 16368  3  3 94
1 0 0 115G  138G  2336   0   5   0  2562 1367   0   0 1337 23287 13288  3  3 94
0 0 0 115G  138G   560   0   0   0  1193 2785   0   0  608 27176 14512  5  5 90
1 0 0 115G  138G     0   0   2   0   249 1369   0   0  702 18533 10700  1  2 97
1 0 0 115G  138G  3290   0   0   0  2313 1369  91  96 1461 22049 14726  6  3 91

As for NFS, I really don't know how to look for problems there. But here is the output from

nfsstat -c

Client Info:
Rpc Counts:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
 44956931   1020943  93567574       167  23609403    879028    514647    665228
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
    36867      1387         1     24655     21955   6118822         0  26166205
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0   5489407         1      2270    830867
Rpc Info:
 TimedOut   Invalid X Replies   Retries  Requests
        0         0         0         0 203906224
Cache Info:
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
-719986429  44956925 -1243965171  93531884  66678251  22460288    981123    879028
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
      144       167  14572148   5721030   5124486      1455 -1123294109  26165764

and from

nfsstat -w 1 -c

GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir
      5      0      0      5      0      0      0      2
      9    342      0      9      0      0     42      9
     12     91      0     21      0      0     21      4
      0      2      0      0      0      0      2      0
      0      1      0      0      0      0      0      0
      0      5      0      0      0      0      2      0
      5    124      0      5      0      0      0      2
      6     12      0      5      0      0     12      2
      4      0      0      5      0      0      0      2
      9      0      0     10      0      0      0      4
      4      0      0      5      0      0      0      2
     50      1      0     14      0      0      0      7

and finally output from

systat -ifstat

                /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
 Load Average   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 29.6

  Interface           Traffic               Peak                Total
        lo0  in     34.285 KB/s        291.936 KB/s           69.263 GB
             out    34.285 KB/s        291.936 KB/s           69.263 GB

       bce1  in    792.808 KB/s          5.382 MB/s          707.266 GB
             out    56.828 KB/s        238.912 KB/s           91.154 GB

       bce0  in     21.711 KB/s         21.711 KB/s           17.338 GB
             out    13.799 KB/s        287.402 KB/s           64.000 GB

As requested dmesg:

[larsemil@prison01 ~]$ dmesg
Limiting open port RST response from 213 to 200 packets/sec
Limiting open port RST response from 2636 to 200 packets/sec
pid 22548 (php-fpm), uid 10000: exited on signal 11
pid 26938 (wkhtmltopdf), uid 10000: exited on signal 6 (core dumped)
[zone: pf states] PF states limit reached
Limiting icmp ping response from 9592 to 200 packets/sec
Limiting icmp ping response from 611 to 200 packets/sec
Limiting icmp ping response from 1792 to 200 packets/sec
Limiting icmp ping response from 2650 to 200 packets/sec
Limiting icmp ping response from 316 to 200 packets/sec
Limiting icmp ping response from 1758 to 200 packets/sec
Limiting icmp ping response from 2478 to 200 packets/sec
Limiting icmp ping response from 578 to 200 packets/sec
Limiting icmp ping response from 2028 to 200 packets/sec
Limiting icmp ping response from 3175 to 200 packets/sec
Limiting icmp ping response from 245 to 200 packets/sec
Limiting icmp ping response from 536 to 200 packets/sec
Limiting icmp ping response from 229 to 200 packets/sec
Limiting icmp ping response from 546 to 200 packets/sec
Limiting icmp ping response from 2239 to 200 packets/sec
Limiting icmp ping response from 3414 to 200 packets/sec
Limiting icmp ping response from 3033 to 200 packets/sec
Limiting icmp ping response from 1018 to 200 packets/sec
Limiting icmp ping response from 270 to 200 packets/sec
pid 34239 (php-fpm), uid 10000: exited on signal 11
pid 68427 (php-fpm), uid 10000: exited on signal 11

Any ideas are welcome!

Best Answer

Can you post dmesg output and any log messages from /var/log/messages?

What I see is that you have a machine with 192 GB of RAM that is trying to do everything in 3 GB of RAM... it is probably swapping furiously.

Mem: 3116M Active, 23G Inact, 23G Wired, 900M Buf, 138G Free
ARC: 10G Total, 2612M MFU, 4553M MRU, 37M Anon, 89M Header, 2742M Other
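To put those numbers in proportion, here is a minimal Python sketch that parses the Mem line quoted above and expresses each bucket as a share of physical RAM. The helper name `parse_top_mem` is made up for illustration, and the total is assumed to be 192 GiB as stated in the question.

```python
import re

# Conversion factors to MiB for the unit suffixes that top prints.
UNIT_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def parse_top_mem(line):
    """Return {label: size in MiB} parsed from a FreeBSD top 'Mem:' line."""
    fields = re.findall(r"(\d+)([KMG])\s+(\w+)", line)
    return {label: int(num) * UNIT_MIB[unit] for num, unit, label in fields}


mem = parse_top_mem("Mem: 3116M Active, 23G Inact, 23G Wired, 900M Buf, 138G Free")

# Assumption: physical RAM is 192 GiB, taken from the question text.
total_mib = 192 * 1024
shares = {label: mib / total_mib for label, mib in mem.items()}

for label, share in shares.items():
    print(f"{label:7s} {mem[label]:>9.0f} MiB  {share:6.1%}")
```

The Active bucket comes out at under 2% of RAM, while Free is roughly 72%.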

Free RAM is bad. You need to use the RAM in the machine. Please post the output of sysctl vfs.zfs.arc_max, and check the ZFS tuning documentation for the ARC.

Jails themselves do basically nothing. Processes in the jails will show up in top if they are running; it looks like not much is going on.
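If you want to see which jails have runnable processes at a given moment, something like the following Python sketch could help. It assumes you feed it the text of `ps -ax -o jid,state,comm`; the function name `runnable_by_jail` and the sample listing are hypothetical, not taken from your machine.

```python
from collections import Counter


def runnable_by_jail(ps_output):
    """Count processes whose state starts with 'R' (runnable), grouped by jail ID."""
    counts = Counter()
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        jid, state = parts[0], parts[1]
        if state.startswith("R"):
            counts[jid] += 1
    return counts


# Hypothetical sample in the shape of `ps -ax -o jid,state,comm` output.
SAMPLE = """\
JID STAT COMMAND
  0 Ss   sshd
 12 R    php-fpm
 12 S    nginx
 37 R+   wkhtmltopdf
 12 RN   php-fpm
"""

for jid, n in runnable_by_jail(SAMPLE).most_common():
    print(f"jail {jid}: {n} runnable")
```

Sampling this a few times while the load is climbing should show whether the runnable processes cluster in a handful of jails.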

FreeBSD's top is different, yes: the load average (LA) should be read relative to the number of cores (24). Your LA is high, but this is only because something cannot get the memory it needs.
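Reading the load average relative to the core count is just a division. Here is a small Python sketch using the three values from your top output and the 24 cores you mention; `load_per_core` is an illustrative name, not an existing tool.

```python
def load_per_core(load_averages, ncpu):
    """Divide each load average by the number of cores."""
    if ncpu <= 0:
        raise ValueError("ncpu must be positive")
    return [la / ncpu for la in load_averages]


# Values from the top header in the question: 1, 5 and 15 minute averages.
per_core = load_per_core([320.13, 131.33, 79.28], 24)

for window, value in zip(("1 min", "5 min", "15 min"), per_core):
    print(f"{window:>6}: {value:.2f} per core")
```

Anything well above 1.0 per core means more work is queued than the cores could run at once. The 1 minute figure here is above 13 per core, which does not fit a CPU that is 94% idle.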
