NFS low performance after some activity


I have a Linux cluster whose nodes mount NFS shares from a central server (the nodes are in fact diskless and boot over PXE). After some activity on the NFS mount points from the nodes, NFS seems to slow down drastically: for example, ssh logins take minutes, and programs that depend on files on the NFS share take minutes to start.

Restarting the NFS service on the server and/or rebooting the problematic node(s) solves the problem for a short time, but it always shows up again soon. (Doing both seems to help a bit longer.)

Server and nodes run CentOS 7.4 with Linux kernel 3.10.0-693.el7.x86_64, and NFSv4 is used. The storage consists of 4 HDDs bundled as a RAID10 (/dev/sda). The network connection between the server and each node is 1 Gbit/s, and there is no evidence of dropped packets so far.

What could cause NFS to respond very slowly in a way that depends on earlier activity?

Shortened output of nfsstat on a node while the filesystem is responding slowly:

Client rpc stats:

calls | retrans | authrefrsh
44154157 | 0 | 44154258

Client nfs v4:

null | read | write | commit | open | open_conf
0 0% | 58125 0% | 422038 1% | 6846 0% | 139899 0% | 0 0%

open_noat | open_dgrd | close | setattr | fsinfo | renew
30775986 95% | 144 0% | 70464 0% | 2639 0% | 9 0% | 0

Output of nfsiostat looks like (for fast nfs):

op/s | rpc bklog
3596.86 | 0.00

read: ops/s | kB/s | kB/op | retrans | avg RTT (ms) | avg exe (ms)
0.224 | 0.289 | 1.292 | 0 (0.0%) | 0.441 | 1.151

write: ops/s | kB/s | kB/op | retrans | avg RTT (ms) | avg exe (ms)
33.837 | 47.329 | 1.399 | 0 (0.0%) | 0.452 | 1.406

Output of nfsiostat looks like (for slow nfs):

op/s | rpc bklog
183.75 | 0.00

read: ops/s | kB/s | kB/op | retrans | avg RTT (ms) | avg exe (ms)
0.012 | 1.158 | 99.426 | 0 (0.0%) | 2.708 | 16.656

write: ops/s | kB/s | kB/op | retrans | avg RTT (ms) | avg exe (ms)
0.295 | 1.882 | 6.387 | 2 (0.0%) | 0.448 | 0.560

Here we see much lower ops/s, higher kB/op, and, for reads, a much higher avg exe time.
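To put numbers on that comparison, here is a minimal sketch that computes the slow/fast ratio for a few of the nfsiostat figures quoted above. The dictionary keys and the `ratio` helper are names made up for illustration; the values are copied from the two samples, which are single snapshots and so only give a rough indication.

```python
# Figures copied from the two nfsiostat samples above (fast vs. slow NFS).
# The key names are illustrative, not nfsiostat field names.
fast = {
    "ops_per_s": 3596.86,
    "read_kb_per_op": 1.292,
    "read_exe_ms": 1.151,
    "write_exe_ms": 1.406,
}
slow = {
    "ops_per_s": 183.75,
    "read_kb_per_op": 99.426,
    "read_exe_ms": 16.656,
    "write_exe_ms": 0.560,
}


def ratio(metric):
    """Return the slow/fast ratio for one metric."""
    return slow[metric] / fast[metric]


for metric in fast:
    print(f"{metric}: slow/fast = {ratio(metric):.2f}")
```

In these samples the total op rate drops to roughly a twentieth and the read avg exe time grows by more than ten times, while the write avg exe time is actually lower in the slow sample.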

iostat on the central server (when everything works fine):

Linux 3.10.0-693.el7.x86_64 09/27/2019 _x86_64_ (4 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
0.48 0.00 0.37 0.02 0.00 99.12

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdb 10.83 40.61 67.55 9423785 15673740
sda 0.71 5.67 2.54 1315496 590208
sdc 10.47 18.96 67.55 4398709 15673740
md127 0.00 0.12 0.00 27241 80
md126 10.83 59.42 66.92 13787337 15526832
md125 0.00 0.01 0.00 2228 0

and the same when everything is slow (with no large differences, however):

Linux 3.10.0-693.el7.x86_64 10/14/2019 _x86_64_ (4 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
2.94 0.00 1.03 0.01 0.00 96.02

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdb 15.05 261.13 52.53 449712785 90460908
sda 0.54 7.23 35.45 12443668 61054912
sdc 14.97 257.76 52.53 443917089 90460908
md127 0.00 0.02 0.00 27241 112
md126 11.57 8.68 51.72 14953949 89075284
md125 0.00 0.00 0.00 2228 8

Please tell me if you need any further information.

Best Answer

My first thought is that you might want to check your iostat output; beyond that, it starts to sound like a caching issue.