NFS – Find out what nfsd processes are actually doing

Tags: monitoring, nfs, process, top, ubuntu-12.04

When I run top on one of our servers, there are a lot of nfsd processes consuming CPU:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
2769  root      20   0     0    0    0 R   20  0.0   2073:14 nfsd
2774  root      20   0     0    0    0 S   19  0.0   2058:44 nfsd
2767  root      20   0     0    0    0 S   18  0.0   2092:54 nfsd
2768  root      20   0     0    0    0 S   18  0.0   2076:56 nfsd
2771  root      20   0     0    0    0 S   17  0.0   2094:25 nfsd
2773  root      20   0     0    0    0 S   14  0.0   2091:34 nfsd
2772  root      20   0     0    0    0 S   14  0.0   2083:43 nfsd
2770  root      20   0     0    0    0 S   12  0.0   2077:59 nfsd

How do I find out what these are actually doing? Can I see a list of files being accessed by each PID, or any more info?

We're on Ubuntu Server 12.04.

I tried nfsstat but it's not giving me much useful info about what's actually going on.
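
One thing nfsstat can at least show is which operation types dominate, if you watch the counters change over time. A minimal sketch (the 2-second interval is arbitrary; nfsstat -s limits output to the server side):

# Highlight server-side per-operation counters as they change
watch -n 2 -d nfsstat -s

# The raw counters behind nfsstat, if you prefer to diff them yourself
cat /proc/net/rpc/nfsd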

Edit – Additional stuff tried based on comments/answers:

Running lsof -p on each of the PIDs gives the same kind of output; for example, lsof -p 2774 shows:

COMMAND  PID USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
nfsd    2774 root  cwd       DIR    8,1     4096    2 /
nfsd    2774 root  rtd       DIR    8,1     4096    2 /
nfsd    2774 root  txt   unknown                      /proc/2774/exe

Does that mean no files are being accessed?
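
(For context: nfsd runs as kernel threads, which hold no user-space file descriptors, so lsof has nothing to list; the txt unknown /proc/2774/exe line is typical of kernel threads. A quick sketch to confirm that a PID is a kernel thread; column sets vary slightly between ps versions:)

# Kernel threads report their name in square brackets and have no command line
ps -o pid,stat,wchan:20,args -p 2774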


When I try to trace a process with strace -f -p 2774, it gives me this error:

attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
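
(That failure is expected for the same reason: ptrace cannot attach to kernel threads, even as root. One thing that can still work is sampling the kernel-side stack via procfs; a sketch, assuming the kernel exposes /proc/<pid>/stack, which needs root and a kernel built with stack-trace support:)

# Sample the kernel stack of one nfsd thread a few times to see where it blocks
for i in 1 2 3 4 5; do cat /proc/2774/stack; echo ---; sleep 1; done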

Running tcpdump | grep nfs shows tons of activity between two of our servers over NFS, even though, as far as I'm aware, there shouldn't be any. There are a lot of entries like:

13:56:41.120020 IP 192.168.0.20.nfs > 192.168.0.21.729: Flags [.], ack 4282288820, win 32833, options [nop,nop,TS val 627282027 ecr 263985319,nop,nop,sack 3 {4282317780:4282319228}{4282297508:4282298956}{4282290268:4282291716}], len
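
To see which clients account for most of that traffic, one rough approach is to tally source addresses over a fixed packet sample; a sketch (the 2000-packet sample size is arbitrary):

# Count NFS packets per source address.port over a 2000-packet sample
tcpdump -n -c 2000 port 2049 2>/dev/null | awk '{print $3}' | sort | uniq -c | sort -rn | head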

Best Answer

In this kind of situation I have often found it very useful to capture the NFS traffic (e.g., with tcpdump or Wireshark) and look at it to see whether there is a specific cause for the high load.

For example, you can use something like:

tcpdump -w filename.cap "port 2049"

to save only the NFS traffic (on port 2049) to a capture file. You can then open that file on a PC with Wireshark and analyze it in more detail. The last time I had a similar problem, it turned out to be a bunch of computation jobs from the same user, who was over disk quota: the clients (18 different machines) kept retrying their writes over and over, driving the load on the old NFS server very high.
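
If a GUI is inconvenient, Wireshark's command-line companion tshark can summarize the same capture file; a sketch, assuming tshark is installed (older versions take -R instead of -Y for the display filter):

# Per-program / per-procedure summary of the RPC calls in the capture
tshark -r filename.cap -q -z rpc,programs

# Or list the decoded NFS operations one per line
tshark -r filename.cap -Y nfs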