NFS – Find out what nfsd processes are actually doing

Tags: monitoring, nfs, process, top, ubuntu-12.04

When I run top on one of our servers, there are a lot of nfsd processes consuming CPU:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
2769  root      20   0     0    0    0 R   20  0.0   2073:14 nfsd
2774  root      20   0     0    0    0 S   19  0.0   2058:44 nfsd
2767  root      20   0     0    0    0 S   18  0.0   2092:54 nfsd
2768  root      20   0     0    0    0 S   18  0.0   2076:56 nfsd
2771  root      20   0     0    0    0 S   17  0.0   2094:25 nfsd
2773  root      20   0     0    0    0 S   14  0.0   2091:34 nfsd
2772  root      20   0     0    0    0 S   14  0.0   2083:43 nfsd
2770  root      20   0     0    0    0 S   12  0.0   2077:59 nfsd

How do I find out what these are actually doing? Can I see a list of files being accessed by each PID, or any more info?

We're on Ubuntu Server 12.04.

I tried nfsstat but it's not giving me much useful info about what's actually going on.
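
One thing nfsstat can at least show is which operation types dominate, if you watch the counters change over time. A minimal sketch (the 2-second interval is arbitrary; nfsstat -s limits output to the server side):

# Highlight server-side per-operation counters as they change
watch -n 2 -d nfsstat -s

# The raw counters behind nfsstat, if you prefer to diff them yourself
cat /proc/net/rpc/nfsd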

Edit – Additional stuff tried based on comments/answers:

Running lsof -p on each of the PIDs gives the same kind of output; for example, lsof -p 2774 shows:

COMMAND  PID USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
nfsd    2774 root  cwd       DIR    8,1     4096    2 /
nfsd    2774 root  rtd       DIR    8,1     4096    2 /
nfsd    2774 root  txt   unknown                      /proc/2774/exe

Does that mean no files are being accessed?
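
(For context: nfsd runs as kernel threads, which hold no user-space file descriptors, so lsof has nothing to list; the txt unknown /proc/2774/exe line is typical of kernel threads. A quick sketch to confirm that a PID is a kernel thread; column sets vary slightly between ps versions:)

# Kernel threads report their name in square brackets and have no command line
ps -o pid,stat,wchan:20,args -p 2774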


When I try to trace a process with strace -f -p 2774, it gives me this error:

attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
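
(That failure is expected for the same reason: ptrace cannot attach to kernel threads, even as root. One thing that can still work is sampling the kernel-side stack via procfs; a sketch, assuming the kernel exposes /proc/<pid>/stack, which needs root and a kernel built with stack-trace support:)

# Sample the kernel stack of one nfsd thread a few times to see where it blocks
for i in 1 2 3 4 5; do cat /proc/2774/stack; echo ---; sleep 1; done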

Running tcpdump | grep nfs shows tons of activity between two of our servers over NFS, even though, as far as I'm aware, there shouldn't be any. There are a lot of entries like:

13:56:41.120020 IP 192.168.0.20.nfs > 192.168.0.21.729: Flags [.], ack 4282288820, win 32833, options [nop,nop,TS val 627282027 ecr 263985319,nop,nop,sack 3 {4282317780:4282319228}{4282297508:4282298956}{4282290268:4282291716}], len
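
To see which clients account for most of that traffic, one rough approach is to tally source addresses over a fixed packet sample; a sketch (the 2000-packet sample size is arbitrary):

# Count NFS packets per source address.port over a 2000-packet sample
tcpdump -n -c 2000 port 2049 2>/dev/null | awk '{print $3}' | sort | uniq -c | sort -rn | head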

Best Answer

In this kind of situation I have often found it very useful to capture the NFS traffic (e.g., with tcpdump or Wireshark) and look at it to see whether there is a specific cause for the high load.

For example, you can use something like:

tcpdump -w filename.cap "port 2049"

to save only the NFS traffic (on port 2049) to a capture file. You can then open that file on a PC with Wireshark and analyze it in more detail. The last time I had a similar problem, it turned out to be a bunch of computation jobs from the same user, who was over disk quota: the clients (18 different machines) kept retrying their writes over and over, driving the load on the old NFS server very high.
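
If a GUI is inconvenient, Wireshark's command-line companion tshark can summarize the same capture file; a sketch, assuming tshark is installed (older versions take -R instead of -Y for the display filter):

# Per-program / per-procedure summary of the RPC calls in the capture
tshark -r filename.cap -q -z rpc,programs

# Or list the decoded NFS operations one per line
tshark -r filename.cap -Y nfs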