Linux – How to monitor NFS load from userland


Apologies if I'm not using the proper jargon (although I'm a longtime Linux user, I'm not an admin) or if this is a FAQ (searching SE got lots of hits, but I didn't see anything quite like this question):

I'm a user on a science cluster (with jobs managed by PBS/Torque, on RHEL5, FWIW). I'm about to start my first really-big job, so I asked the admin some configuration questions, to avoid stupid mistakes. I was mostly right, but he added the advice to "make sure you are not hammering the disk server with too much I/O," with followup to "use top [to] see if the nfs is going nuts."

How to do that? This is a cluster, so a lot is going on "behind the scenes" that is transparent to me. Plus I have next-to-no privileges. I also am limited to CLI via SSH, but that's the least of my problems. On the plus side, I do seem to be able to shell into any of the compute nodes, including those with attached disk(s).

So I'm wondering, how best to monitor NFS from userland? I know a little bit about top and NFS, so I know I can do

top -p$(pgrep nfsd -d ',')

to get the list of NFS processes (no?). But what I'd really like to know–again, as a user (I have neither sudo nor root) on RHEL5 (yes, we're still running that)–are

  1. One, or a few, aggregate statistics for NFS load across all NFS processes. Is this something I can get from top or another tool, without scraping output and doing my own math? And should I be monitoring processes other than nfsd?
  2. Advice concerning quantification of "NFS going nuts." If I can get one/few aggregate statistics, I can presumably get a pre-my-job baseline, but that still doesn't tell me "how high is too high."

Note: top appears not to be the right tool for this task, but at least it is available to me. Tools that are not available to me include

  1. nfsstat
  2. iostat
  3. iotop

Best Answer

Looking at top output is the wrong approach; what matters is the IOPS. To view the NFS statistics, use nfsstat:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
40833255   0          0          0          0       

Server nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 1411374   3% 107       0% 43169     0% 747514    1% 790       0% 
read         write        create       mkdir        symlink      mknod        
38138706 93% 0         0% 0         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
0         0% 0         0% 0         0% 0         0% 0         0% 491559    1% 
fsstat       fsinfo       pathconf     commit       
6         0% 12        0% 6         0% 0         0% 
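These counters are cumulative, so a single reading says little about current load; the useful number is the change between two readings. If nfsstat is not installed, the same counters can usually be read straight from the kernel, since nfsstat gets them from /proc/net/rpc/nfsd on a server (and /proc/net/rpc/nfs on a client). The sketch below assumes that file exists and is readable by an unprivileged user on your node, which I have not verified for your cluster; the function names are made up for illustration.

```shell
# Sum all NFSv3 per-operation counters from an nfsd stats file.
# The "proc3" line has the form: proc3 <n> <op1> <op2> ... <opn>
# so fields 3..NF are the per-operation call counts.
nfs3_total_ops() {
    # $1 = stats file (assumed: /proc/net/rpc/nfsd on the NFS server)
    awk '$1 == "proc3" { t = 0; for (i = 3; i <= NF; i++) t += $i; printf "%.0f\n", t }' "$1"
}

# Integer operations per second between two cumulative totals.
nfs3_rate() {
    # $1 = earlier total, $2 = later total, $3 = seconds between samples
    echo $(( ($2 - $1) / $3 ))
}

# Example use (takes one minute to run):
#   before=$(nfs3_total_ops /proc/net/rpc/nfsd)
#   sleep 60
#   after=$(nfs3_total_ops /proc/net/rpc/nfsd)
#   nfs3_rate "$before" "$after" 60
```

Taking one such reading before your job starts and another while it runs gives you the baseline comparison you asked about.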

If you have a monitoring program (for instance, Zabbix), you can add a UserParameter to watch them:

# NFS stats
UserParameter=nfs.v3.server[*],nfsstat -s -l | awk 'BEGIN {FS=": *"}/v3 server.*$1:/ {print $$2}'

and make pretty graphs.

How high is too high? It totally depends on your workload:

[Figure: NFS graph]

You need to watch the filesystem and disk latency to see if you're overloading the disks.
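Without iostat, disk latency can still be estimated from /proc/diskstats, which is where iostat gets its numbers. A rough sketch, assuming you can read that file on the node that holds the disks: take two snapshots and divide the change in time spent on I/O by the change in completed I/Os, which is essentially iostat's "await". The function name and snapshot paths are hypothetical. Whole-disk lines have at least 14 fields; on older kernels partition lines are shorter and are skipped here.

```shell
# Average milliseconds per completed I/O for one device between two
# snapshots of /proc/diskstats.
# Fields used: $3 = device name, $4 = reads completed, $7 = ms spent reading,
#              $8 = writes completed, $11 = ms spent writing.
disk_await_ms() {
    # $1 = device name (e.g. sda), $2 = earlier snapshot, $3 = later snapshot
    awk -v dev="$1" '
        $3 == dev && NF >= 14 {
            if (NR == FNR) { ios0 = $4 + $8; ms0 = $7 + $11 }  # first file
            else           { ios1 = $4 + $8; ms1 = $7 + $11 }  # second file
        }
        END {
            d = ios1 - ios0
            if (d > 0) printf "%.2f\n", (ms1 - ms0) / d
            else       print "0.00"
        }' "$2" "$3"
}

# Example use:
#   cat /proc/diskstats > /tmp/ds.before
#   sleep 60
#   cat /proc/diskstats > /tmp/ds.after
#   disk_await_ms sda /tmp/ds.before /tmp/ds.after
```

As with the NFS counters, a number like this only means something relative to a baseline taken before your job starts.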