NFS v3 versus v4

nfsnfs4solarissolaris-10

I am wondering why NFS v4 would be so much faster than NFS v3 and if there are any parameters on v3 that could be tweaked.

I mount a file system

sudo mount  -o  'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=4'  toto:/test /test

and then run

 dd if=/test/file  of=/dev/null bs=1024k

I can read 200-400MB/s but when I change version to vers=3, remount and rerun the dd I only get 90MB/s. The file I'm reading from is an in memory file on the NFS server. Both sides of the connection are Solaris and have 10GbE NIC. I avoid any client side caching by remounting between all tests. I used dtrace to see on the server to measure how fast data is being served via NFS. For both v3 and v4 I changed:

 nfs4_bsize
 nfs3_bsize

from default 32K to 1M (on v4 I maxed at 150MB/s with 32K)
I've tried tweaking

  • nfs3_max_threads
  • clnt_max_conns
  • nfs3_async_clusters

to improve the v3 performance, but no go.

On v3 if I run four parallel dd's the throughput goes down from 90MB/s to 70-80MBs which leads me to believe the problem is some shared resource and if so, then I'm wondering what it is and if I can increase that resource.

dtrace code to get window sizes:

#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs

inline string ADDR=$$1;

dtrace:::BEGIN
{
       TITLE = 10;
       title = 0;
       printf("starting up ...\n");
       self->start = 0;
}

tcp:::send, tcp:::receive
/   self->start == 0  /
{
     walltime[args[1]->cs_cid]= timestamp;
     self->start = 1;
}

tcp:::send, tcp:::receive
/   title == 0  &&
     ( ADDR == NULL || args[3]->tcps_raddr == ADDR  ) /
{
      printf("%4s %15s %6s %6s %6s %8s %8s %8s %8s %8s  %8s %8s %8s  %8s %8s\n",
        "cid",
        "ip",
        "usend"    ,
        "urecd" ,
        "delta"  ,
        "send"  ,
        "recd"  ,
        "ssz"  ,
        "sscal"  ,
        "rsz",
        "rscal",
        "congw",
        "conthr",
        "flags",
        "retran"
      );
      title = TITLE ;
}

tcp:::send
/     ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    this->delta= timestamp-walltime[args[1]->cs_cid];
    walltime[args[1]->cs_cid]=timestamp;
    this->flags="";
    this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
    printf("%5d %14s %6d %6d %6d %8d \ %-8s %8d %6d %8d  %8d %8d %12d %s %d  \n",
        args[1]->cs_cid%1000,
        args[3]->tcps_raddr  ,
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        this->delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        this->flags,
        args[3]->tcps_retransmit
      );
    this->flags=0;
    title--;
    this->delta=0;
}

tcp:::receive
/ nfs[args[1]->cs_cid] &&  ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
    this->delta= timestamp-walltime[args[1]->cs_cid];
    walltime[args[1]->cs_cid]=timestamp;
    this->flags="";
    this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
    printf("%5d %14s %6d %6d %6d %8s / %-8d %8d %6d %8d  %8d %8d %12d %s %d  \n",
        args[1]->cs_cid%1000,
        args[3]->tcps_raddr  ,
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        this->delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        this->flags,
        args[3]->tcps_retransmit
      );
    this->flags=0;
    title--;
    this->delta=0;
}

Output looks like ( not from this particular situation):

cid              ip  usend  urecd  delta     send     recd      ssz    sscal      rsz     rscal    congw   conthr     flags   retran
  320 192.168.100.186    240      0    272      240 \             49232      0  1049800         5  1049800         2896 ACK|PUSH| 0
  320 192.168.100.186    240      0    196          / 68          49232      0  1049800         5  1049800         2896 ACK|PUSH| 0
  320 192.168.100.186      0      0  27445        0 \             49232      0  1049800         5  1049800         2896 ACK| 0
   24 192.168.100.177      0      0 255562          / 52          64060      0    64240         0    91980         2920 ACK|PUSH| 0
   24 192.168.100.177     52      0    301       52 \             64060      0    64240         0    91980         2920 ACK|PUSH| 0

some headers

usend - unacknowledged send bytes
urecd - unacknowledged received bytes
ssz - send window
rsz - receive window
congw - congestion window

planning on taking snoop's of the dd's over v3 and v4 and comparing. Have already done it but there was too much traffic and I used a disk file instead of a cached file which made comparing timings meaningless. Will run other snoop's with cached data and no other traffic between boxes. TBD

Additionally the network guys say there is no traffic shaping or bandwidth limiters on the connections.

Best Answer

NFS 4.1 (minor 1) is designed to be a faster and more efficient protocol and is recommended over previous versions, especially 4.0.

This includes client-side caching, and although not relevant in this scenario, parallel-NFS (pNFS). The major change is that is that the protocol is now stateful.

http://www.netapp.com/us/communities/tech-ontap/nfsv4-0408.html

I think it is the recommended protocol when using NetApps, judging by their performance documentation. The technology is similar to Windows Vista+ opportunistic locking.

NFSv4 differs from previous versions of NFS by allowing a server to delegate specific actions on a file to a client to enable more aggressive client caching of data and to allow caching of the locking state. A server cedes control of file updates and the locking state to a client via a delegation. This reduces latency by allowing the client to perform various operations and cache data locally. Two types of delegations currently exist: read and write. The server has the ability to call back a delegation from a client should there be contention for a file. Once a client holds a delegation, it can perform operations on files whose data has been cached locally to avoid network latency and optimize I/O. The more aggressive caching that results from delegations can be a big help in environments with the following characteristics:

  • Frequent opens and closes
  • Frequent GETATTRs
  • File locking
  • Read-only sharing
  • High latency
  • Fast clients
  • Heavily loaded server with many clients