Linux – Why does writing a file to an NFS share send a COMMIT operation to the NFS server

linux nfs

I have a Debian squeeze machine (2.6.32-5-amd64) that is both an NFS4 server and an NFS4 client (it mounts its own export through NFS4). The local directory that leads directly to disk is /nfs4exports/mydir, whereas /nfs4mounts/mydir is the same directory mounted through NFS, using the machine's external IP address. Here is the line from fstab:

192.168.1.75:/mydir   /nfs4mounts/mydir      nfs4    soft  0 0

I have an application that writes many small files. If I write directly to /nfs4exports/mydir, it writes thousands of files per second; but if I write to /nfs4mounts/mydir, it writes 4 files per second or so. I can greatly increase speed if I add async to /etc/exports. (Writing a single large file to the NFS-mounted directory goes at more than 100 MB/s.)
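A comparison like this can be reproduced with a small shell loop. The following is only a rough sketch: `write_small_files` is a hypothetical helper name, the file names and payload are arbitrary, and the timing has whole-second resolution, so it shows order-of-magnitude differences rather than precise rates.

```shell
# Write COUNT small files into DIR and print the elapsed time in whole seconds.
# Hypothetical helper for illustration; names and payload are arbitrary.
write_small_files() {
    dir=$1
    count=$2
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each redirection opens, writes, and closes one file,
        # the same pattern as a single "echo hello >file".
        echo 'hello' > "$dir/small_$i"
        i=$((i + 1))
    done
    end=$(date +%s)
    echo $((end - start))
}
```

Running `write_small_files /nfs4exports/mydir 1000` and then `write_small_files /nfs4mounts/mydir 1000` should make the gap between the local path and the NFS mount visible.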

I examine the server statistics and I see that whenever a file is written, it is "committed" (this also happens with NFSv3):

root@debianvboxtest:~# mount -t nfs4 192.168.1.75:/mydir /mnt
root@debianvboxtest:~# nfsstat|grep -A 2 'nfs v4 operations'
Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit       
0         0% 0         0% 0         0% 10        4% 1         0% 1         0% 
root@debianvboxtest:~# echo 'hello' >/mnt/test1056
root@debianvboxtest:~# nfsstat|grep -A 2 'nfs v4 operations'
Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit       
0         0% 0         0% 0         0% 11        4% 2         0% 2         0% 
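To read a single counter instead of comparing the tables by eye, the output can be parsed with awk. This is a sketch that assumes the layout shown in the transcript (a line of operation names followed by a line of "count percent%" pairs); `nfs_op_count` is a hypothetical helper name, and it expects only the relevant section on stdin, since operation names such as commit can also appear in other sections of the full nfsstat output.

```shell
# Print the counter for one operation from nfsstat-style output on stdin.
# Hypothetical helper; assumes a header line of operation names followed
# by a line of "count percent%" pairs, as in the transcript above.
nfs_op_count() {
    awk -v op="$1" '
        # Until the operation name is found, scan each line for it and
        # remember its column position.
        col == 0 { for (i = 1; i <= NF; i++) if ($i == op) col = i; next }
        # The following line holds "count pct%" pairs, so the count for
        # column i is field 2*i - 1.
        { print $(2 * col - 1); exit }
    '
}

# Example use around a single write (paths taken from the transcript):
#   before=$(nfsstat | grep -A 2 'nfs v4 operations' | nfs_op_count commit)
#   echo 'hello' >/mnt/test1056
#   after=$(nfsstat | grep -A 2 'nfs v4 operations' | nfs_op_count commit)
#   echo $((after - before))
```

For the two snapshots in the transcript, the commit counter goes from 1 to 2, i.e. one COMMIT for the one file written.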

Now in the RFC, I read this:

The COMMIT operation is similar in operation and semantics to the
POSIX fsync(2) system call that synchronizes a file's state with the
disk (file data and metadata is flushed to disk or stable storage).
COMMIT performs the same operation for a client, flushing any
unsynchronized data and metadata on the server to the server's disk or
stable storage for the specified file.

I don't understand why the client commits. I don't think that the "echo" shell built-in command runs fsync; if echo wrote to a local file and then the machine went down, the file might be lost. In contrast, the NFS client appears to be sending a COMMIT upon completion of the echo. Why?

I am reluctant to use the async NFS server option, because it would apparently ignore COMMIT. I feel as if I had a local filesystem and I had to choose between syncing every file upon close and ignoring fsync altogether. What have I understood wrong?

Best Answer

This is how NFS works, and it is exactly how it should work, since it is a synchronous protocol. What you need to make sure is that the exported file system is backed by LUNs that have NVRAM/BBWC (battery-backed write cache) protection and that handle fsync() accordingly, i.e. ignore it and mask SCSI FUA flags and SCSI SYNCHRONIZE CACHE commands. Also make sure that the file system has barriers disabled if it is backed by BBWC/NVRAM.

This way NFS keeps its synchronous semantics, which is equivalent to running fsync() after every write, but you get the performance of running asynchronously.