NFS client unable to mount shares from otherwise responsive server

nfs

We are unable to mount NFS shares on our Debian Lenny database/web-app NFS client from our Fedora 8 NFS server. A manual mount command with explicit options and an automatic mount via the fstab entry both show the same behavior. The machine crashed unexpectedly 6 days ago, but this problem appears to have arisen 3 days ago. (And yes, it was only reported to me this morning by the staff responsible for it.)

This same server is working properly for all other NFS clients. The NFS client also serves some of its own shares back to other clients and to the NFS server, and those mounts are also working properly.

Processes depending upon these mounts hang, and have been hanging since the 26th. Cron has been turned off to keep the load average at a reasonable level.
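For what it's worth, those hung jobs show up as processes stuck in uninterruptible sleep (state D), which is what pushes the load average up even while the CPUs sit idle. A quick way to list them (standard procps ps; the wchan column shows roughly where in the kernel each one is blocked):

db:~# ps axo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'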

Mounts are authenticating properly on the NFS server, judging by the 'authenticated mount request' messages in the server's logs, but the client hangs during the mount:

# mount -vvv -t nfs server.example.org:/shared/foo /shared/foo/
mount: fstab path: "/etc/fstab"
mount: lock path:  "/etc/mtab~"
mount: temp path:  "/etc/mtab.tmp"
mount: spec:  "server.example.org:/shared/foo"
mount: node:  "/shared/foo/"
mount: types: "nfs"
mount: opts:  "(null)"
mount: external mount: argv[0] = "/sbin/mount.nfs"
mount: external mount: argv[1] = "server.example.org:/shared/foo"
mount: external mount: argv[2] = "/shared/foo/"
mount: external mount: argv[3] = "-v"
mount: external mount: argv[4] = "-o"
mount: external mount: argv[5] = "rw"
mount.nfs: trying 192.168.xxx.xxx prog 100003 vers 3 prot TCP port 2049
mount.nfs: trying 192.168.xxx.xxx prog 100005 vers 3 prot UDP port 51852

And there it sits indefinitely with no further on-screen output, most likely because of the problem indicated by the following kernel trace:

Mar 28 10:17:14 db kernel: [1299206.229436] mount.nfs     D e250c5d5     0 20597  20596
Mar 28 10:17:14 db kernel: [1299206.229439]        c0a3cde0 00000086 f7555b00 e250c5d5 0001ca16 c0a3cf6c ce0a9020 0000000d 
Mar 28 10:17:14 db kernel: [1299206.229444]        0013bc68 077ffe57 00000003 00000000 00000000 00000000 00000000 00000246 
Mar 28 10:17:14 db kernel: [1299206.229447]        c0a77c90 00000000 c0a77c98 ce000a7c f8e047c1 c02c93a4 f8e0479c f4518588 
Mar 28 10:17:14 db kernel: [1299206.229451] Call Trace:
Mar 28 10:17:14 db kernel: [1299206.229465]  [<f8e047c1>] rpc_wait_bit_killable+0x25/0x2a [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229485]  [<c02c93a4>] __wait_on_bit+0x33/0x58
Mar 28 10:17:14 db kernel: [1299206.229490]  [<f8e0479c>] rpc_wait_bit_killable+0x0/0x2a [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229505]  [<f8e0479c>] rpc_wait_bit_killable+0x0/0x2a [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229519]  [<c02c9428>] out_of_line_wait_on_bit+0x5f/0x67
Mar 28 10:17:14 db kernel: [1299206.229523]  [<c0138859>] wake_bit_function+0x0/0x3c
Mar 28 10:17:14 db kernel: [1299206.229528]  [<f8e04c06>] __rpc_execute+0xbe/0x1d9 [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229543]  [<f8dffa72>] rpc_run_task+0x40/0x45 [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229557]  [<f8dffb00>] rpc_call_sync+0x38/0x52 [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229573]  [<f8e80351>] nfs3_rpc_wrapper+0x14/0x49 [nfs]
Mar 28 10:17:14 db kernel: [1299206.229591]  [<f8e8044f>] do_proc_fsinfo+0x54/0x75 [nfs]
Mar 28 10:17:14 db kernel: [1299206.229607]  [<f8e80481>] nfs3_proc_fsinfo+0x11/0x36 [nfs]
Mar 28 10:17:14 db kernel: [1299206.229621]  [<f8e70514>] nfs_probe_fsinfo+0x78/0x47f [nfs]
Mar 28 10:17:14 db kernel: [1299206.229634]  [<f8dffd1f>] rpc_shutdown_client+0x9d/0xa5 [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229647]  [<f8dffb58>] rpc_ping+0x3e/0x47 [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229662]  [<f8e00845>] rpc_bind_new_program+0x69/0x6f [sunrpc]
Mar 28 10:17:14 db kernel: [1299206.229677]  [<f8e71584>] nfs_create_server+0x37b/0x4fa [nfs]
Mar 28 10:17:14 db kernel: [1299206.229693]  [<c01621c1>] __alloc_pages_internal+0xb5/0x34e
Mar 28 10:17:14 db kernel: [1299206.229700]  [<c013882c>] autoremove_wake_function+0x0/0x2d
Mar 28 10:17:14 db kernel: [1299206.229703]  [<c01e7e3c>] idr_get_empty_slot+0x11c/0x1ed
Mar 28 10:17:14 db kernel: [1299206.229711]  [<f8e78fbd>] nfs_get_sb+0x528/0x810 [nfs]
Mar 28 10:17:14 db kernel: [1299206.229724]  [<c01e8125>] idr_pre_get+0x21/0x2f
Mar 28 10:17:14 db kernel: [1299206.229729]  [<c0180159>] vfs_kern_mount+0x7b/0xed
Mar 28 10:17:14 db kernel: [1299206.229734]  [<c0180209>] do_kern_mount+0x2f/0xb8
Mar 28 10:17:14 db kernel: [1299206.229738]  [<c019264a>] do_new_mount+0x55/0x89
Mar 28 10:17:14 db kernel: [1299206.229743]  [<c0192825>] do_mount+0x1a7/0x1c6
Mar 28 10:17:14 db kernel: [1299206.229747]  [<c02ca52a>] error_code+0x72/0x78
Mar 28 10:17:14 db kernel: [1299206.229752]  [<c0190895>] copy_mount_options+0x90/0x109
Mar 28 10:17:14 db kernel: [1299206.229756]  [<c01928b1>] sys_mount+0x6d/0xa8
Mar 28 10:17:14 db kernel: [1299206.229760]  [<c0108857>] sysenter_past_esp+0x78/0xb1
Mar 28 10:17:14 db kernel: [1299206.229766]  =======================
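For completeness: the mount.nfs output above appears to have gotten answers from the server's portmapper (the non-default UDP port 51852 for MOUNT would have come from there), so the hang seems to happen while talking to mountd/nfsd themselves. The checks I'd run next from the client, assuming rpcinfo and showmount are available (they come with the usual portmap/NFS client tooling):

db:~# rpcinfo -p server.example.org            # dump the server's portmapper registrations
db:~# rpcinfo -t server.example.org nfs 3      # null call to NFS v3 over TCP
db:~# rpcinfo -u server.example.org mountd 3   # null call to MOUNT v3 over UDP
db:~# showmount -e server.example.org          # ask mountd for its export list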

Networking is operating properly: production users on the database web-app front end are seeing no interruptions in service and no performance issues.
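That said, reachability of the NFS-related ports themselves is cheap to confirm (assuming netcat is installed):

db:~# ping -c 3 server.example.org
db:~# nc -z -v -w 3 server.example.org 111     # portmapper
db:~# nc -z -v -w 3 server.example.org 2049    # nfsd over TCP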

Memory is fine:

db:/var/log# free -m
             total       used       free     shared    buffers     cached
Mem:         24352      19426       4926          0        281      18283
-/+ buffers/cache:        860      23492
Swap:         7632          0       7632

/etc/fstab:

server.example.org:/shared/foo  /foo        nfs defaults    0 0
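As an aside, here is an illustrative variant of that entry (not what is deployed; the option values are only examples) that keeps a dead server from blocking forever at mount time: bg pushes a failed mount attempt into the background, and retry= caps how many minutes mount keeps retrying. Adding soft/timeo/retrans would also bound in-flight I/O, but soft mounts can hand I/O errors to a database, so that is a separate judgment call.

server.example.org:/shared/foo  /foo  nfs  rw,tcp,bg,retry=5  0 0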

Relevant line from server's /etc/exports:
/shared/foo 192.168.xxx.xxx(rw,no_root_squash)
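Two quick sanity checks on the server side (the prompt is illustrative, and the log path assumes rpc.mountd logs to /var/log/messages as it does on a stock Fedora syslog setup):

[root@server ~]# exportfs -v | grep /shared/foo
[root@server ~]# grep 'authenticated mount request' /var/log/messages | tail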

tcpdump output looks normal. I can post it if someone wants me to, but it's rather large and doesn't appear to contain anything distinctly nasty.
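If a capture would help, a narrower one is easy to produce; something like the following (interface name illustrative) limits it to the portmapper, nfsd, and the mountd port the client was trying:

db:~# tcpdump -n -i eth0 -w nfs-mount.pcap host server.example.org and \( port 111 or port 2049 or port 51852 \)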

Best Answer

I ran out of time to troubleshoot and ended up restarting the services, after killing off some other hung mount attempts that the devs had also fired off.

Restarting portmap and the Debian nfs service got this working again AFTER killing off the stuck client mount attempts. The NFS service restarted the rpc.statd, rpc.idmapd, and rpc.mountd processes.
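Roughly the sequence, from memory, assuming the stock Lenny init scripts (nfs-kernel-server is what runs rpc.mountd on this box, since it exports shares of its own):

db:~# ps axo pid,stat,cmd | grep '[m]ount.nfs'   # find the wedged mount attempts
db:~# kill -9 <pid> ...                          # substitute the PIDs found above
db:~# /etc/init.d/portmap restart
db:~# /etc/init.d/nfs-common restart
db:~# /etc/init.d/nfs-kernel-server restart
db:~# mount /foo                                 # re-try the fstab mount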

Stacktraces were no longer being generated for new mount requests after the old mount attempts were killed.