Linux NFS hangs after about 15 minutes

linuxlinux-kernelnfs

I'm somewhat new when it comes to administering Linux and NFS, so please bear with me.

We are trying to setup a small cluster at work. Currently the system is just 2 DELL high-end workstations, running CentOS 6.5.
To make administering users and files easier, we have decided to share via NFS the /home directory, and four files from /etc (passwd, group, shadow, and gshadow (This is done by moving them to a sub-directory, and using links to put them back in /etc))

The files are shared with this in /etc/exports on the server:

/home/  x.x.x.0/24(rw,sync,no_root_squash,no_all_squash)  
/etc/sub_dir/   x.x.x.0/24(rw,sync,no_root_squash,no_all_squash)  

The files are mounted with this in /etc/fstab on the client:

server_name:/home/          /home/          nfs rw,sync,hard,intr 0 0
server_name:/etc/sub_dir/           /etc/sub_dir/           nfs rw,sync,hard,intr 0 0

After setting all this up, the system worked fine for about a month. As long as the server was on, when the client started all the server's users were visible, and as well were all the files.

However, about 5 days ago, it started to act up.
After starting the client, it would work fine for about 15 minutes (give or take). The users could log on locally or via SSH.
After that first 15 minutes, the system would almost totally lock up. New users couldn't log in, and logged in users couldn't do anything. (Basic things like moving the mouse still work). Only way to get the system going again is to turn the client off and back on.
Unfortunately, this also means debugging on the client is very difficult.

We have narrowed down the problem to have something to do with the NFS shares of the files described above. (We know this because disabling the mounts in /etc/fstab allows the client to go back to it's own local files, and everything works fine)

The farthest we can figure is the system starts, mounts everything, and works. Then, the connections drop, and the next time the client needs access to a file (for example, passwd) it can't find it, and the system hangs waiting for the connection.

The computers are on the same 1000 Mbps switch, and the load is fairly light.

Any help would be much appreciated.

UPDATES:

I have been doing some more digging. I found this similar issue here on Server Fault, but it too is unresolved.
I have also tried switching to UDP, but that also didn't solve the issue.
I came across articles explaining how to find and solve NFS being stale (but I'm not sure that's the issue).

UPDATES:

I managed to get a log from the client from /var/log/messages (during a few minutes of operation).
Looking through it, I found a repeating pattern of nfsidmap being "blocked", then a Call Trace with lots of "[nfs]" entries in it.

Aug  7 14:17:01 computer-name kernel: INFO: task crond:10578 blocked for more than 120 seconds.
Aug  7 14:17:01 computer-name kernel:      Tainted: P           ---------------    2.6.32-431.20.3.el6.x86_64 #1
Aug  7 14:17:01 computer-name kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  7 14:17:01 computer-name kernel: crond         D 000000000000000e     0 10578      1 0x00000080
Aug  7 14:17:01 computer-name kernel: ffff880a5cf0b148 0000000000000082 0000000000000000 ffffffff81059096
Aug  7 14:17:01 computer-name kernel: ffff880a5cf0b0d8 ffff880a5f77eaa0 ffff880a5cf0b0d8 ffffffff8105559d
Aug  7 14:17:01 computer-name kernel: ffff880a555a5098 ffff880a5cf0bfd8 000000000000fbc8 ffff880a555a5098
Aug  7 14:17:01 computer-name kernel: Call Trace:
Aug  7 14:17:01 computer-name kernel: [<ffffffff81059096>] ? enqueue_task+0x66/0x80
Aug  7 14:17:01 computer-name kernel: [<ffffffff8105559d>] ? check_preempt_curr+0x6d/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff815296d5>] schedule_timeout+0x215/0x2e0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109afb6>] ? autoremove_wake_function+0x16/0x40
Aug  7 14:17:01 computer-name kernel: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff81529353>] wait_for_common+0x123/0x180
Aug  7 14:17:01 computer-name kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81095211>] ? __queue_work+0x41/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffff8152946d>] wait_for_completion+0x1d/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109386c>] call_usermodehelper_exec+0x10c/0x120
Aug  7 14:17:01 computer-name kernel: [<ffffffff812246ae>] call_sbin_request_key+0x24e/0x2f0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8121eb03>] ? key_instantiate_and_link+0xa3/0xb0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1060030>] ? nfs4_callback_layoutrecall+0x30/0x90 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff812241e5>] request_key_and_link+0x315/0x3d0
Aug  7 14:17:01 computer-name kernel: [<ffffffff812243b0>] request_key+0x50/0xa0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105cb65>] nfs_idmap_request_key+0xc5/0x170 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d194>] nfs_idmap_lookup_id+0x34/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d5d5>] nfs_map_name_to_uid+0x75/0xa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057504>] decode_getfattr_attrs+0xf64/0xfa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10575c3>] decode_getfattr_generic.clone.0+0x83/0xe0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057ce0>] nfs4_xdr_dec_access+0xb0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f90fc4>] rpcauth_unwrap_resp+0x84/0xb0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85923>] call_decode+0x1b3/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85770>] ? call_decode+0x0/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f677>] __rpc_execute+0x77/0x350 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f9b1>] rpc_execute+0x61/0xa0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f863a5>] rpc_run_task+0x75/0x90 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f864c2>] rpc_call_sync+0x42/0x70 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104ba9e>] _nfs4_call_sync+0x3e/0x40 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a7cc>] _nfs4_proc_access+0x11c/0x1a0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a89b>] nfs4_proc_access+0x4b/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa102658c>] nfs_do_access+0x19c/0x240 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f92625>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f915f0>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10266d8>] nfs_permission+0xa8/0x1e0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff81198e93>] __link_path_walk+0xb3/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff81199abf>] __link_path_walk+0xcdf/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a09a>] path_walk+0x6a/0xe0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a2ab>] filename_lookup+0x6b/0xc0
Aug  7 14:17:01 computer-name kernel: [<ffffffff81226c26>] ? security_file_alloc+0x16/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119b784>] do_filp_open+0x104/0xd20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8128f70a>] ? strncpy_from_user+0x4a/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff811a8a62>] ? alloc_fd+0x92/0x160
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185ba9>] do_sys_open+0x69/0x140
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185cc0>] sys_open+0x20/0x30
Aug  7 14:17:01 computer-name kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Aug  7 14:17:01 computer-name kernel: INFO: task nfsidmap:13767 blocked for more than 120 seconds.
Aug  7 14:17:01 computer-name kernel:      Tainted: P           ---------------    2.6.32-431.20.3.el6.x86_64 #1
Aug  7 14:17:01 computer-name kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  7 14:17:01 computer-name kernel: nfsidmap      D 000000000000000e     0 13767  13766 0x00000080
Aug  7 14:17:01 computer-name kernel: ffff88145d1e93f8 0000000000000086 0000000000000000 ffff880a5bae6080
Aug  7 14:17:01 computer-name kernel: ffff88145d1e9378 ffffffff814b1d35 0000000053e3c1ad 0000000028930cb7
Aug  7 14:17:01 computer-name kernel: ffff88145139f058 ffff88145d1e9fd8 000000000000fbc8 ffff88145139f058
Aug  7 14:17:01 computer-name kernel: Call Trace:
Aug  7 14:17:01 computer-name kernel: [<ffffffff814b1d35>] ? tcp_event_new_data_sent+0xb5/0x110
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d90>] ? key_wait_bit+0x0/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d9e>] key_wait_bit+0xe/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81529a8f>] __wait_on_bit+0x5f/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d90>] ? key_wait_bit+0x0/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81529b38>] out_of_line_wait_on_bit+0x78/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d7e>] wait_for_key_construction+0x6e/0x80
Aug  7 14:17:01 computer-name kernel: [<ffffffff812243c5>] request_key+0x65/0xa0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105cb65>] nfs_idmap_request_key+0xc5/0x170 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d194>] nfs_idmap_lookup_id+0x34/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d5d5>] nfs_map_name_to_uid+0x75/0xa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057504>] decode_getfattr_attrs+0xf64/0xfa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10575c3>] decode_getfattr_generic.clone.0+0x83/0xe0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057ce0>] nfs4_xdr_dec_access+0xb0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f90fc4>] rpcauth_unwrap_resp+0x84/0xb0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85923>] call_decode+0x1b3/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85770>] ? call_decode+0x0/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f677>] __rpc_execute+0x77/0x350 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f9b1>] rpc_execute+0x61/0xa0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f863a5>] rpc_run_task+0x75/0x90 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f864c2>] rpc_call_sync+0x42/0x70 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104ba9e>] _nfs4_call_sync+0x3e/0x40 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a7cc>] _nfs4_proc_access+0x11c/0x1a0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a89b>] nfs4_proc_access+0x4b/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa102658c>] nfs_do_access+0x19c/0x240 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f92625>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f915f0>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10266d8>] nfs_permission+0xa8/0x1e0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff81198e93>] __link_path_walk+0xb3/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff81199abf>] __link_path_walk+0xcdf/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a09a>] path_walk+0x6a/0xe0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a2ab>] filename_lookup+0x6b/0xc0
Aug  7 14:17:01 computer-name kernel: [<ffffffff81226c26>] ? security_file_alloc+0x16/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119b784>] do_filp_open+0x104/0xd20
Aug  7 14:17:01 computer-name kernel: [<ffffffff811a27e8>] ? d_free+0x58/0x60
Aug  7 14:17:01 computer-name kernel: [<ffffffff8128f70a>] ? strncpy_from_user+0x4a/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff811a8a62>] ? alloc_fd+0x92/0x160
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185ba9>] do_sys_open+0x69/0x140
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185cc0>] sys_open+0x20/0x30
Aug  7 14:17:01 computer-name kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

(Sorry about the length, I have no idea what parts are useful)
This pattern repeats every two minutes.

According to this and this, the message indicates that there is some kind of resource starvation. However, the client is typically idle.

Best Answer

The kernel error messages indicate it was not able to schedule a process to run for 120 seconds. It's either extreme high CPU usage or contention at the I/O level.

I would recommend against using NFS to share system critical files like /etc/passwd or even symlinks because the very NFS operations is dependent on them. You could think about setting up a script to transfer them via SCP and overwrite the current ones, but then you've to think about the logic to determine which server has the newer files.

The long-term solution would be to adopt LDAP.

EDIT: based on additional information provided in comments, changing from NFSv4 to NFSv3 is an alternative.