My current setup: 2 NFS servers sharing the same directory with identical content, 1 keepalived server as SLB (or rather for failover in this scenario), and 1 NFSv4 client mounting through the VIP. All systems run CentOS 6 (2.6.32-573.26.1.el6.x86_64). Because this is a testing environment, all machines are on the same subnet (192.168.66.xx). For reference, the IPs (last octet) are as follows:
99 VIP
100 nfs01
101 nfs02
102 client
103 keepalived01
The NFS servers are configured as such:
/root/share 192.168.66.0/24(ro,fsid=0,sync,no_root_squash)
As for keepalived, I am running it in DR mode (NAT mode fails to work at all).
vrrp_instance nfs {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 103 # 103 on master, 101 on backup
    authentication {
        auth_type PASS
        auth_pass hiServer
    }
    virtual_ipaddress {
        192.168.66.99/24 dev eth0
    }
}
virtual_server 192.168.66.99 2049 {
    delay_loop 6
    lb_algo wlc
    lb_kind DR
    protocol TCP
    real_server 192.168.66.100 2049 {
        weight 100
        TCP_CHECK {
            connect_timeout 6
            connect_port 2049
        }
    }
    real_server 192.168.66.101 2049 {
        weight 102
        TCP_CHECK {
            connect_port 2049
            connect_timeout 6
        }
    }
}
Lastly the client mounts via this command:
mount -t nfs4 192.168.66.99:/ /nfsdata
The NFSv4 mount seems to function, though I haven't stress-tested it. One thing I notice is that during failover (i.e. when I shut down one of the NFS servers, forcing keepalived to move service to the other NFS server), the client seems to hang for some time before responding. I believe this is due to the 90-second grace period.
The problem that nags me is that on the NFS servers, this log line keeps appearing every few seconds, flooding the logs:
kernel: nfsd: peername failed (err 107)!
I've tried using tcpdump to see what is causing the traffic and spotted repeating exchanges between the NFS server and the keepalived server. At first I thought iptables could be the culprit, but flushing the rules on both machines does not stop the error.
If there is a way to suppress the error I may call it a day (is there?), but I am curious: does the NFS server have a reason to try to communicate with the keepalived server in this scenario? Or perhaps there is something fundamentally wrong with setting up NFS HA this way, even though it seems to work?
Best Answer
Upon further inspection, the error
kernel: nfsd: peername failed (err 107)!
appears approximately every 6 seconds. The number seems to correspond to the connect_timeout option in the conf file (delay_loop, the interval between health checks, is also set to 6), and indeed stopping the keepalived service makes the error stop appearing altogether.
It seems that with TCP_CHECK on port 2049, the NFS servers log the "bad" connection attempts, since keepalived only opens and closes a TCP connection and does not send NFS messages according to protocol (err 107 is ENOTCONN, "transport endpoint is not connected").
In the end I used MISC_CHECK instead to check the NFS servers' health (with a custom shell script calling rpcinfo).
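For anyone wanting a starting point, here is a minimal sketch of what such a check script could look like. It is not my exact script: the function name check_nfs and the RPCINFO override variable are made up for illustration, and it assumes rpcinfo is installed on the keepalived machine and that the portmapper on the NFS server is reachable from it (rpcinfo -t looks the service up there before making a NULL call over TCP).

```shell
#!/bin/sh
# Hypothetical health check for use with keepalived's MISC_CHECK.
# Returns 0 when the NFS service on the given host answers an RPC NULL
# call over TCP, non-zero otherwise.

# RPCINFO is an assumption added so the command can be swapped out when
# testing; it defaults to the real rpcinfo binary.
RPCINFO="${RPCINFO:-rpcinfo}"

check_nfs() {
    host="${1:-}"

    # Refuse to run without a target host.
    if [ -z "$host" ]; then
        echo "usage: check_nfs <nfs-server-ip>" >&2
        return 2
    fi

    # -t: call procedure 0 (NULL) of program "nfs", version 4, over TCP.
    # This is a well-formed RPC request, unlike a bare connect-and-close.
    "$RPCINFO" -t "$host" nfs 4 >/dev/null 2>&1
}
```

To use it, save it somewhere (say /usr/local/bin/check_nfs.sh, a hypothetical path) with a final line check_nfs "$1", then replace each TCP_CHECK block with a MISC_CHECK block containing misc_path "/usr/local/bin/check_nfs.sh 192.168.66.100" (or .101 for the second server) and a misc_timeout value. keepalived treats exit status 0 as healthy.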