I am currently trying to get down to the core of a problem where my
LVS-director seems to drop a packet coming from a client from time to
time. We have this problem on our production systems and can reproduce
the problem on staging.
I posted this problem on the lvs-users-mailing-list and got no response so far.
Our setup:
We are using ipvsadm with Linux CentOS5 x86_64 in a PV XEN-DomU.
Current Version details:
- Kernel: 2.6.18-348.1.1.el5xen
- ipvsadm: 1.24-13.el5
LVS-Setup:
We use IPVS in DR-mode; for managing the running connections we use lvs-kiss.
ipvsadm is running in a heartbeat-v1-cluster (two virtual nodes); master
and backup are running constantly on both nodes.
For the LVS-services we use logical IPs set up by heartbeat
(active/passive-clustermode).
The real-servers are physical Linux-machines.
Network-Setup:
The VM acting as director is running as XEN-PV-DomU on a Dom0 using bridged networks.
Networks "in play":
- abn-network: the staging-network, used to connect the client to the director,
used by the real-servers to send the answer to the clients (direct routing approach),
and used for ipvsadm slave/master multicast-traffic
- lvs-network: This is a dedicated VLAN which connects director and real-servers
- DR-arp-problem: solved by suppressing arp-answers on the real-servers for the service-ip
- The service-IP is configured as logical IP on the lvs-interface on the real-servers.
- In this setup ip_forwarding is not needed anywhere (neither on
director, nor on real-server).
VM details:
1 GB RAM, 2 vCPUs, system-load almost 0, memory 73M free, 224M buffers, 536M cache, no swap.
top shows almost always 100% idle, 0% us/sy/ni/wa/hi/si/st.
Configuration details:
ipvsadm -Ln for the service in question shows:
TCP x.y.183.217:12405 wrr persistent 7200
-> 192.168.83.234:12405 Route 1000 0 0
-> 192.168.83.235:12405 Route 1000 0 0
x.y first two octets are from our internal class-B-range.
We use 192.168.83.x as lvs-network for staging.
Persistent ipvsadm-configuration:
/etc/sysconfig/ipvsadm: --set 20 20 20
Cluster-configuration:
/etc/ha.d/haresources: $primary_directorname lvs-kiss x.y.183.217
lvs-kiss-configuration-snippet for the service above:
<VirtualServer idm-abn:12405>
ServiceType tcp
Scheduler wrr
DynamicScheduler 0
Persistance 7200
QueueSize 2
Fuzz 0.1
<RealServer rs1-lvs:12405>
PacketForwardingMethod gatewaying
Test ping -c 1 -nq -W 1 rs1-lvs >/dev/null
RunOnFailure "/sbin/ipvsadm -d -t idm-abn:12405 -r rs1-lvs"
RunOnRecovery "/sbin/ipvsadm -a -t idm-abn:12405 -r rs1-lvs"
</RealServer>
<RealServer rs2-lvs:12405>
PacketForwardingMethod gatewaying
Test ping -c 1 -nq -W 1 rs2-lvs >/dev/null
RunOnFailure "/sbin/ipvsadm -d -t idm-abn:12405 -r rs2-lvs"
RunOnRecovery "/sbin/ipvsadm -a -t idm-abn:12405 -r rs2-lvs"
</RealServer>
</VirtualServer>
idm-abn, rs1 and rs2 resolve via /etc/hosts.
About the service:
This is a soa-web-service.
How we reproduce the error:
From a client we run constant calls to the web-service at an interval of one call every three seconds.
From time to time there will be a connection reset from the director to the client.
Interesting: this happens on every (n x 100 + 1)th try; the interesting part is the +1.
What we did to trace down the problem:
- Checked /proc/sys/net/ipv4/vs: all values are set to default, so drop_packet is NOT in place (=0)
- tcpdump on client, frontend/abn of the director, backend/lvs of the director, lvs and abn of the real-servers
In this tcpdump we could see a request from the client, answered by a
connection-reset by the director.
The packet was NOT forwarded via LVS.
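For reference, a capture along the following lines isolates the resets on the service port. This is only a sketch: the interface name `eth0` is an assumption and has to be replaced by the director's abn-side interface, and `capture_resets` is a hypothetical helper name. The filter uses the standard pcap `tcp[tcpflags]` syntax.

```shell
# Sketch: show only RST packets for the service port on one interface.
# IFACE is an assumption; substitute the director's abn-side interface.
PORT=12405
IFACE="${IFACE:-eth0}"

# pcap filter: traffic on the service port that has the RST flag set
RST_FILTER="tcp port ${PORT} and (tcp[tcpflags] & tcp-rst != 0)"

capture_resets() {
    # -n: no name resolution, -i: capture interface; needs root
    tcpdump -n -i "$IFACE" "$RST_FILTER"
}
```

Running `capture_resets` on the director while the client loop is active should show whether the reset originates on the director itself.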
I welcome any ideas on how to track this problem down further.
If any information is unclear or missing, please ask.
Best Answer
Do you have any stateful iptables rules on the LVS-DR director? As I can see you are using port 12405, so if you have a rule like this:
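The exact rules are not reproduced here. As an illustration only, a stateful rule set of the kind meant would typically look like this in iptables-save syntax (the INPUT chain and the use of the `state` match are assumptions; the port is the one from the question):

```
# Assumed example of a stateful rule set, not taken from the actual director
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 12405 -j ACCEPT
```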
In LVS-DR the real servers reply to the clients directly (not via the director), so the director won't add those connections to its connection tracking table, and the FIN packets won't be matched by the ESTABLISHED,RELATED rules in the director's iptables. Since you only allow NEW (SYN) packets on port 12405, the FIN will be blocked. You have to use a stateless firewall on an LVS-DR director for load-balanced services:
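A stateless rule for this service could look like the following. This is a sketch in iptables-save syntax, assuming the INPUT chain of the filter table; the port is the one from the question, and any source or destination restrictions would have to be added to match the actual setup:

```
# Assumed example: accept all TCP packets for the service port, regardless of connection state
-A INPUT -p tcp -m tcp --dport 12405 -j ACCEPT
```

Because the rule has no `state` match, it accepts SYN, ACK and FIN packets alike, so it does not depend on the director having seen both directions of the connection.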