This can actually be caused by a bug. I know because I've had to fix it myself.
According to the RFC, when priorities are equal on both nodes:
If the Priority in the ADVERTISEMENT is equal to the local
Priority and the primary IP Address of the sender is greater
than the local primary IP Address, then:
o Cancel Adver_Timer
o Set Master_Down_Timer to Master_Down_Interval
o Transition to the {Backup} state
So, he who has the biggest IP address will win.
In keepalived, the way this is done is basically wrong. Endianness is not considered properly when doing this comparison.
Let's imagine we have two routers, (A) 10.1.1.200 and (B) 10.1.1.201.
The code should perform the following comparison.
On A:
if (10.1.1.201 > 10.1.1.200) // True
be_backup();
On B:
if (10.1.1.200 > 10.1.1.201) // False
be_master();
However, because the endianness is not correctly handled, the following comparison is made instead.
On A:
if (10.1.1.201 > 200.1.1.10) // False
be_master();
On B:
if (10.1.1.200 > 201.1.1.10) // False
be_master();
This patch should work, but I've remade it from my original patch and have not tested it. Not even tested that it compiles! So no refunds!
--- vrrp/vrrp.c.old 2013-10-13 17:39:29.421000176 +0100
+++ vrrp/vrrp.c 2013-10-13 18:07:57.360000966 +0100
@@ -923,7 +923,7 @@
} else if (vrrp->family == AF_INET) {
if (hd->priority > vrrp->effective_priority ||
(hd->priority == vrrp->effective_priority &&
- ntohl(saddr) > ntohl(VRRP_PKT_SADDR(vrrp)))) {
+ ntohl(saddr) > VRRP_PKT_SADDR(vrrp))) {
log_message(LOG_INFO, "VRRP_Instance(%s) Received higher prio advert"
, vrrp->iname);
if (proto == IPPROTO_IPSEC_AH) {
From the LVS mailing list
None of the current IPVS schedulers implement "highest weight" balancing.
With the "weighted" schedulers, you can e.g. give your primary server
a weight of max. 65535 and your secondary server a weight of 1. This way,
you've "almost" reached the point you're asking for - however, one out
of 64k incoming connections will go to the "secondary" server even
while the primary server is still up and running.
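The heavy/light weighting described above might look like this with ipvsadm; the VIP and real-server addresses are placeholders, and wrr (weighted round-robin) stands in for whichever weighted scheduler you prefer:

```shell
# Hypothetical addresses; -m = NAT forwarding, -w = weight.
ipvsadm -A -t 192.0.2.10:80 -s wrr
ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.1:80 -m -w 65535   # primary: max weight
ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.2:80 -m -w 1       # secondary: ~1 in 64k connections
```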
If your application is balancing-ready, this behaviour may be a good thing.
For example, by automatically using the secondary system for a few live
requests, you ensure your secondary system is actually working.
By sending some live traffic, you may also "warm up" application-specific
caches, so upon a "real" failover, the application will perform much better
than with empty caches.
If you really don't need (or your applications can't handle) the
"balancing" part (distribute traffic to different servers at the same time),
you'd probably better run "typical" high availability/failover software
like Pacemaker or some VRRP daemon.
For example, you might put all three boxes into the same VRRP instance
and assign them different VRRP priorities, and VRRP will sort out which box
has the "best" priority and is going to be the only live system. This results
in some kind of "cascading" failover.
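In keepalived terms, the priority ladder could be sketched like this (interface, router ID, and addresses are made-up examples; each box gets its own copy with a different priority value):

```
vrrp_instance VI_CASCADE {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 150        # box A: highest priority, normally MASTER
                        # box B would use e.g. 100, box C 50
    virtual_ipaddress {
        192.0.2.10/24
    }
}
```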
If you need balancing to distribute traffic among different servers,
and you'd still like to have this "cascading" failover, you'll need to run
at least two balancers (or balancer pairs): one for the "primary" server farm, with the
VIP of the other balancer being set as sorry server. The second balancer
in turn balances to the "secondary" server farm and also has the maintenance
server set as a sorry server.
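With keepalived as the director, the "VIP of the other balancer as sorry server" idea can be expressed with the sorry_server directive; all addresses below are placeholders for illustration:

```
virtual_server 192.0.2.10 80 {          # VIP of the first balancer
    lb_algo wrr
    lb_kind NAT
    protocol TCP
    sorry_server 192.0.2.20 80          # VIP of the second balancer (secondary farm)
    real_server 10.0.0.1 80 {           # primary farm member
        weight 1
        TCP_CHECK { connect_timeout 3 }
    }
}
```

The second balancer would mirror this, pointing its sorry_server at the maintenance box.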
One usecase for such scenarios are web farms with slightly different content:
if the primary farm drops out of service (e.g. due to overload or some
bleeding-edge feature malfunctioning), the secondary farm may serve a less
feature-rich version of the same service.
Best Answer
We have a similar setup, but using kamailio instead of haproxy. Anyway, we were seeing messages like that, so we changed the way we were performing the checks (our checks have nothing to do with yours; we were checking that kamailio responds to OPTIONS requests).
You can try to add "fall 3", which means the check script must fail 3 times before changing state. Also, "weight" is useless in the "vrrp_script" section. Good luck!
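For reference, a vrrp_script block using fall might look like the sketch below; the script command and numbers are examples, not taken from the original poster's setup:

```
vrrp_script chk_haproxy {
    script "killall -0 haproxy"   # example check: succeed if the process exists
    interval 2                    # run every 2 seconds
    fall 3                        # require 3 consecutive failures before FAULT
    rise 2                        # and 2 consecutive successes to recover
}
```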