Linux – keepalived VRRP_script not failing over


I am running keepalived on two servers and I can't get it to fail over from one to the other.

Below is my config for one of the servers. The only difference between the two is the priority: the master has 110 and the backup has 109.

But when I stop my process with /etc/init.d/process stop, keepalived doesn't fail over. I just get the message VRRP_Script(chk_script) failed and nothing else. No failover happens.

vrrp_script chk_script {
script "/usr/local/bin/failover.sh"
interval 2
weight 2
}

vrrp_instance HAInstance {
state BACKUP
interface eth0
virtual_router_id 8
priority 109
advert_int 1
nopreempt
vrrp_unicast_bind 10.10.10.8
vrrp_unicast_peer 10.10.10.9
virtual_ipaddress {
  10.10.10.10/16 dev eth0
}
notify /usr/local/bin/keepalivednotify.sh
track_script {
  chk_script weight 20
}
}

This is my check script below. The same problem also happens when I use killall -0 process as the script.

#!/bin/bash
# Exit 0 if the service is running, 1 otherwise (keepalived treats a non-zero exit as a failed check).
SERVICE='process'
STATUS=$(ps ax | grep -v grep | grep "$SERVICE")

if [ "$STATUS" != "" ]
then
    exit 0
else
    exit 1
fi

Does anyone know a fix for this? Thanks.

Best Answer

I had exactly the same issue. However, my problem was neither the firewall nor my Ethernet adapter, but the "weight" setting of the check script.

This was my configuration:

MASTER:

vrrp_instance haproxy {
state MASTER
interface eth0
virtual_router_id 51
priority 150
advert_int 1

BACKUP:

vrrp_instance haproxy {
state BACKUP
interface eth0
virtual_router_id 51
priority 100
advert_int 1

Check_script:

vrrp_script chk_haproxy {
   script "python /root/ha_check.py"
   interval 2     # check every 2 seconds
   weight 2
   rise 2
   fall 2

}

The master refused to release the VIP because, even though the script had failed, the master still had a higher priority than the BACKUP server. The "weight" setting of the check script was too small to cover the gap between the two priorities, so the BACKUP server's priority never rose above that of the MASTER server. To explain further:

According to the keepalived manual, a positive "weight" adds that number to the priority while the check succeeds.
A negative "weight" subtracts its absolute value from the priority while the check fails.
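
The weight rule above can be written out as a small shell sketch. The function name effective_priority and its arguments are hypothetical and only illustrate the arithmetic. This is not keepalived code, and it ignores rise/fall counters and the 1 to 254 priority bounds.

```shell
#!/bin/bash
# Sketch of the weight rule: not part of keepalived, names are illustrative.
# effective_priority BASE WEIGHT CHECK_EXIT
#   BASE        configured priority of the vrrp_instance
#   WEIGHT      weight of the tracked script
#   CHECK_EXIT  exit status of the check script (0 = check passed)
effective_priority() {
    local base=$1 weight=$2 check_exit=$3
    if [ "$weight" -gt 0 ] && [ "$check_exit" -eq 0 ]; then
        # Positive weight: bonus is added only while the check passes.
        echo $((base + weight))
    elif [ "$weight" -lt 0 ] && [ "$check_exit" -ne 0 ]; then
        # Negative weight: penalty is applied only while the check fails.
        echo $((base + weight))
    else
        echo "$base"
    fi
}

effective_priority 150 2 0     # prints 152
effective_priority 150 2 1     # prints 150
effective_priority 150 -60 1   # prints 90
```

With a weight of 2, a failed check only removes the 2-point bonus, which is why a large gap between the configured priorities cannot be closed.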

So, according to my configuration:

Server priorities before the script fails:
MASTER: 152 (150 + 2)
BACKUP: 102 (100 + 2, assuming the check also passes on the backup)
Failover_IP: MASTER

The failover IP is correctly held by the master server, since it has a higher priority than the backup server (152 > 102).

Server priorities after the script fails on the master:
MASTER server: 150 (the +2 bonus is lost)
BACKUP server: 102
Failover_IP: STILL ON MASTER

The failover IP is still on the master server because it again has a higher priority than the BACKUP (150 > 102). The MASTER server was refusing to release the IP, and rightly so, since its priority was higher than the other server's.

The solution in my situation was one of the following:

Solution 1: Change the priorities of both servers so that the gap between them is smaller than the weight.
For example:
Master Priority: 150
Backup Priority: 149
Check_script weight: As it is (2).

With the above configuration, when the script succeeds on both servers (meaning all is OK) the priorities would be:
Master: 152
Backup: 151
IP_Location: On Master (152 > 151)

When the script fails on the master:
Master: 150
Backup: 151
IP_Location: On Backup (151 > 150)

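Solution 2: Change the weight of the script from 2 to -60. With the original priorities (150 and 100), a failed check then drops the master to 90, which is below the backup's 100, so the backup takes over.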
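
To check a given combination of priorities and weight before deploying it, the comparison can be sketched in shell. The helper names would_fail_over and report are hypothetical. The sketch assumes the check fails on the master and passes on the backup, and it treats only a strictly higher backup priority as a failover.

```shell
#!/bin/bash
# Sketch only: compares effective priorities, it does not talk to keepalived.
# Assumption: the check fails on the master and passes on the backup.

# would_fail_over MASTER_PRIO BACKUP_PRIO WEIGHT
# Returns 0 if the backup ends up with the strictly higher priority.
would_fail_over() {
    local master=$1 backup=$2 weight=$3
    if [ "$weight" -gt 0 ]; then
        # Positive weight: the healthy backup keeps the bonus, the master loses it.
        backup=$((backup + weight))
    else
        # Negative weight: the failed master is penalised, the backup is unchanged.
        master=$((master + weight))
    fi
    [ "$backup" -gt "$master" ]
}

report() {
    if would_fail_over "$1" "$2" "$3"; then
        echo "master=$1 backup=$2 weight=$3: backup takes over"
    else
        echo "master=$1 backup=$2 weight=$3: VIP stays on master"
    fi
}

report 150 100 2     # original configuration
report 150 149 2     # solution 1: smaller gap
report 150 100 -60   # solution 2: large negative weight
```

Under these assumptions the first call reports that the VIP stays on the master, while the other two report that the backup takes over.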
