Cisco ASA 5512 (OS 9.1(1)) HA failover issues

Tags: cisco, cisco-asa, failover

I have two Cisco ASA 5512 units (Security Plus license) in an Active/Standby HA failover pair. I have tested both stateful and stateless failover, and neither works.

Both units are connected (over a switched path) to a VRRP-enabled gateway set up by the hosting provider where the units are installed.

Only one unit can hold the active role in a stable way at any given time. As soon as I manually trigger failover from the currently active unit to the standby unit, the new active unit cannot route traffic and automatically falls back to the standby role within minutes, even though L3 connectivity should be available according to the hosting center.

This happens with stateful as well as stateless Active/Standby failover.

Also of note: the two units are configured with two public IPs, X.X.X.1 and X.X.X.2.

At time t0 (before manual failover is triggered): both public IPs are reachable and the PRIMARY unit is ACTIVE.

At time t1: manual failover is triggered. X.X.X.1 (now assigned to the SECONDARY unit, which is ACTIVE) is no longer reachable; X.X.X.2 (now assigned to the PRIMARY unit) is 100% reachable.

At time t2: the pair detects that it cannot route traffic via the SECONDARY unit and falls back to the original state (PRIMARY unit ACTIVE, SECONDARY unit STANDBY). X.X.X.2 is no longer reachable; X.X.X.1 is reachable again.

At time t2 + approx. 30 minutes: X.X.X.2 becomes reachable again.
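
To put exact timestamps on this sequence rather than estimates, periodic probe results (for example, one ping per minute to each public IP from an outside host) could be reduced to outage windows. The following is a minimal Python sketch; the function name `outage_windows` and the sample format are illustrative assumptions, not part of any ASA tooling.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Reduce time-ordered (timestamp, reachable) probe samples to outage windows.

    Each window is (first_failed_probe, first_successful_probe_after_it).
    An outage that is still open when the samples end gets an end of None.
    """
    windows = []
    start = None
    for ts, reachable in samples:
        if not reachable and start is None:
            start = ts                    # outage begins at first failed probe
        elif reachable and start is not None:
            windows.append((start, ts))   # outage ends at first good probe
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


if __name__ == "__main__":
    # Synthetic data for illustration only: one probe per minute,
    # with probes 5..34 failing. These are NOT measurements from the ASAs.
    t0 = datetime(2000, 1, 1, 0, 0)
    samples = [(t0 + timedelta(minutes=i), not (5 <= i < 35)) for i in range(40)]
    for begin, end in outage_windows(samples):
        print(begin, end, end - begin)
```

Running one such log per public IP would show whether the outage on X.X.X.2 has a consistent length after each fallback, which is useful evidence to hand to the hosting center.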

The outside interfaces of both units are connected to the same VRRP-enabled segment. According to the hosting site, no special L2 settings are in place on the upstream ports.

It looks like the pair can route traffic in a stable way (that is, with the unit staying permanently in "Active" status) only when the active unit is connected to one particular port of the two VRRP-enabled ports. The behavior described above occurs no matter which ASA is connected to that port, and always follows the same pattern.

The hosting center has checked the VRRP settings (as well as the L2 switching path) and was not able to find any significant issue that would explain this behavior.

With stateful failover, I have also observed a number of "Route Session" errors in the LU updates exchanged between the two ASAs:

on the PRIMARY ASA:

ASA-fw# show failover
Failover On
Failover unit Primary
Failover LAN Interface: failover GigabitEthernet0/5 (up)
Unit Poll frequency 3 seconds, holdtime 10 seconds
Interface Poll frequency 3 seconds, holdtime 15 seconds
Interface Policy 1
Monitored Interfaces 3 of 114 maximum
failover replication http
Version: Ours 9.1(1), Mate 9.1(1)
Last Failover at: 09:14:07 UTC Jan 25 2016
        This host: Primary - Active
                Active time: 415235 (sec)
                slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
                  Interface outside (X.X.X.195): Normal (Monitored)
                  Interface inside (192.168.1.1): Normal (Monitored)
                  Interface management (192.168.99.1): Normal (Monitored)
        Other host: Secondary - Standby Ready
                Active time: 3090024 (sec)
                slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
                  Interface outside (X.X.X.200): Normal (Monitored)
                  Interface inside (192.168.1.3): Normal (Monitored)
                  Interface management (192.168.99.2): Normal (Monitored)

Stateful Failover Logical Update Statistics
        Link : failover GigabitEthernet0/5 (up)
        Stateful Obj    xmit       xerr       rcv        rerr
        General         1518905    0          56441      9
        sys cmd         54791      0          54790      0
        up time         0          0          0          0
        RPC services    0          0          0          0
        TCP conn        1126177    0          156        0
        UDP conn        140986     0          889        0
        ARP tbl         190094     0          606        0
        Xlate_Timeout   0          0          0          0
        IPv6 ND tbl     0          0          0          0
        VPN IKEv1 SA    4          0          0          0
        VPN IKEv1 P2    8          0          0          0
        VPN IKEv2 SA    0          0          0          0
        VPN IKEv2 P2    0          0          0          0
        VPN CTCP upd    0          0          0          0
        VPN SDI upd     0          0          0          0
        VPN DHCP upd    0          0          0          0
        SIP Session     0          0          0          0
        Route Session   6842       0          0          9
        User-Identity   3          0          0          0
        CTS SGTNAME     0          0          0          0
        CTS PAC         0          0          0          0
        TrustSec-SXP    0          0          0          0
        IPv6 Route      0          0          0          0

        Logical Update Queue Information
                        Cur     Max     Total
        Recv Q:         0       12      135522
        Xmit Q:         0       31      1717200

ASA-fw#

on the SECONDARY ASA:

ASA-fw# show failover
Failover On
Failover unit Secondary
Failover LAN Interface: failover GigabitEthernet0/5 (up)
Unit Poll frequency 3 seconds, holdtime 10 seconds
Interface Poll frequency 3 seconds, holdtime 15 seconds
Interface Policy 1
Monitored Interfaces 3 of 114 maximum
failover replication http
Version: Ours 9.1(1), Mate 9.1(1)
Last Failover at: 09:14:07 UTC Jan 25 2016
        This host: Secondary - Standby Ready
                Active time: 3090024 (sec)
                slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
                  Interface outside (X.X.X.200): Normal (Monitored)
                  Interface inside (192.168.1.3): Normal (Monitored)
                  Interface management (192.168.99.2): Normal (Monitored)
        Other host: Primary - Active
                Active time: 415193 (sec)
                slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
                  Interface outside (X.X.X.195): Normal (Monitored)
                  Interface inside (192.168.1.1): Normal (Monitored)
                  Interface management (192.168.99.1): Normal (Monitored)

Stateful Failover Logical Update Statistics
        Link : failover GigabitEthernet0/5 (up)
        Stateful Obj    xmit       xerr       rcv        rerr
        General         408964     0          1527527    6852
        sys cmd         69155      0          69153      0
        up time         0          0          0          0
        RPC services    0          0          0          0
        TCP conn        258657     0          1127204    0
        UDP conn        34851      0          140988     0
        ARP tbl         44206      0          190171     0
        Xlate_Timeout   0          0          0          0
        IPv6 ND tbl     0          0          0          0
        VPN IKEv1 SA    30         0          4          0
        VPN IKEv1 P2    8          0          4          0
        VPN IKEv2 SA    0          0          0          0
        VPN IKEv2 P2    0          0          0          0
        VPN CTCP upd    0          0          0          0
        VPN SDI upd     0          0          0          0
        VPN DHCP upd    0          0          0          0
        SIP Session     0          0          0          0
        Route Session   2010       0          0          6852
        User-Identity   47         0          3          0
        CTS SGTNAME     0          0          0          0
        CTS PAC         0          0          0          0
        TrustSec-SXP    0          0          0          0
        IPv6 Route      0          0          0          0

        Logical Update Queue Information
                        Cur     Max     Total
        Recv Q:         0       20      1811827
        Xmit Q:         0       30      466771

ASA-fw#

Are there any debugs that might help me troubleshoot this issue, and possibly identify a misconfiguration on the hosting center side?

I would also like to understand the "Route Session" entry and the impact of the xerr/rerr counts in that row.
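
For comparing the counters between the two units, a small script could pull the rows with non-zero error counts out of the `show failover` output. This is only a sketch that assumes the table layout shown in the outputs above (a name followed by four integer columns); `parse_lu_stats` and `objects_with_errors` are hypothetical helper names, and the sample is an excerpt of the SECONDARY output.

```python
import re

# One row of the "Stateful Failover Logical Update Statistics" table:
# an object name followed by exactly four integers (xmit, xerr, rcv, rerr).
ROW = re.compile(r"^\s*(?P<name>\S.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

# Excerpt copied from the SECONDARY unit's output.
SAMPLE = """\
        Stateful Obj    xmit       xerr       rcv        rerr
        General         408964     0          1527527    6852
        TCP conn        258657     0          1127204    0
        Route Session   2010       0          0          6852
"""


def parse_lu_stats(text):
    """Return {object name: {"xmit", "xerr", "rcv", "rerr"}} for each table row."""
    stats = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header and non-table lines do not match and are skipped
            xmit, xerr, rcv, rerr = (int(m.group(i)) for i in range(2, 6))
            stats[m.group("name")] = {
                "xmit": xmit, "xerr": xerr, "rcv": rcv, "rerr": rerr,
            }
    return stats


def objects_with_errors(stats):
    """Return {object name: (xerr, rerr)} for rows where either count is non-zero."""
    return {
        name: (c["xerr"], c["rerr"])
        for name, c in stats.items()
        if c["xerr"] or c["rerr"]
    }


if __name__ == "__main__":
    print(objects_with_errors(parse_lu_stats(SAMPLE)))
```

On the excerpt, only the "General" and "Route Session" rows are reported, which matches what can be read off the table by eye: every receive error on the SECONDARY unit is a Route Session error.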

Lastly, could this be a software bug? I read on Cisco's website about several bugs affecting ASA clustering in the 9.1(1) release:

http://www.cisco.com/c/en/us/td/docs/security/asa/asa91/release/notes/asarn91.html

Best Answer

It turned out to be port security enabled on the ISP access switches, which was blocking frames after failover for approx. 30 minutes. Thanks everyone, and Ron Trunk especially.