Cisco – SNAT vs PBR for Server Load Balancing

Tags: cisco, cisco-6500, load-balancer, nat, pbr

In a one-armed SLB configuration, SNAT is used to force return traffic to pass back through the SLB. This has one drawback: the web logs cannot capture the true client IP unless it is passed in an X-Forwarded-For (XFF) header and the web server is configured to log it.

An alternative is to use PBR (policy-based routing) to get the return traffic back to the SLB, but I try to avoid PBR unless there's no other or better solution. On the 6500E platform with Sup720/PFC3B (and I know the particular IOS version can be a factor too), does PBR add any latency over doing SNAT, assuming PBR is all done in hardware? And if PBR is done in hardware today using only the commands it supports, is it possible that a future IOS upgrade could change PBR to be software/process-switched?

Today, our load balancers have most of the web server VLANs directly behind them (default gateway pointing to the SLB), with other servers such as SQL in non-SLB VLANs. However, the web-to-SQL traffic currently transits the SLB. Our goal is to keep SQL traffic off the SLB entirely while still retaining the true client IP in the web logs, and I'd prefer not to add the troubleshooting complexity of PBR, or the risk of it changing from hardware to software switching in the future. Short of the XFF and SNAT mentioned earlier, is PBR the only option here, and what's the best way to keep PBR tightly configured?

Best Answer

does PBR add any latency over doing SNAT assuming PBR is all done in hardware?

Sup720 supports PBR in hardware; the additional latency (if any) is negligible, because PBR doesn't add any extra interface queueing. That said, I think PBR would make things harder than they need to be (and I'm still not even sure it would work; the specifics of that option aren't totally clear to me).

Short of the XFF and SNAT mentioned earlier, is PBR the only option here and what's the best way to keep PBR tightly configured?

PBR is not the only option. Your proposed option is a bit unclear, but PBR normally boils down to nothing more than a fancier way to do static routing.
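
To put that in concrete terms, a PBR setup for this kind of return traffic would look roughly like the minimal sketch below; the VLAN number, subnets, and the SLB's inside address (172.16.2.254) are all hypothetical. The ACL is deliberately narrow so that only load-balanced return traffic is policy-routed and everything else follows the normal routing table, which also addresses the "keep PBR tightly configured" concern:

! Match only return traffic sourced from the web servers' service ports
ip access-list extended WEB-RETURN
 permit tcp 172.16.2.0 0.0.0.255 eq 80 any
 permit tcp 172.16.2.0 0.0.0.255 eq 443 any
!
! Send matching traffic to the SLB's inside address instead of following
! the routing table
route-map RETURN-VIA-SLB permit 10
 match ip address WEB-RETURN
 set ip next-hop 172.16.2.254
!
! Apply on the web-server VLAN's SVI (the servers' default gateway)
interface Vlan200
 ip policy route-map RETURN-VIA-SLB

Traffic that doesn't match the ACL (SQL, backups, management) simply falls through to the routing table, which is why this is effectively just a more selective static route.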

Typically this is the best topology for load-balanced services that require SQL queries...

  • Put your VIPs on a front-side subnet... 172.16.1.0/24 in the diagram
  • Put your server pools in a back-side subnet... 172.16.2.0/24 in the diagram
  • Put your SQL queries on another interface... 172.16.3.0/24 in the diagram. Add a second interface to every server that requires SQL queries, and make all your SQL queries to addresses on this subnet. This works for both Unix and Windows, since you're only relying on ARP or host routes (depending on your preference) for SQL connectivity; a minimal addressing sketch follows the diagram below.

Diagram: LB with SQL query network
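
As a rough sketch of the switch side of that topology, assume a made-up VLAN 300 for the SQL-query network, and assume the 172.16.1.0/24 (VIP) and 172.16.2.0/24 (server pool) subnets live on the load balancer itself, so only the SQL subnet appears on the 6500:

! SQL-query VLAN (VLAN number and addressing are hypothetical)
vlan 300
 name SQL-QUERY
!
interface Vlan300
 description SQL query network - bypasses the SLB
 ip address 172.16.3.1 255.255.255.0
 no shutdown

Each server that needs SQL gets a second NIC in that VLAN and sources its SQL connections from its 172.16.3.x address, so that traffic never touches the SLB or the load-balanced path.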

This topology has additional benefits:

  • It separates the SQL interfaces from the web traffic; SQL queries are bursty and could otherwise wind up congesting your web traffic.
  • It provides different MTUs for your load-balanced services (which usually need to stay at 1500 or lower, if they transit the internet) and your SQL services (which favor jumbo frames).
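
As an illustration of that MTU split, here is a minimal sketch, again using the hypothetical Vlan300 from above; whether jumbo MTU is set per port or via the global system jumbomtu depends on the line card:

! Global jumbo MTU for ports that don't support per-port MTU configuration
system jumbomtu 9216
!
! SQL-query SVI carries jumbo frames
interface Vlan300
 mtu 9216
!
! The front-side (VIP) and back-side (pool) subnets keep the default 1500,
! since that traffic transits the internet.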

Any one-armed load-balancer topology is a less desirable option, since traffic must enter and leave the SLB on the same link, which cuts your maximum throughput in half.

EDIT for question about HW vs SW switching on Sup720

This is a deep topic, but I will give the summary version... Sup720 applies an ACL in each direction (ingress / egress), and the ACL must fit into TCAM based on whatever merge algorithm the platform has chosen. Sup720's Feature Manager (i.e. fm) is responsible for merging the configured features into TCAM and reporting whether you have a punt adjacency (meaning SW switching), or whether that combination of protocol and direction is switched in HW. To determine whether your PBR traffic would be switched in hardware:

  1. First, identify all ingress and egress Layer3 interfaces that the PBR traffic could transit
  2. Next, examine the output of show fm fie int <L3_intf_name> | i ^Interf|Result|Flow|Config (you must look at both ingress and egress directions for all interfaces in Step 1). Your traffic will be HW switched if the values in CAPS match the values you see below... note that the output of the command I'm using is very similar to what you see in show fm fie summary...

Tx.Core01#sh fm fie int Vl220 | i ^Interf|Result|Flow|Config
Interface Vl220:
 Flowmask conflict status for protocol IP : FIE_FLOWMASK_STATUS_SUCCESS      <--- in HW
 Flowmask conflict status for protocol OTHER : FIE_FLOWMASK_STATUS_SUCCESS   <--- in HW
 Flowmask conflict status for protocol IPV6 : FIE_FLOWMASK_STATUS_SUCCESS    <--- in HW
Interface Vl220 [Ingress]:
 FIE Result for protocol IP : FIE_SUCCESS_NO_CONFLICT                        <--- in HW
 Features Configured : V4_DEF   - Protocol : IP
 FIE Result for protocol OTHER : FIE_SUCCESS_NO_CONFLICT                     <--- in HW
 Features Configured : OTH_DEF   - Protocol : OTHER
 FIE Result for protocol IPV6 : FIE_SUCCESS_NO_CONFLICT                      <--- in HW
 Features Configured : V6_DEF   - Protocol : IPV6
Interface Vl220 [Egress]:
 No Features Configured
No IP Guardian Feature Configured
No IPv6 Guardian Feature Configured
No QoS Feature Configured
Tx.Core01#

The interface above has no egress features configured, so there is no egress result to show; when egress features are present, the output mirrors the Ingress direction. Ricky Micky wrote up an outstanding explanation of 'sh fm fie interface' if you would like more details on the dynamics of TCAM banks / merge results.
