Exchange 2016 Co-Existence – OWA Won’t login when behind a load balancer

exchange-2013exchange-2016load balancingwindows-server-2012-r2

I'm currently in the middle of developing a lab environment to migrate our mixed Exchange 2010 and 2013 environment to Exchange 2016.

I've been running 1x 2013 and 2x 2016 in co-existence in the lab for a few weeks now, and it's worked well – I've done testing for RPC, OWA, ActiveSync and EWS. Everything seems to work fine.

This week I've tried to finalise my deployment by introducing a load balancer in front, with a plan to provide round-robin LB between my 2013 and 2016 CAS nodes. From all I've researched 2013 and 2016 CAS are totally compatible for the CAS role in that they will gladly proxy between each other. And to be sure my manual testing via each CAS (without a load balancer) seems to show this.

But now that I've introduced HA Proxy in front, OWA seems to not want to work properly.

Symptoms

Here's some symptoms:

  • When I log in via 2016 OWA to a 2013 mailbox, it does login correctly, redirects me, and starts loading the inbox. But after a second or two of loading the inbox, I get redirected back to the OWA login page as if I'd never logged in.

  • I can repeat this multiple times and it seems to do the same thing, as long as I login via a 2016 CAS.

  • Sometimes when I login via 2016 it will just hang with "Still working on it". This is far less frequent then just logging me back out, though.

  • Some times it redirects back to the login page so quickly I can't even tell I logged in at all, and has no error message. It's just a flash then back to the login page.

  • I've tried from reasonably up to date versions of both Chrome and Firefox, on Mac and Windows 8.1. So it's not an end-device issue.

  • If I login to a 2013 mailbox via OWA on the 2013 CAS, even through the load balancer, it seems to login correctly first time.

  • I've got one IP on the load balancer which then proxies traffic to each CAS node in a round-robin fashion. This is what I use to test my LB. I've got other IPs pointing directly to individual CAS nodes.

  • If I login directly to a 2016 CAS node to a 2013 mailbox, this works fine. It's one one I login to a 2016 CAS via the load balancer that the issue appears.

  • Interestingly, I just noticed the exact same issues occur even if logging into a 2016 mailbox. So as long as I login to a 2016 CAS, it logs back out.

  • Once I login successfully via the 2013 CAS, I can refresh the page several times and the session will still stay open. To me this indicates that proxying must be working OK to some extent once the session is established, because my LB is round-robin so the traffic should be getting re-distributed over various CAS nodes each time I refresh the OWA page.

It seems pretty clear to me that there are others experiencing similar issues with Exchange 2010 – 2016 when behind a load balancer. Unfortunately this is our first time using a load balancer as we're moving from a single CAS to 3 with this upgrade to Exchange 2016.

I've found a fair few posts with similar symptoms:

What I've Tried

  • I've confirmed that all our SSL frontend (third-party CA) certificates match with the same thumprint and serial across all CAS nodes. I've read that this can impact things. I am using a production certificate to properly test my lab.

  • The backend certs are left as the default Microsoft Exchange certificate, and from what I've read this is fine. I actually tried using our wildcard cert on the backend anyway, and it seemed to break everything.

  • I've verified that when accessing the 2016 CAS without a load balancer, things seem to work fine. Based on this, I thinking if I was to enable full session persistence on the load balancer so a particular user only deals with one CAS, it would fix the problem. But then, I shouldn't have to do that. And it makes me worry there's a deeper error to solve.

  • I've enabled IIS Failed Request tracing. I've found some reports of 500 errors for powershell requests. But they seem related to failed health monitoring requests, and don't necessarily seem to line up time-wise with my OWA tests logging me back out.

  • I've done some basic checking via Wireshark to look for anything weird. I've noticed that a single OWA page load is definitely distributed at least over multiple CAS nodes. Sometimes both 2016 nodes, sometimes a 2016 and a 2013 node.

  • I did notice in my capture (on the load balancer) that during a single failed login attempt, I kept getting repeated 440 Login Timeout responses from the CAS nodes. In this case I got multiple of these, at least one from each CAS node.

  • There's so many individual HTTP requests after the initial login request, it's hard to pin-point where the problem is, but it seems like a bunch of OWA service.svc requests start to be sent by the browser, and at some point they all being to return the same 440 Login Timeout error as a response from the CAS node. Then eventually we are redirected back to the login page.

  • I haven't yet found anything through my research what is actually causing the 440 login timeouts, or if it's expected behaviour etc…

  • I've re-checked all my Virtual Directory settings against our production setup. They look OK.

EDIT: 2017-03-04

  • I have tried multiple different VirtualDirectory internal and external URL combinations. exchange.isp.com.au is the internal AD Domain (this matches the live setup). The external URL for the lab is exchlab.isp.com.au.

  • I have tried:

    • exchlab.isp.com.au for both internal and external.
    • exchange.isp.com.au for both internal and external.
    • I stuck with exchlab for external and exchange for internal. None of these combinations made a difference.
  • Since then, I have also added Cookie-based session persistence in HA Proxy, and it seems to have made a difference. I have only done persistence for OWA and ECP, but the rest are not persistent. Things to note:

    • I am no longer logged back out of OWA repeatedly. It's become much more stable.
    • It still happens the first time I log out of a mailbox (e.g. 2013) and login to a mailbox of a different version. (e.g. 2016). After I get logged out the first time, though, I can now log back in successfully.
    • If I log out of a mailbox and back in too fast (within 10 seconds) then I am often kicked back out instantly.
  • I also decided to proceed with my testing of HA Proxy for services besides OWA. And I can confirm:

    • Outlook Anywhere load balances over all 3 CAS nodes and works fine.
    • EWS also load balances over all 3 and works fine.
    • Active Sync seems to only want to go over the 2x 2016 CASes but also load balances fine.

Things I need to Try

I haven't actually done these yet as I feel like there's an actual configuration issue. Also since everything seems to work without the load balancer, I can't see this helping.

More Environment Info

Some more details on the lab environment I'm testing in.

Relevant bits of my load balancer config:

    # This is the L7 HTTPS Front End
    # ------------------------------
    # We redirect to different backends depending on the URI
    # Each backend has its own separate health checks. So that each service can fail on an Exchange CAS node without affecting the other services.
    frontend ft_ISP_EXCHANGE_https
      bind 172.16.10.7:80 name INT_http
      bind 172.16.10.7:443 name INT_https ssl crt wildcard.isp.com.au.pem # Specify SSL Cert for offloading.
      mode http
      option http-keep-alive
      option prefer-last-server
      no option httpclose
      no option http-server-close
      no option forceclose
      no option http-tunnel
      timeout client 600s
      log global
      capture request header Host len 32
      capture request header User-Agent len 64
      capture response header Content-Length len 10
      # log-format directive must be written on a single line
      # it is splitted for documentation convnience
      log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Tq/%Tw/%Tc/%Tr/%Tt\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ {%sslv/%sslc/%[ssl_fc_sni]/%[ssl_fc_session_id]}\"%[capture.req.method]\ %[capture.req.hdr(0)]%[capture.req.uri]\ HTTP/1.1
      maxconn 1000
      acl ssl_connection ssl_fc # Set ACL ssl_connection if ssl_fc returns TRUE

      # Route request to a different backend depending on the path:
      # http://serverfault.com/questions/127491/haproxy-forward-to-a-different-web-server-based-on-uri
      acl host_mail hdr(Host) -i exchange.isp.com.au
      acl path_slash path /
      acl path_autodiscover path_beg -i /Autodiscover/Autodiscover.xml
      acl path_activesync path_beg -i /Microsoft-Server-ActiveSync
      acl path_ews path_beg -i /ews/
      acl path_owa path_beg -i /owa/
      acl path_oa path_beg -i /rpc/rpcproxy.dll
      acl path_ecp path_beg -i /ecp/
      acl path_oab path_beg -i /oab/
      acl path_mapi path_beg -i /mapi/
      acl path_check path_end -i HealthCheck.htm
      # HTTP deny rules
      http-request deny if path_check
      # HTTP redirect rules
      http-request redirect scheme https code 302 unless ssl_connection # Force SSL
      http-request redirect location /owa/ code 302 if path_slash host_mail # Redirect / to /owa
      # HTTP routing rules -- This is where we decide where to send the request
      # Based on HTTP path.
      use_backend bk_ISP_EXCHANGE_https_autodiscover if path_autodiscover
      use_backend bk_ISP_EXCHANGE_https_ews if path_ews
      # other services go here
      default_backend bk_ISP_EXCHANGE_https_default



      # Backends
    # --------
    # Now we define each backend individually
    # Most of these backends will contain all the same Exchange CAS nodes, pointing to the same IPs
    # The reason we define each backend individually is because it allows us to do separate Health Checks
    # for each individual service running on each CAS node.

    # The failure of one service on a CAS node does not exclude that CAS node from participating in other
    # types of requests. This gives us better overall high-availability design.

    # HTTPS OWA
    # I have added additional comments on this definition, but the same applies to all Exchange HTTPS backends.
    backend bk_ISP_EXCHANGE_https_owa
      balance roundrobin # Use round-robin load balancing for requests
      option http-keep-alive # Enable HTTP Keepalives for session re-use and lowest latency for requests.
      option prefer-last-server # Prefer to keep related connections for a session on the same server.
      # This is not the same as persistence, and is mainly to try and take advantage of HTTP Keepalives.
      # See here for an example of why this is needed: http://stackoverflow.com/questions/35162527/haproxy-keep-alive-not-working-as-expected

      no option httpclose # Disable all options that are counter to keepalives
      no option http-server-close
      no option forceclose
      no option http-tunnel
      mode http # Operate in L7 HTTP Mode (vs TCP mode etc)
      log global
      option httplog
      option forwardfor # Enable insertion of the X_FORWARDED_FOR HTTP Header
      option httpchk GET /owa/HealthCheck.htm # Use L7 HTTP Health Check. This is recommended by Microsoft.
      http-check expect string 200\ OK
      default-server inter 3s rise 2 fall 3
      timeout server 60s
      # Define CAS Nodes for this service. We're using SSL offloading to allow L7 HTTP Checks
      # We've avoided SSL Bridging as that would halve our LB's throughput.
      server ISP_exch16_syd_01 172.16.10.2:80 maxconn 1000 weight 10 check
      server ISP_exch16_syd_02 172.16.10.3:80 maxconn 1000 weight 10 check
      server ISP_exch13 172.16.10.1:80 maxconn 1000 weight 10 check

Best Answer

I ended up reasonably working around this issue. Several posts had mentioned for similar issues that adding Cookie-based persistence on their load balancer had solved the issue for them.

I had resisted this "because I shouldn't have to", all the technical articles from Microsoft say this is no longer needed, and indeed not recommended.

But I caved, and ended up adding persistence for both OWA and ECP. The result was the issue was not completely resolved, but it became barely noticeable. The only time I experience issues is if I log out of OWA on one mailbox, and then immediately log into another. Even then, it kicks you out once, but if you try to login again it works fine.

There is no ongoing issue after this. Furthermore, I noticed the remaining issue only when moving from a 2013 to 2016 mailbox.

This isn't something our end users are likely to do, and we're nearly finished moving all our mailboxes to 2016. So I'm considering this work around "good enough".

Since we're using HA Proxy, it wasn't a big deal to add some layer 7 cookie-based persistence in the config. In fact, it only took about 5 minutes all up to figure out:

# HTTPS Outlook Web App (OWA)
backend bk_EXCHANGE_https_owa
  cookie LB01 insert indirect nocache # <-- Added this
  balance roundrobin
  mode http

  ...

  # Added the "cookie <name>" at the end of each CAS node definition
  # These names must be unique to each node.
  server n1_exch16_syd_01 172.16.10.2:80 maxconn 1000 weight 10 check cookie EXCH16-SYD-01
  server n1_exch16_syd_02 172.16.10.3:80 maxconn 1000 weight 10 check cookie EXCH16-SYD-02
  server n1_exch13 172.16.10.1:80 maxconn 1000 weight 10 check cookie EXCH13-SYD