Ubuntu – DNS issue with Failover IP from Hetzner


Assume we have two servers, A and B, each with its own 'real' external IP, and a so-called 'failover IP' (W.X.Y.Z) that we can switch to point to the external IP of either A or B. This works from the 'outside' and was easy to set up.
As background: the failover IP is configured as a new entry in /etc/network/interfaces:

auto eth0:0  
iface eth0:0 inet static
  address W.X.Y.Z
  netmask 255.255.255.224 

Now let us assume W.X.Y.Z is currently switched to server A. When I call 'curl domain.com' from B, it resolves to the correct failover IP W.X.Y.Z, but the connection then somehow ends up at the wrong machine, B itself (or localhost?), instead of the configured one, A:

Trying W.X.Y.Z ...
* connect to W.X.Y.Z port 443 failed: Connection refused
* Failed to connect to domain.com port 443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to domain.com port 443: Connection refused

When I start nginx locally on B, curl domain.com succeeds.

Do I need to configure DNS locally somehow? How can I find out more about the DNS chain?

mtr only prints domain.com when I run it from server B.

Is this related to this question?

The failover IP is W.X.Y.Z and is also the A record of domain.com

The /etc/hosts file for both nodes serverA and serverB looks like:

    127.0.0.1       localhost
    127.0.1.1       luminarhost            
    xxx    serverA
    xxx    serverB        

The /etc/network/interfaces of serverA

    ### Hetzner Online AG - installimage
    # Loopback device:
    auto lo
    iface lo inet loopback

    # device: eth0
    auto  eth0
    iface eth0 inet static
      address   xxx
      broadcast xxx
      netmask   xxx
      gateway   xxx
      # default route to access subnet
      up route add -net xxx netmask 255.255.255.224 gw xxx eth0

    iface eth0 inet6 static
      address xxx
      netmask xxx
      gateway xxx

    # failover ip
    auto eth0:0
    iface eth0:0 inet static
      address W.X.Y.Z
      netmask 255.255.255.224

and of serverB it is:

    ### Hetzner Online AG - installimage
    # Loopback device:
    auto lo
    iface lo inet loopback

    # device: eth0
    auto  eth0
    iface eth0 inet static
      address   xxx
      broadcast xxx
      netmask   xxx
      gateway   xxx
      # default route to access subnet
      up route add -net xxx netmask 255.255.255.192 gw xxx eth0

    iface eth0 inet6 static
      address xxx
      netmask xxx
      gateway xxx

    # failover ip
    auto eth0:0
    iface eth0:0 inet static
      address W.X.Y.Z
      netmask 255.255.255.224

Best Answer

  • As promised, here goes my answer:

  • Full disclosure: I don't work for Hetzner, but I have worked, and still work, for several companies that colocate hardware at Hetzner.

  • In case the location inside your profile is correct, and you need support: I'm based in the same city, and could offer a hand, or two.

  • For everyone who has never dealt with Hetzner: they filter network access, which means, particularly for their failover IPs (IPs that can be used on different machines to provide some form of high availability), that traffic directed to a specific IP is delivered to a specific MAC address.

  • To change the target (the machine) the traffic is directed to, one has to send a POST request to an API that is served via HTTPS. The API validates the authentication (a username and a corresponding password) and the request and, if both are valid, propagates the new config to various routers in the network. This technique is similar to the one used by OVH, a big provider based in France.

  • Caveat: although people use these IPs to provide some form of high availability (as mentioned) for their machines and services, propagating the new routing config takes some time, sometimes up to ~60 seconds. With automatic failover, for example, this means that when the machine currently receiving the traffic goes down, traffic is simply dropped for a noticeable amount of time, until the new routing config is in place.
  • So much for the introduction; let's have a look at your specific problem:
  • As pointed out in the comments / chat, auto eth0:0 brings up your failover IP on the interface eth0:0 as soon as networking starts, normally at boot time. You have two machines with the same configuration, so the same IP is active on two different machines at once (which isn't a no-go, but leads to the situation you're currently dealing with). Just a note: the syntax you're using, aliasing the same interface multiple times, is deprecated (but still works). The "new way", which simply assigns multiple IPs to one interface, is also described in the Debian wiki (this link).
  • So: you have the IP assigned locally on both machines at the same time. In your test case, curl resolves the given domain name to an IP and then tries to connect to that IP on port 443. Because this IP is always assigned locally, and therefore reachable locally, the packets are never sent out to the network. If nginx is not running locally at that moment (as in your test case), you get connection refused, which is perfectly valid: "The IP is local, so let's send the traffic there". The packets never reach a router that might know: "Traffic directed to this IP should go to that machine".
  • Now... actually, I'm not entirely sure what you're after. Do you only want to understand what's happening? If so, I've tried to describe it above. Do you want to find / implement a way to "solve" this situation? If the latter, here are some thoughts:
  • Solution 1: Remove the directive auto eth0:0 from /etc/network/interfaces (but leave the rest of the eth0:0 configuration in place). With that change, the IP is no longer assigned to the machine automatically when networking starts. Assigning it then becomes your task (or the task of a script), which runs ifup eth0:0 (and possibly also talks to the API to ensure the traffic gets routed to the correct machine).
  • Solution 2, aka "automate all the things": don't fail over manually, but implement a system that does it automatically, using heartbeats between both machines to check their health. Several solutions exist for this, for example the Virtual Router Redundancy Protocol and (full disclosure: my personal favorite, which I have been using in production for years for tasks like this) corosync and pacemaker, the de facto standard for setting up high-availability clusters under Linux. (Also, have a look at this.) If you want to try the latter, the fine folks at Kumina developed (and published) a resource agent some years ago for dealing with exactly this situation at Hetzner. The resource agent takes care of updating the routing information by talking to the API.
  • To come to an end (for now): I'm not entirely sure what you're after. I've tried to describe the root cause of the problem you're facing right now, and to present some thoughts on possible solutions. If I didn't get what you're trying to do, if anything is left unclear, or if you have additional questions: please give feedback, I'm glad to help (or at least try to).
  • (Besides: Could you please move your configs etc. into your post, to keep all the stuff in one place, so this question could be of help in the future to other people?)
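
The root cause and Solution 1 described above can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not a tested implementation: the function names route_kind, run and takeover, the DRY_RUN switch, the ROBOT_NETRC credentials file and any example addresses are made up for illustration, and the API URL is an assumption based on Hetzner's Robot webservice that should be checked against their current documentation.

```shell
#!/bin/sh
# Sketch only. Function names, DRY_RUN, ROBOT_NETRC and the API URL
# are illustrative assumptions, not taken from the original answer.

# Classify the output of `ip route get <addr>`. The kernel starts the
# line with "local" when the address is assigned to this machine, in
# which case packets for it never leave the host.
route_kind() {
    case "$1" in
        local\ *) echo local ;;
        *)        echo remote ;;
    esac
}

# Print the command by default (DRY_RUN=1); execute it when DRY_RUN is
# set to anything else.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take over the failover IP on this machine (Solution 1):
#   $1 = failover IP, $2 = main IP of the server that should get the traffic
takeover() {
    failover_ip=$1
    target_ip=$2
    # Assign the IP locally; expects the eth0:0 stanza without "auto".
    run ifup eth0:0
    # Ask the provider to route the failover IP to this server.
    # Credentials come from a netrc file so they stay off the command line.
    run curl -fsS --netrc-file "${ROBOT_NETRC:-$HOME/.netrc-robot}" \
        -d "active_server_ip=$target_ip" \
        "https://robot-ws.your-server.de/failover/$failover_ip"
}
```

On server B, calling route_kind "$(ip route get W.X.Y.Z)" should print local for as long as the alias is up, regardless of where the provider currently routes the IP. Calling takeover with the failover IP and the server's main IP only prints the two commands by default; setting DRY_RUN=0 would actually run them.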