Networking – Feasibility of weighted round robin via TTL

domain-name-system, load-balancing, networking, round-robin, ttl

I currently use DNS round robin for load balancing, which works great.
The records look like this (I have a TTL of 120 seconds):

;; ANSWER SECTION:
orion.2x.to.        116 IN  A   80.237.201.41
orion.2x.to.        116 IN  A   87.230.54.12
orion.2x.to.        116 IN  A   87.230.100.10
orion.2x.to.        116 IN  A   87.230.51.65

I learned that not every ISP / device treats such a response the same way.
For example, some DNS servers rotate the addresses randomly or always cycle through them. Some just propagate the first entry, and others try to determine which is best (regionally nearest) by looking at the IP address.

However, if the user base is big enough (spread over multiple ISPs, etc.) it balances pretty well.
The discrepancy between the highest- and lowest-loaded server hardly ever exceeds 15%.

However, now I have the problem that I am introducing more servers into the system, and they do not all have the same capacity.

I currently only have 1 Gbps servers, but I want to work with 100 Mbps and 10 Gbps servers too.

So I want to introduce a 10 Gbps server with a weight of 100, a 1 Gbps server with a weight of 10, and a 100 Mbps server with a weight of 1.

I previously added servers twice to bring more traffic to them (which worked nicely; the bandwidth almost doubled).
But adding a 10 Gbps server 100 times to DNS is a bit ridiculous.
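
For reference, this is what the duplication trick looks like in a BIND-style zone (a hypothetical fragment; note that some DNS servers collapse identical records within an RRset, so whether the duplicate survives depends on the implementation):

```
; hypothetical zone fragment: the first A record is listed twice,
; so resolvers hand it out roughly twice as often
orion.2x.to.    120    IN    A    80.237.201.41
orion.2x.to.    120    IN    A    80.237.201.41
orion.2x.to.    120    IN    A    87.230.54.12
```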

So I thought about using the TTL.

Suppose I give server A a TTL of 240 seconds and server B only 120 seconds (which is about the minimum to use for round robin, as many DNS servers reportedly raise lower TTLs to 120). I think something like this should occur in an ideal scenario:

First 120 seconds
50% of requests get server A -> keep it for 240 seconds.
50% of requests get server B -> keep it for 120 seconds

Second 120 seconds
50% of requests still have server A cached -> keep it for another 120 seconds.
25% of requests get server A -> keep it for 240 seconds
25% of requests get server B -> keep it for 120 seconds

Third 120 seconds
25% will get server A (from the 50% that had server A and now expired) -> cache 240 sec
25% will get server B (from the 50% that had server A and now expired) -> cache 120 sec
25% will still have server A cached for another 120 seconds
12.5% will get server B (from the 25% that had server B and now expired) -> cache 120 sec
12.5% will get server A (from the 25% that had server B and now expired) -> cache 240 sec

Fourth 120 seconds
25% will still have server A cached -> cache for another 120 sec
12.5% will get server A (from the 25% that had server B and now expired) -> cache 240 sec
12.5% will get server B (from the 25% that had server B and now expired) -> cache 120 sec
12.5% will get server A (from the 25% that had server A and now expired) -> cache 240 sec
12.5% will get server B (from the 25% that had server A and now expired) -> cache 120 sec
6.25% will get server A (from the 12.5% that had server B and now expired) -> cache 240 sec
6.25% will get server B (from the 12.5% that had server B and now expired) -> cache 120 sec
12.5% will still have server A cached -> cache another 120 sec
... I think I lost something at this point, but I think you get the idea...

As you can see, this gets pretty complicated to predict, and it will surely not work out exactly like this in practice. But it should definitely have an effect on the distribution!
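Under idealized assumptions the pattern above actually converges to something simple: by a renewal argument, each resolver spends TTL_A / (TTL_A + TTL_B) of its time holding server A, so 240 s vs. 120 s yields roughly a 2:1 split rather than anything near 100:1. A rough simulation of the scheme (hypothetical; it assumes resolvers re-query independently the moment their cache expires, the authoritative server answers 50/50, and each resolver generates a constant request rate):

```python
import random

# TTLs from the scenario above: server A 240 s, server B 120 s
TTL = {"A": 240, "B": 120}

def simulate(resolvers=500, seconds=7200, seed=42):
    rng = random.Random(seed)
    expiry = [0] * resolvers   # when each resolver's cached record expires
    server = [""] * resolvers  # which record it currently holds
    traffic = {"A": 0, "B": 0}
    for t in range(seconds):
        for i in range(resolvers):
            if t >= expiry[i]:  # cache expired -> fresh 50/50 answer
                server[i] = "A" if rng.random() < 0.5 else "B"
                expiry[i] = t + TTL[server[i]]
            traffic[server[i]] += 1  # one request this second
    total = sum(traffic.values())
    return {s: n / total for s, n in traffic.items()}

shares = simulate()
print(shares)  # A ends up near 2/3 and B near 1/3: a 2:1 split, not 100:1
```

So doubling the TTL roughly doubles a server's share, which is why the arithmetic above keeps circling around a 2:1 ratio.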

I know that weighted round robin exists and can be handled by the authoritative DNS server. It cycles through DNS records when responding, returning each record with a probability that corresponds to its weight. My DNS server does not support this, and my requirements are not that precise. If it doesn't weight perfectly, that's okay; it should just go in the right direction.

I think using the TTL field could be a more elegant and easier solution. It doesn't require a DNS server that controls the weighting dynamically, which saves resources and is, in my opinion, the whole point of DNS load balancing versus hardware load balancers.

My question now is: are there any best practices / methods / rules of thumb for weighting round-robin distribution using the TTL attribute of DNS records?

Edit:

The system is a forward proxy server system.
The amount of bandwidth (not requests) exceeds what one single server with Ethernet can handle.
So I need a balancing solution that distributes the bandwidth across several servers. Are there any alternative methods to using DNS?
Of course I could use a load balancer with fibre channel etc., but the costs are ridiculous, and it only widens the bottleneck rather than eliminating it.
The only thing I can think of is anycast (is it anycast or multicast?) IP addresses, but I don't have the means to set up such a system.

Best Answer

First off, I completely agree with @Alnitak that DNS isn't designed for this sort of thing, and best practice is to not (ab)use DNS as a poor man's load balancer.

My question now is... are there any best practices / methods / rules of thumb for weighting round-robin distribution using the TTL attribute of DNS records?

To answer the premise of the question: the approach used to perform basic weighted round robin using DNS is to:

  • Adjust the relative occurrence of records in authoritative DNS responses. That is, if server A is to receive 1/3 of the traffic and server B 2/3, then 1/3 of authoritative DNS responses to DNS proxies would contain only A's IP, and 2/3 of responses only B's IP. (If two or more servers share the same weight, they can be bundled into one response.)
  • Keep a low DNS TTL so that unbalanced load is evened out relatively quickly. Because the downstream DNS proxies have very uneven numbers of clients behind them, you'd want to re-shuffle records frequently.
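
A minimal sketch of that record-selection step (hypothetical; the IPs are the ones from the question, and the weights match its 10 Gbps : 1 Gbps : 100 Mbps example):

```python
import random
from collections import Counter

# Weight each server's A record by its capacity (invented weights)
WEIGHTS = {
    "80.237.201.41": 100,  # 10 Gbps
    "87.230.54.12":   10,  # 1 Gbps
    "87.230.100.10":   1,  # 100 Mbps
}

def pick_record():
    """IP to place in the next authoritative response, weighted by capacity."""
    return random.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=1)[0]

# Over many responses the shares approach 100 : 10 : 1
counts = Counter(pick_record() for _ in range(111_000))
print(counts)
```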

Amazon's Route 53 DNS service uses this method.

The amount of bandwidth (not requests) exceeds what one single server with Ethernet can handle. So I need a balancing solution that distributes the bandwidth across several servers.

Right. So as I understand this, you have some sort of 'cheap' downloads / video distribution / large-file download service, where the total service bitrate exceeds 1 Gbps.

Without knowing the exact specifics of your service and your server layout, it's hard to be precise. But a common solution in this case is:

  • DNS round robin to two or more TCP/IP or HTTP level load balancer instances.
  • Each load balancer instance being highly available (2 identical load balancers cooperating on keeping one IP address always on).
  • Each load balancer instance using weighted round robin or weighted random connection handling to the backend servers.
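
The third bullet is what most software load balancers provide out of the box. As a hypothetical illustration, an HAProxy backend weighted to match the 100:10:1 capacities from the question (server names, addresses, and port are invented):

```
backend proxies
    balance roundrobin
    server ten-gig   10.0.0.1:3128 weight 100
    server one-gig   10.0.0.2:3128 weight 10
    server fast-e    10.0.0.3:3128 weight 1
```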

This kind of setup can be built with open-source software, or with purpose-built appliances from many vendors. The load balancing tag here is a great starting point, or you could hire sysadmins who have done this before to consult for you...