Intermittent NXDOMAIN responses for certain records with low TTLs

bind, domain-name-system

We're experiencing a peculiar issue with our bind installation (version 9.8.4).

In this scenario, bind is configured as a caching name server for a small network. For the large majority of queries, everything works fine.

However, we've noticed that for some hosts configured with a very low TTL, we sometimes get NXDOMAIN responses even though the host name exists.

As an example, take www.cdn77.com—here's the output of dig when run on the name server itself:

$ dig www.cdn77.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> www.cdn77.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34440
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 12

;; QUESTION SECTION:
;www.cdn77.com.         IN  A

;; ANSWER SECTION:
www.cdn77.com.      196 IN  CNAME   1669655317.rsc.cdn77.org.
1669655317.rsc.cdn77.org. 0 IN  A   185.59.220.12

;; AUTHORITY SECTION:
org.            170517  IN  NS  a2.org.afilias-nst.info.
org.            170517  IN  NS  c0.org.afilias-nst.info.
org.            170517  IN  NS  b0.org.afilias-nst.org.
org.            170517  IN  NS  d0.org.afilias-nst.org.
org.            170517  IN  NS  a0.org.afilias-nst.info.
org.            170517  IN  NS  b2.org.afilias-nst.org.

;; ADDITIONAL SECTION:
a0.org.afilias-nst.info. 170517 IN  A   199.19.56.1
a0.org.afilias-nst.info. 170517 IN  AAAA    2001:500:e::1
a2.org.afilias-nst.info. 170517 IN  A   199.249.112.1
a2.org.afilias-nst.info. 170517 IN  AAAA    2001:500:40::1
b0.org.afilias-nst.org. 170517  IN  A   199.19.54.1
b0.org.afilias-nst.org. 170517  IN  AAAA    2001:500:c::1
b2.org.afilias-nst.org. 170517  IN  A   199.249.120.1
b2.org.afilias-nst.org. 170517  IN  AAAA    2001:500:48::1
c0.org.afilias-nst.info. 170517 IN  A   199.19.53.1
c0.org.afilias-nst.info. 170517 IN  AAAA    2001:500:b::1
d0.org.afilias-nst.org. 170517  IN  A   199.19.57.1
d0.org.afilias-nst.org. 170517  IN  AAAA    2001:500:f::1

;; Query time: 42 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Dec  2 14:27:03 2015
;; MSG SIZE  rcvd: 487

And here's an example of an NXDOMAIN response being returned:

$ dig www.cdn77.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> www.cdn77.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 28771
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cdn77.com.         IN  A

;; ANSWER SECTION:
www.cdn77.com.      327 IN  CNAME   1669655317.rsc.cdn77.org.

;; AUTHORITY SECTION:
cdn77.org.      59  IN  SOA ns1.cdn77.org. admin.cdn77.com. 1449062655 10800 180 604800 60

;; Query time: 34 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Dec  2 14:24:52 2015
;; MSG SIZE  rcvd: 115

We use Google's public name servers as forwarders, and they never seem to respond with NXDOMAIN:

$ dig www.cdn77.com @8.8.8.8

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> www.cdn77.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35091
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cdn77.com.         IN  A

;; ANSWER SECTION:
www.cdn77.com.      851 IN  CNAME   1669655317.rsc.cdn77.org.
1669655317.rsc.cdn77.org. 0 IN  A   185.59.220.11

;; Query time: 40 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Dec  2 14:29:16 2015
;; MSG SIZE  rcvd: 85

The authoritative answer, by the way, looks like this:

$ dig 1669655317.rsc.cdn77.org @ns1.cdn77.org

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> 1669655317.rsc.cdn77.org @ns1.cdn77.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11529
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;1669655317.rsc.cdn77.org.  IN  A

;; ANSWER SECTION:
1669655317.rsc.cdn77.org. 1 IN  A   185.59.220.12

;; Query time: 20 msec
;; SERVER: 37.235.105.100#53(37.235.105.100)
;; WHEN: Wed Dec  2 14:32:57 2015
;; MSG SIZE  rcvd: 58

Interestingly, even though the authoritative TTL for the record is one second, Google's public name server always reduces it to zero (see this article for an interesting read about this behavior). I don't think this has anything to do with the problem, though, as the successful responses from our bind also show a TTL of zero.

I've increased bind's logging level, but I find it very hard to identify any entries that might be related to the problem. Even with querylog activated, all that's visible is the query itself and lines such as resolver: debug 1: createfetch: 1669655317.rsc.cdn77.org A.

Any pointers towards how to better diagnose (or even solve) this issue would be greatly appreciated.

Best Answer

The problem is that the authoritative nameservers for cdn77.org fail to properly handle ECS (EDNS Client-Subnet) options when they contain an IPv6 client subnet, although they handle IPv4 client subnets just fine.

You can check this with the online KeyCDN DNS Lookup tool (select the details checkbox, de-select the recursive checkbox, and omit the @ before ns1 when you enter it as the Custom DNS server) or, if you build dig with EDNS client-subnet support, on the command line:

$ dig 1669655317.rsc.cdn77.org @ns1.cdn77.org +subnet=2001:db8::1
; <<>> DiG 9.10.1 <<>> +additional 1669655317.rsc.cdn77.org @ns1.cdn77.org +subnet=2001:db8::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 44989
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1680
; CLIENT-SUBNET: 2001:db8::1/128/0
;; QUESTION SECTION:
;1669655317.rsc.cdn77.org.  IN  A

;; AUTHORITY SECTION:
cdn77.org.      60  IN  SOA ns1.cdn77.org. admin.cdn77.com. 1449094813 10800 180 604800 60

;; Query time: 2 msec
;; SERVER: 37.235.105.100#53(37.235.105.100)
;; WHEN: Wed Dec 02 22:21:41 UTC 2015
;; MSG SIZE  rcvd: 132

The same query with an IPv4 client address works just fine:

$ dig 1669655317.rsc.cdn77.org @ns1.cdn77.org +subnet=192.0.2.1
; <<>> DiG 9.10.1 <<>> +additional 1669655317.rsc.cdn77.org @ns1.cdn77.org +subnet=192.0.2.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19104
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1680
; CLIENT-SUBNET: 192.0.2.1/32/32
;; QUESTION SECTION:
;1669655317.rsc.cdn77.org.  IN  A

;; ANSWER SECTION:
1669655317.rsc.cdn77.org. 1 IN  A   185.93.3.27

;; Query time: 2 msec
;; SERVER: 37.235.105.100#53(37.235.105.100)
;; WHEN: Wed Dec 02 22:42:13 UTC 2015
;; MSG SIZE  rcvd: 81
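
To repeat this comparison for other names or servers, something along these lines might help. It is only a sketch: the function names are made up for illustration, it assumes a dig binary built with the +subnet option shown above, and it reads the status from dig's HEADER line rather than doing any DNS parsing of its own.

```python
import re
import subprocess

# Matches the status field in dig's ";; ->>HEADER<<- ..." line.
STATUS_RE = re.compile(r"->>HEADER<<-.*?status:\s*([A-Z0-9]+)")


def parse_status(dig_output):
    """Return the status name (e.g. 'NOERROR', 'NXDOMAIN') or None if absent."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def dig_command(name, server, subnet=None):
    """Build the dig argument list; +subnet needs a dig built with ECS support."""
    cmd = ["dig", name, "@" + server]
    if subnet is not None:
        cmd.append("+subnet=" + subnet)
    return cmd


def status_for(name, server, subnet=None, timeout=10):
    """Run dig once and return the response status (needs dig and network access)."""
    result = subprocess.run(dig_command(name, server, subnet),
                            capture_output=True, text=True, timeout=timeout)
    return parse_status(result.stdout)


def compare_families(name, server, v4_subnet="192.0.2.1", v6_subnet="2001:db8::1"):
    """Query once per client subnet and map each subnet to the status it got."""
    return {subnet: status_for(name, server, subnet)
            for subnet in (v4_subnet, v6_subnet)}
```

Calling compare_families("1669655317.rsc.cdn77.org", "ns1.cdn77.org") would run the two queries shown above; if the statuses differ between the IPv4 and IPv6 subnets, the server is treating the two address families differently.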

When you send your query to an IPv6 address of Google Public DNS, your client subnet is of course an IPv6 subnet, so the authoritative server answers NXDOMAIN, and the (possibly cached) answer Google gives to IPv6 clients is NXDOMAIN too. If you send your query to an IPv4 address of Google Public DNS, your client subnet is an IPv4 subnet, and you get the correct (possibly cached) answer. If your bind reaches Google over both IPv4 and IPv6, that would explain why the failures are intermittent.
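
For reference, here is roughly what differs on the wire between the two cases. This is a minimal sketch of the client-subnet option layout as I understand it from RFC 7871 (option code 8, then family, source prefix length, scope prefix length, and the address truncated to the prefix); the function name is made up, and it says nothing about how the cdn77.org servers actually parse the option.

```python
import ipaddress
import struct

ECS_OPTION_CODE = 8  # EDNS Client Subnet option code


def ecs_option(subnet):
    """Encode a client-subnet option for a network such as '192.0.2.0/24'."""
    # strict=False clears any host bits, as the option format requires.
    network = ipaddress.ip_network(subnet, strict=False)
    family = 1 if network.version == 4 else 2  # 1 = IPv4, 2 = IPv6
    prefix_len = network.prefixlen
    # Only the bytes needed to cover the prefix are sent.
    address = network.network_address.packed[:(prefix_len + 7) // 8]
    # Scope prefix length is 0 in queries.
    payload = struct.pack("!HBB", family, prefix_len, 0) + address
    return struct.pack("!HH", ECS_OPTION_CODE, len(payload)) + payload


if __name__ == "__main__":
    for subnet in ("192.0.2.1/24", "2001:db8::1/32"):
        print(subnet, ecs_option(subnet).hex())
```

The IPv4 and IPv6 forms differ only in the family field and in how many address bytes follow, so a server that mishandles family 2 would fail in exactly the way shown above while still answering IPv4 subnets correctly.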
