Intermittent recursive/iterative DNS query failure

domain-name-system

I have a problem issuing DNS queries, and I'm not sure where to look for the underlying cause.

The record "www.alumninews.uottawa.ca" is a CNAME that points to "uottawa.mailoutinteractive.com", an A record which I host. When I query my ISP's DNS servers, I get different responses to the same query:

The first does not recurse (the status is NXDOMAIN):

$ dig +recurse www.alumninews.uottawa.ca @64.59.184.13

; <<>> DiG 9.8.1-P1 <<>> +recurse www.alumninews.uottawa.ca @64.59.184.13
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13260
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.alumninews.uottawa.ca. IN  A

;; ANSWER SECTION:
www.alumninews.uottawa.ca. 3600 IN  CNAME   uottawa.mailoutinteractive.com.

;; Query time: 139 msec
;; SERVER: 64.59.184.13#53(64.59.184.13)
;; WHEN: Wed Apr  3 11:33:55 2013
;; MSG SIZE  rcvd: 87

Note that the CNAME does not get resolved (more on that below).

The second resolves the CNAME correctly (note the TTL is now 3532, not the default 3600 above):

$ dig +recurse www.alumninews.uottawa.ca @64.59.184.13

; <<>> DiG 9.8.1-P1 <<>> +recurse www.alumninews.uottawa.ca @64.59.184.13
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16716
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.alumninews.uottawa.ca. IN  A

;; ANSWER SECTION:
www.alumninews.uottawa.ca. 3532 IN  CNAME   uottawa.mailoutinteractive.com.
uottawa.mailoutinteractive.com. 300 IN  A   209.15.195.166

;; Query time: 30 msec
;; SERVER: 64.59.184.13#53(64.59.184.13)
;; WHEN: Wed Apr  3 11:35:03 2013
;; MSG SIZE  rcvd: 103

Further, when I capture the network traffic with Wireshark, I see that the error when looking up uottawa.mailoutinteractive.com is "Reply code: No such name (3)" on the failed recursion:

Domain Name System (response)
[Request In: 3993]
[Time: 0.057954000 seconds]
Transaction ID: 0xf07c
Flags: 0x8183 Standard query response, No such name
    1... .... .... .... = Response: Message is a response
    .000 0... .... .... = Opcode: Standard query (0)
    .... .0.. .... .... = Authoritative: Server is not an authority for domain
    .... ..0. .... .... = Truncated: Message is not truncated
    .... ...1 .... .... = Recursion desired: Do query recursively
    .... .... 1... .... = Recursion available: Server can do recursive queries
    .... .... .0.. .... = Z: reserved (0)
    .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
    .... .... ...0 .... = Non-authenticated data: Unacceptable
    .... .... .... 0011 = Reply code: No such name (3)
Questions: 1
Answer RRs: 1
Authority RRs: 0
Additional RRs: 0
Queries
    www.alumninews.uottawa.ca: type A, class IN
        Name: www.alumninews.uottawa.ca
        Type: A (Host address)
        Class: IN (0x0001)
Answers
    www.alumninews.uottawa.ca: type CNAME, class IN, cname uottawa.mailoutinteractive.com
        Name: www.alumninews.uottawa.ca
        Type: CNAME (Canonical name for an alias)
        Class: IN (0x0001)
        Time to live: 1 hour
        Data length: 32
        Primaryname: uottawa.mailoutinteractive.com

A successful lookup looks like this in Wireshark (this is a different domain with the same problem):

Domain Name System (response)
[Request In: 70]
[Time: 0.051422000 seconds]
Transaction ID: 0x417d
Flags: 0x8180 Standard query response, No error
    1... .... .... .... = Response: Message is a response
    .000 0... .... .... = Opcode: Standard query (0)
    .... .0.. .... .... = Authoritative: Server is not an authority for domain
    .... ..0. .... .... = Truncated: Message is not truncated
    .... ...1 .... .... = Recursion desired: Do query recursively
    .... .... 1... .... = Recursion available: Server can do recursive queries
    .... .... .0.. .... = Z: reserved (0)
    .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
    .... .... ...0 .... = Non-authenticated data: Unacceptable
    .... .... .... 0000 = Reply code: No error (0)
Questions: 1
Answer RRs: 2
Authority RRs: 0
Additional RRs: 0
Queries
    www.bulletinsanciens.uottawa.ca: type A, class IN
        Name: www.bulletinsanciens.uottawa.ca
        Type: A (Host address)
        Class: IN (0x0001)
Answers
    www.bulletinsanciens.uottawa.ca: type CNAME, class IN, cname uottawa.mailoutinteractive.com
        Name: www.bulletinsanciens.uottawa.ca
        Type: CNAME (Canonical name for an alias)
        Class: IN (0x0001)
        Time to live: 41 minutes, 26 seconds
        Data length: 32
        Primaryname: uottawa.mailoutinteractive.com
    uottawa.mailoutinteractive.com: type A, class IN, addr 209.15.195.166
        Name: uottawa.mailoutinteractive.com
        Type: A (Host address)
        Class: IN (0x0001)
        Time to live: 5 minutes
        Data length: 4
        Addr: 209.15.195.166 (209.15.195.166)

Uottawa's DNS servers are configured not to return recursive query results, so my understanding is that my ISP will do a second query to resolve the CNAME. But I don't know why it fails once and then succeeds the second time. It seems to me to be a problem between our ISP (Shaw) and Route 53, where my DNS is hosted.

I also notice that it often continues to fail: I can keep running the failing dig command for quite a while before it succeeds again.

I've gotten this far but don't know how to debug this any further. Any idea where this is failing?

Best Answer

The packet capture isn't revealing anything that your dig queries did not. "Reply code: No such name (3)" is a long-winded way of saying NXDOMAIN (RCODE 3), the latter of which is more meaningful to DNS administrators. I will not remove the packet capture from your post, but it will be less of a wall of text for others to sift through if you find yourself agreeing with me on this point.

A response of NXDOMAIN is a problem for you, but it actually indicates that your ISP's recursive nameservers completed the lookup. It's bad behavior from your perspective because the record appears to be missing, but the way in which it failed tells a different story. Your ISP's servers are saying: "I talked to the authoritative nameservers, received a successful reply, and they told me the record didn't exist." This is quite different from SERVFAIL, which would indicate an actual communication problem.
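To make the NXDOMAIN reading concrete, here is a minimal Python sketch that unpacks the two flags words from the captures in the question (0x8183 and 0x8180). It assumes the standard DNS header bit layout; the function name and the small RCODE table are illustrative, not part of any library.

```python
# Sketch: decode the 16-bit flags word of a DNS header.
# RCODES lists only the response codes relevant to this discussion.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def decode_dns_flags(flags):
    """Return the header fields that matter here as a dict."""
    rcode = flags & 0xF                      # low 4 bits: response code
    return {
        "qr": bool((flags >> 15) & 1),       # 1 = message is a response
        "opcode": (flags >> 11) & 0xF,       # 0 = standard query
        "aa": bool((flags >> 10) & 1),       # authoritative answer
        "tc": bool((flags >> 9) & 1),        # truncated
        "rd": bool((flags >> 8) & 1),        # recursion desired
        "ra": bool((flags >> 7) & 1),        # recursion available
        "rcode": RCODES.get(rcode, "RCODE%d" % rcode),
    }


failed = decode_dns_flags(0x8183)     # flags from the failed lookup
succeeded = decode_dns_flags(0x8180)  # flags from the successful lookup

print(failed["rcode"], succeeded["rcode"])
```

Both responses have the same rd and ra bits set and differ only in the response code, which is the same information dig prints as "status:".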

The different responses between queries are most likely due to load balancing: there are multiple servers behind the IP address that you are querying. One of them has "negatively cached" the lookup failure and will not attempt the lookup again until the ncache interval for that domain expires. Another of their servers succeeded, and "positively cached" it, causing it to remember that answer for the duration of the TTL. (3532 means 68 seconds have elapsed since that event, 3532+68 = 3600)
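The TTL arithmetic can be checked with a few lines of Python. This is only a worked example: the TTL values and the WHEN timestamps are copied from the two dig outputs in the question, and the helper name is made up for illustration.

```python
from datetime import datetime

ORIGINAL_TTL = 3600   # CNAME TTL in the first (NXDOMAIN) response
REMAINING_TTL = 3532  # CNAME TTL in the second (NOERROR) response


def seconds_in_cache(original_ttl, remaining_ttl):
    """A cached record's TTL counts down, so the difference is its age."""
    return original_ttl - remaining_ttl


age = seconds_in_cache(ORIGINAL_TTL, REMAINING_TTL)

# WHEN lines from the two dig runs (Wed Apr 3 2013, 11:33:55 and 11:35:03).
first_query = datetime(2013, 4, 3, 11, 33, 55)
second_query = datetime(2013, 4, 3, 11, 35, 3)
gap = int((second_query - first_query).total_seconds())

print(age)  # 68
print(gap)  # 68
```

The age of the cached CNAME equals the gap between the two queries, so the answering server cached the CNAME at about the time of the first query.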

Conclusion

Due to the distributed nature of AWS, it will be extremely difficult for any of us to give you advice beyond this. I queried the four nameserver addresses that were served to me and found no problems.

If you see this issue again, you can try querying the A record directly to see if anything stands out:

dig uottawa.mailoutinteractive.com @64.59.184.1
(+recurse is set by default, so it is not necessary.)
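Since the failure is intermittent, it may help to repeat the query and count how often each status comes back. The following Python sketch is one way to do that, assuming dig is installed and on the PATH; the helper names are hypothetical, and the sample lines are the header lines from the question.

```python
import re
import subprocess
from collections import Counter

# Matches the "status: NXDOMAIN" part of dig's ->>HEADER<<- line.
STATUS_RE = re.compile(r"status: ([A-Z]+)")


def parse_status(dig_output):
    """Return the status string from dig output, or None if absent."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def run_dig(name, server):
    """Run one dig query and return its text output (needs dig on PATH)."""
    result = subprocess.run(
        ["dig", name, "@" + server],
        capture_output=True, text=True, timeout=15,
    )
    return result.stdout


def tally(outputs):
    """Count how many outputs carried each status."""
    return Counter(parse_status(text) for text in outputs)


# Offline demonstration using the two header lines from the question.
sample = [
    ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13260",
    ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16716",
]
counts = tally(sample)
print(counts)

# Live use (not executed here), e.g. 20 queries against the ISP resolver:
# tally(run_dig("uottawa.mailoutinteractive.com", "64.59.184.13")
#       for _ in range(20))
```

A mix of NXDOMAIN and NOERROR over a short run gives your ISP concrete numbers to work with when you report the problem.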

Your best bet is to ask your ISP to investigate further the next time it happens, but be prepared for a response of "our server is doing what it was told to do and we can't help you".