Why does this DNS lookup fail for me but work for others

domain-name-system

Day One

I have to hide the actual host names, so I'm hoping there is still enough information to answer this question…

I'm trying to resolve a certain host name (let's pretend it's www.example.com, but this is not the actual host name). A simple dig request works, but when I try to do a series of dig starting from a root nameserver, I hit a dead-end. Here's an example:

# Starting with arbitrarily-chosen root nameserver
$ dig @198.41.0.4 www.example.com
(returns the usual list of TLD .com nameservers)

# Using a.gtld-servers.net
$ dig @192.5.6.30 www.example.com
(returns a list of 5 example.com authorities)

At this point, I tried each of the 5 example.comauthorities. Three of them fail with status SERVFAIL, and the remaining two time out. Here's a SERVFAIL example:

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 33577
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.example.com.       IN  A

;; Query time: 74 msec
;; SERVER: <intentionally removed>
;; WHEN: Tue Mar  8 10:10:33 2011
;; MSG SIZE  rcvd: 37

I tried this multiple times, from my own machine at home and from a remote machine in our co-lo, and both machines consistently get the same results.

However,

  • As I mentioned above, dig www.example.com (without specifying an @server) works fine.
  • This DNS trace utility is able to resolve the host name, and it clearly shows that it's using one of the name servers that times out for me!

Can anybody help me figure out what's going on?

EDIT 1: In case it helps, what should happen is that this host name should ultimately resolve to a CNAME record pointing to www.example.com.edgesuite.net, which should in turn resolve to another CNAME record pointing to an Akamai edge server.

EDIT 2: Per Joris's recommendation, I ran dig +trace www.example.com, and it actually failed to find a result. It gets to the same list of example.com authorities that I found before, and stops there.

Caching seems like a very likely culprit (and I did think of this earlier), but the weird part is that the actual host name isn't that popular. Would it be cached on two different ISP local nameservers if I'm the first person to request it? 🙂


Day Two

OK, I've discovered a few things:

  1. The two example.com authorities that I thought were timing out (as opposed to the other three, that were returning SERVFAIL) are not actually timing out. They just require a much longer timeout. If I use dig +time=10, for example, then I do eventually get back a result.
  2. I've tried this from several servers around the U.S., and the story is the same — using dig www.example.com returns a result very quickly, but dig @ns1.example.com (or @ns2.example.com) requires using a large timeout parameter.

So my new questions are:

  1. Could the result really be cached on a variety of proxying DNS servers, even though it's not a commonly-used host name? The TTL is 54,000 (or 15 hours, if I understand correctly).
  2. If not, then is it possible that ns1.example.com is somehow configured to return a result more quickly to proxying DNS servers than to my own dig queries (some kind of white list)? Or is that just crazy talk?

Best Answer

Don't obscure your DNS data when asking for help with DNS problems. It's pointless and silly, and this is a classic example of how it has served to obscure the actual problem that you have.

There are two major possibilities here:

  • You have intermittent connectivity to the content DNS servers. A common cause of such problems is an IP traffic routing problem, or a simple case of there being too many hops between you and them. Find out the IP addresses of the 5 content DNS servers in question, and use traceroute or some such to determine that you really do have IP connectivity to them all. Test UDP/IP connectivity to port 53, specifically, if your tool is capable of it.
  • The answer was provided in one of the paths that you didn't take when doing things manually, and that your resolving proxy DNS server only takes sometimes. The unfortunate truth about DNS query resolution is that there are a lot more paths that it can take down the tree of the DNS namespace than one might think from the superficial explanations of the process that exist. It's possible, for example, that the first CNAME resource record set (which you cannot obtain) was served up, and then cached by your resolving proxy DNS server, when mapping the names of some of the higher level content DNS servers to addresses. Given that your resolving proxy DNS server sometimes works, you can find out how it discovered the answer by looking at its query/response logs. (Some DNS server softwares have to have this logging explicitly turned on. Some have it on by default. How to do it for your particular software, which you haven't stated, is the domain of a separate question.)

Note that the only caching that occurrs here is local, on your resolving proxy DNS server. The content DNS servers that you are querying don't cache. (Or, more strictly speaking, if they do cache they cache the back-end databases that they are working from, in ways that have little to nothing to do with resource record TTLs, and that aren't publicly visible through the DNS protocol.)

There are a couple of minor and fairly unlikely possibilities as well, including daft firewalls at your site rewriting DNS traffic on the fly as it passes through them. But since you've not provided proper data, there's little more to narrow down and rule-out the possibilities that you can obtain from random passers-by out on Internet.