Linux – /etc/resolv.conf order not respected by `ping`

domain-name-system linux

CentOS 7. My problem is a seemingly common issue where nslookup can resolve a host, but ping can't. However, the common answers like messing with avahi or /etc/nsswitch.conf don't help because my VPS is running neither Avahi nor NetworkManager. (In other words, I can break /etc/nsswitch.conf by setting hosts: files and ping continues to work.)

/etc/resolv.conf is as follows:

nameserver 10.44.13.246
nameserver 10.32.72.88
nameserver 10.32.72.86

Where the first nameserver points to an instance of dnsmasq that's running on another of my VPSes, and the last two are the hosting provider's DNS servers. I expect them to be queried in order (the last two are simply last-resort fallbacks).

Now, for any of the hosts defined in that dnsmasq instance, nslookup always works, and ping works some of the time: a host will resolve properly, then break, then a few minutes later it will be fine again. However, if I remove the upstream DNS servers in /etc/resolv.conf like this,

nameserver 10.44.13.246
#nameserver 10.32.72.88
#nameserver 10.32.72.86

then ping immediately starts to work 100% of the time. This directly contradicts the resolv.conf docs, which say that in the absence of an options rotate directive, the servers are queried in order until one sends a response.

nscd is running and is being hit, because I can see the cache hit/miss counters go up for these problematic queries.

How can I resolve this?

Best Answer

I don't have a direct answer to the larger question but answers for some distinct parts of it.


Regarding ping vs nslookup

It's worth noting that ping is just an example of a regular program that uses the OS resolver library (i.e. getaddrinfo/gethostbyname calls). nslookup (as well as dig, etc.) is a DNS client program that makes DNS queries of its own rather than going through the resolver library. These tools just happen to pick up their default server from the system resolver's configuration file as a matter of convenience.

What this means is that nslookup is bad for testing how the system resolver behaves (i.e. resolv.conf, nsswitch.conf, etc.), while ping, for example, is bad for testing DNS.

On Linux, I would consider getent ahosts (e.g. getent ahosts www.example.com) a better choice for testing resolver behavior, and dig much preferable to nslookup for testing DNS.
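If you would rather script the check, the same resolver path can be exercised from Python, since socket.getaddrinfo hands the lookup to the C library. This is only a minimal sketch, assuming CPython on a glibc system; the function name resolve_via_system is made up for illustration.

```python
import socket
import sys


def resolve_via_system(name):
    """Resolve a name through the OS resolver library (getaddrinfo).

    Returns (addresses, error): a sorted list of unique addresses and None
    on success, or an empty list and the resolver's error message on failure.
    resolve_via_system is a hypothetical helper name, not an existing API.
    """
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return [], str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address as a string.
    return sorted({info[4][0] for info in infos}), None


if __name__ == "__main__":
    for host in sys.argv[1:] or ["localhost"]:
        addresses, error = resolve_via_system(host)
        print(host, "->", addresses if error is None else "FAILED: " + error)
```

Running it in a loop against one of the dnsmasq-only names should show the same intermittent failures ping shows, without any ICMP traffic getting in the way.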


Regarding what you can do to see what is happening

As Hangin on in quiet desperation suggested, you may want to use strace (and maybe ltrace for a higher-level view). I would suggest running it against getent ahosts rather than ping, so you don't get all the noise from ping's actual purpose when name resolution is only a side effect there. getent ahosts does just the one thing you're trying to investigate.


Regarding what to have in resolv.conf

What you're saying about things "breaking" when the "wrong" server is queried makes me wonder why you are putting all of those servers in resolv.conf in the first place. It's generally not a good idea to list servers that behave differently (in a way that actually matters for your use) together in resolv.conf, because the resolver treats every listed server as interchangeable.
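To check whether the listed servers really do answer differently, you could ask each one directly, bypassing the system resolver and nscd. The following is a rough sketch, not a full DNS client: it assumes IPv4 nameservers, plain UDP on port 53 and A-record queries only, and all function names are invented for illustration. A return code of 0 is NOERROR and 3 is NXDOMAIN. If the provider's servers return NXDOMAIN for names that only dnsmasq knows, that would be exactly the kind of significant difference I mean.

```python
import socket
import struct
import sys


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in file order, skipping commented lines."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Terminating root label, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_rcode_and_answers(response):
    """Return (rcode, answer_count) taken from the 12-byte DNS header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount


def ask(server, name, timeout=2.0):
    """Send one UDP query to one server; None means no usable reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(512)
        except OSError:
            return None
    return parse_rcode_and_answers(response)


if __name__ == "__main__":
    name = sys.argv[1]
    with open("/etc/resolv.conf") as handle:
        for server in parse_nameservers(handle.read()):
            print(server, ask(server, name))
```

Invoked with one of the dnsmasq-only host names as its argument, it prints one line per configured server, so any disagreement between them is visible at a glance.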