BIND – Slow Query Responses Isolated To Single Interface

Tags: bind, domain-name-system, performance

I'm currently experiencing slow query responses on one specific interface of my nameserver. I'm running BIND on a physical server with a single network card, which is used by the interface eth0 and by the virtual interface eth0:1. Both have an address in the same subnet.
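For context, the alias is set up in the usual RHEL way, with one ifcfg file per interface. A rough sketch of the layout is below; the 192.0.2.x addresses are placeholders, not the real ones:

# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0:1
DEVICE=eth0:1
BOOTPROTO=static
IPADDR=192.0.2.11
NETMASK=255.255.255.0
ONBOOT=yes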

BIND is listening on all IPv4 interfaces, and has some very basic options set. There are no other performance / network related options set in any other included configuration file.

listen-on { any; };
listen-on-v6 port 53 { ::1; };
directory       "/var/named";
dump-file       "/var/named/data/cache_dump.db";
statistics-file "/var/log/named/named.stats";
memstatistics-file "/var/named/data/named_mem_stats.txt";
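For reference, which addresses named has actually bound to can be confirmed with something like this (assuming net-tools is installed):

netstat -ulnp | grep :53      # UDP sockets on port 53 and the owning process
netstat -tlnp | grep :53      # the TCP listeners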

When I query the address on the primary interface eth0, the response is normally delayed by around three seconds or more. This applies even when querying that address from the box itself (not via the loopback). When querying the other private IP address, the one assigned to the virtual interface eth0:1, there is no performance problem and the response always comes back in under one second.

Looking at performance statistics, the box doesn't appear to be under load and memory isn't being maxed out. I've also got another nameserver set up as a slave to this one, on the same network with a near-identical network setup bar addressing (including a virtual interface with identical configuration), and I have no performance problems querying its main interface. The zones I'm querying are ones this server is authoritative for, so there is no delay from looking up records elsewhere. I can also confirm, via tcpdump, that the query reaches the server almost instantly regardless of where it originates, and that the delay occurs between the query being received and the response being sent.
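For reference, the capture that showed this was along these lines (interface and filter are illustrative; I matched queries to responses by their query ID):

tcpdump -nn -ttt -i eth0 udp port 53
# -nn   don't resolve names/ports, so the capture doesn't generate DNS traffic of its own
# -ttt  print the time delta between packets, which makes the gap between the
#       incoming query and the outgoing response obvious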

If there's any information that would be useful to have, then rather than downvoting me for leaving it out, please just leave a comment below and I'll happily provide any helpful details I can. Any suggestions on how best to troubleshoot a problem of this nature, or ideas on what the potential causes could be, would be very much appreciated.

BIND version is 9.3.6-P1-RedHat-9.3.6-25.P1.el5_11.11. I've recently updated to this, but I'm unsure whether these performance issues came about following the upgrade, or whether they existed prior to it.

EDIT: Dig output as requested. Removed domain name being queried and target server.

Also worth noting that sometimes the requests just time out completely. It's quite intermittent: occasional replies come back in under two seconds, but most take over three, with the occasional timeout as mentioned.

[root@hugh-host-01 ~]# dig REMOVED @REMOVED

; <<>> DiG 9.9.4-RedHat-9.9.4-38.el7_3 <<>> REMOVED @REMOVED
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52129
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 4
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;REMOVED.                      IN      A

;; ANSWER SECTION:
REMOVED.               5       IN      A       REMOVED

;; AUTHORITY SECTION:
REMOVED.               5       IN      NS      REMOVED.
REMOVED.               5       IN      NS      REMOVED.
REMOVED.               5       IN      NS      REMOVED.

;; ADDITIONAL SECTION:
REMOVED.           5       IN      A       REMOVED
REMOVED.           5       IN      A       REMOVED
REMOVED.           5       IN      A       REMOVED

;; Query time: 3633 msec
;; SERVER: REMOVED#53(REMOVED)
;; WHEN: Sat Jan 07 00:49:01 GMT 2017
;; MSG SIZE  rcvd: 155

Thanks for your time,

Hugh

Best Answer

This problem was caused by iowait being maxed out on the server. It was running consistently at 100%, with the ext3 journalling kernel thread, kjournald, responsible for the I/O.

Thanks to a suggestion from Andrew B, I began investigating UDP packet errors using netstat -su | grep errors. From this I could see the error counters climbing by roughly 30-50 packets every second. That led me to check the per-socket UDP receive buffers by running netstat -uanp, which confirmed that the random delays and occasional timeouts (drops) were occurring because the buffer was full. I could see this from the Recv-Q column for the BIND socket listening on the IP / port in question.
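For anyone retracing this, the checks were roughly as follows (the grep patterns may need adjusting for your netstat version):

netstat -su | grep errors        # cumulative UDP error counters; run it a few
                                 # times to see how quickly they are climbing
netstat -uanp | grep named       # per-socket view; a Recv-Q that stays high on
                                 # named's UDP socket means the buffer is backing up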

After identifying that the buffer was full, there wasn't much point in increasing it, as it would no doubt become saturated again. Instead, since CPU load and RAM looked fine, I started wondering whether disk operations could be creating a bottleneck for the handling of UDP packets. This was confirmed by running top and looking at the iowait value.
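A rough sketch of that check: the %wa figure on the Cpu(s) line of top is the iowait percentage, and iostat is an alternative if the sysstat package is installed:

top -b -n 1 | head -5       # batch mode, one iteration; look at %wa on the Cpu(s) line
iostat -x 1 3               # extended stats, 1 second apart, 3 samples; shows %iowait
                            # plus per-device utilisation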

Once I'd identified that the CPU was spending nearly 100% of its time waiting for I/O operations to complete, I used tools such as iotop to find out what was writing to disk. It turned out that the journalling of the ext3 file system (kjournald) was generating all of the waiting. That led me to suspect that the extreme amount of logging on the server could be causing the saturation, as I knew /var/log/messages was receiving a ton of denied-query logs every second.
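iotop was the main tool here; it needs root, and these flags narrow the output down to the processes actually hitting the disk:

iotop -o -a      # -o only show processes doing I/O, -a show accumulated totals
                 # on this box, kjournald sat at the top of the list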

To test that theory, I added the following line to named.conf within the logging section. It disables logging of the approval / denial messages generated for received queries; one entry is written to /var/log/messages per query, which can be a lot if you are being barraged by clients:

category security { null; };
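For anyone wondering where that goes, it sits inside the logging block of named.conf; trimmed right down, it looks something like this (other channels and categories omitted):

logging {
        // ... existing channel and category statements ...
        category security { null; };
};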

Fortunately, after restarting BIND I saw the iowait percentage drop sharply, and test queries were now being answered well within a tenth of a second; a dramatic improvement on before.
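The before/after comparison was just the same dig as in the question, watching the reported query time (names redacted as above):

dig REMOVED @REMOVED | grep 'Query time'
# previously 3000+ msec or a timeout; now consistently well under 100 msec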

In hindsight I should have checked the iowait time initially. Hope this helps anyone experiencing similar issues. I'll now be looking into controlling the logging a little more and seeing what I can do about these denied messages.
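One option I'm considering for keeping the denied-query messages without hammering /var/log/messages is giving the security category its own size-capped file channel, along these lines (path and sizes are just an example, and the named user needs write access to the directory):

logging {
        channel security_file {
                file "/var/log/named/security.log" versions 3 size 10m;
                severity info;
                print-time yes;
        };
        category security { security_file; };
};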