FreeBSD – BIND 9.10 constantly killed on FreeBSD 10.0 with "out of swap space"

bind, domain-name-system, freebsd, kill-process, swap

On one of our slave DNS servers, BIND (version bind910-9.10.0P2_3) is constantly killed with the following message in /var/log/messages:

Jul 30 01:00:10 cinnabar kernel: pid 602 (named), uid 53, was killed: out of swap space

This service runs on a FreeBSD 10.0 VM under XenServer 6.2 with 512MB of system memory.

At the moment, pstat -m -s returns this:

Device          1M-blocks     Used    Avail Capacity
/dev/ada0p3           512        9      502     2%

I don't think it's a swap problem; it looks more like a memory leak, but I'm not sure.

EDIT: Access information.

This is one of two slave DNS servers; they only hold the zones from the authoritative server and act as recursive resolvers for internal users reaching the outside world. The number of clients is somewhere between 700 and 1500 simultaneous users. We have a /21 internal address space and a /23 public IPv4 space, and since there are no queries from the outside world, port 53 is blocked at the firewall for these machines.
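
In named.conf terms the setup looks roughly like this (the zone name, master address, and internal network below are placeholders, not our real ones):

acl internal { 10.0.0.0/21; };         // placeholder internal /21

options {
    recursion yes;
    allow-recursion { internal; };     // recursion for internal users only
};

zone "example.com" {
    type slave;                        // zones are only held as slaves
    masters { 192.0.2.1; };            // the authoritative server
    file "slave/example.com.db";
};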

Best Answer

If you have any kind of monitoring on this server, it would be nice to check whether there are peaks in memory usage right around the time processes get killed. Then you could try to find a correlation with the number of requests, etc.
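
If nothing is in place, even a crude monitor helps. A minimal sketch (the log path and one-minute interval are arbitrary choices, not from this answer) that records named's footprint so peaks can be lined up against the kill timestamps in /var/log/messages:

# crontab entry: append a timestamped snapshot of named's
# memory use (RSS and VSZ, in KB) once a minute
* * * * * (date; /bin/ps -o pid,rss,vsz -p $(/bin/pgrep named)) >> /var/log/named-mem.log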

That being said, it could mean there is genuinely no memory left on the system, but more likely BIND is requesting a contiguous area of memory, fragmentation is getting in the way, and FreeBSD is trying to swap out some processes to make room for it. It probably can't swap out enough pages, the allocation fails, and the out-of-memory killer is triggered.

If you have disk space, the easiest solution is to add more swap through a swap file (no need for a partition). Ideally, you should also limit the cache size (BIND defaults to unlimited), as suggested by Håkan, though that could have a performance impact. Without more statistics it's really hard to tell. Even domestic routers ship with 512MB of RAM nowadays, so you should consider increasing the memory (and limiting the cache) on a production server handling 700-1500 simultaneous users (which could translate into many more requests per second; again, without more information it's hard to tell).
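
On FreeBSD 10 the swap file can be md-backed; a sketch, assuming a 1 GB file at /usr/swap0 (size and path are examples):

# create the backing file and enable it as swap
dd if=/dev/zero of=/usr/swap0 bs=1m count=1024
chmod 0600 /usr/swap0
mdconfig -a -t vnode -f /usr/swap0 -u 99
swapon /dev/md99

# to re-enable it at boot, add to /etc/fstab:
md99    none    swap    sw,file=/usr/swap0,late 0   0

The late keyword keeps the swap file from being activated before the filesystem holding it is mounted.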
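
Capping the cache is a one-line change in named.conf; a sketch (128M is an arbitrary starting point for a 512MB machine, not a figure from this answer):

options {
    // cap the resolver cache instead of the 9.10 default (unlimited)
    max-cache-size 128M;
};

Run named-checkconf to verify the file before reloading.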

You could also try tweaking the malloc implementation via the MALLOC_PRODUCTION knob, but I think that is too extreme given the easier solutions available.
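
For the record, MALLOC_PRODUCTION is a source-build knob, so using it means rebuilding libc; a hedged sketch, assuming a FreeBSD 10 source tree in /usr/src and that your build does not already define it:

# /etc/make.conf: build jemalloc without its debugging code
MALLOC_PRODUCTION=yes

# then rebuild and reinstall libc from source
cd /usr/src/lib/libc && make && make install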