SSH – 408 errors from Apache, "fork: cannot allocate memory" from dhclient and sshd

amazon-ec2, apache-2.4, dhcp, memory-leak, ssh

For the last three nights in a row, I've had an EC2 server start to give 408 errors in response to web requests. When I come in in the morning, I can't ssh in; I have to reboot using the management console. Both dhclient and sshd are giving error messages that say "fork: Cannot allocate memory".

As far as I can tell, this is only happening to one server. The details have been slightly different each time:

The first night, it first happened around 19:30 (according to /var/log/messages), but there was still a "bound to" message. Then from around 20:00 to 20:30, there are a lot of DHCPREQUESTs, and after that no successful binding. The sshd errors start at around 21:10 (according to /var/log/secure).

The second night, we see the DHCPREQUEST lines at 18:45 to 19:15, and the fork error starts after that. The sshd errors start at 18:20.

At this point I upgraded dhclient through yum to see if that would help. (I hadn't yet noticed the sshd errors.) It didn't.

The third night looks like the first, with the fork error at 18:30 and the DHCPREQUESTs from 19:00 to 19:30. But then at 4:15 in the morning, the OOM killer comes in and kills off an httpd process. The OOM killer hadn't shown up the first two nights. The sshd errors start at 19:30 and there are a lot of "Received disconnect" errors at 4:15.

This thread on the AWS developer forums suggests that dhclient might have a memory leak involving an environment variable, but if so, I can't see it. This also doesn't seem to be a slow leak: it's been happening earlier each night, but I rebooted the server at 17:00 after upgrading dhclient, so it had been up for less than two hours the third time.

I've considered a memory leak from apache, but it doesn't seem to coincide with anything particular in the apache logs, and I haven't been able to trigger it by sending several simultaneous memory-intensive requests to the server. And in that case, I'd expect the OOM killer to have been involved all three nights.
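(By "several simultaneous memory-intensive requests" I mean roughly the following kind of test with ApacheBench; the URL is a placeholder for one of our heavier endpoints, and the request counts are arbitrary:)

    # Fire 500 requests, 50 at a time, at a memory-heavy endpoint.
    # ab ships with the httpd-tools package; /heavy-endpoint is a placeholder.
    ab -n 500 -c 50 https://example.com/heavy-endpoint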

There is one noteworthy thing in the apache logs, which is the timestamps on three consecutive lines: 24/Feb/2017:02:10:05, 23/Feb/2017:18:23:05, 24/Feb/2017:07:03:20. The second of those requests was a 500, not a 408. So I guess that request was somehow running for eight or more hours, and that might have been eating memory. There's nothing like that the first two nights.

Basically, I have no real idea what's going on. My current plan is to start up a new server in the same placement group, point the domain at that instead, and leave both running and see what happens. But I'm looking for suggestions for how I can diagnose and fix this.

Update

I've since had this happen again after installing a simple ps/cron monitor as suggested by user ochach. It seems that I was indeed running out of memory, with httpd being the culprit; I don't know why the OOM killer didn't run.

Best Answer

Install monitoring tools and check which process is memory-hungry. Once you know which process has the memory leak, you can try to isolate the issue from there. Also check dmesg for any OOM kills the kernel performed.
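For example (the grep pattern below is just one common way to spot OOM activity, not the only one):

    # Show kernel log with human-readable timestamps and look for OOM-killer lines
    dmesg -T | grep -iE 'out of memory|oom|killed process'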

To pinpoint the issue, you can run "ps aux --sort -rss | head -n 10" every minute and append the output to a file on a non-ephemeral device.
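A crontab entry along these lines would do it (the log path is just a placeholder; put it on a non-ephemeral volume):

    # Every minute, log a timestamp and the 10 most memory-hungry processes.
    * * * * * { date; ps aux --sort -rss | head -n 10; } >> /var/log/memhog.log 2>&1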

Beyond that hack, you can install separate monitoring such as Nagios or Prometheus, or use sar/sysstat.
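With the sysstat package installed, something like this gives you memory history without any extra setup:

    # Report memory utilization every 60 seconds, 10 samples
    sar -r 60 10
    # Or review the memory data sadc has already collected today
    sar -r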