Edit: I'll keep my original answer below, but I'll try to explain what's happening here and provide a general solution for you.
Edit 2: Provided another option.
The problem that you're hitting here has to do with how the kernel manages I/O. When you make a write to your filesystem, that write isn't immediately committed to disk; that would be incredibly inefficient. Instead, writes are cached in an area of memory referred to as the page cache, and periodically written out to disk in chunks. The "dirty" section of your log reports the size of this page cache that hasn't been written out to disk yet:
dirty:123816kB
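You can watch this number yourself; it's the same counter the kernel exposes in /proc/meminfo (the one-second refresh interval is just an illustrative choice):

# Show how much dirty data is waiting for writeback, refreshed every second
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'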
So what empties this dirty cache? Why isn't it doing its job?
'Flush' on Linux is responsible for writing dirty pages out to disk. It's a daemon that wakes up periodically, determines whether writes to disk are required, and performs them if so. If you are a C type of guy, the kernel's writeback code (mm/page-writeback.c) is the place to start. Flush is incredibly efficient; it does a great job of flushing stuff to disk when needed. And it's working exactly how it's supposed to.
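Assuming a kernel of roughly the era this answer dates from (2.6.32 through early 3.x), you can actually see the per-device flusher threads in the process list; newer kernels do writeback from generic kworker threads instead:

# Per-device flusher kernel threads appear as [flush-<major>:<minor>]
ps ax | grep '[f]lush'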
Flush runs outside of your LXC container, since your LXC container doesn't have its own kernel. LXC containers exist as a construct around cgroups, a Linux kernel feature that allows tighter limits and isolation of process groups, but not their own kernel or flush daemon.
Since your LXC has a memory limit lower than the memory available to the kernel, weird things happen. Flush assumes it has the full memory of the host to cache writes in. A program in your LXC starts to write a big file; it buffers... buffers... and eventually hits its hard limit, and starts calling the OOM manager. This isn't a failure of any particular component; it's expected behavior. Kind of. This sort of thing should be handled by cgroups, but it doesn't seem like it is.
This completely explains the behavior you see between instance sizes: you'll start flushing to disk much sooner on the micro instance (with 512MB of RAM) than on a large instance.
Ok, that makes sense. But it's useless. I still need to write me a big-ass file.
Well, flush isn't aware of your LXC's limit. So instead of patching the kernel, here are a few things you can try to tweak:
/proc/sys/vm/dirty_expire_centisecs
This controls how long a page can sit in the dirty cache before it must be written out to disk. The value is in centiseconds; by default it's 3000 (30 seconds). Try setting it lower to start pushing writes out sooner.
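For example, to expire dirty pages after 10 seconds instead of 30 (the specific value is just an illustrative starting point):

# 1000 centiseconds = 10 seconds
echo 1000 > /proc/sys/vm/dirty_expire_centisecs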
/proc/sys/vm/dirty_background_ratio
This controls what percentage of memory the dirty cache is allowed to fill before flush starts forcing writes in the background. There is a bit of fiddling that goes into sorting out exactly what total it's a percentage of, but the easiest approximation is to just look at your total memory. By default it's 10% (on some distros it's 5%). Set this lower; it'll force writes out to disk sooner and may keep your LXC from running into its limits.
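For example, to kick off background writeback at 3% of memory (again, an illustrative value to experiment with), and to persist it across reboots:

# Takes effect immediately
echo 3 > /proc/sys/vm/dirty_background_ratio

# Makes the setting survive a reboot
echo 'vm.dirty_background_ratio = 3' >> /etc/sysctl.conf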
Can't I just screw with the filesystem a bit?
Well, yeah. But make sure you test this out; you could hurt performance. In /etc/fstab, on the mounts where you'll be writing this, add the 'sync' mount option.
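A sketch of what that looks like, with a hypothetical data volume /dev/xvdb1 mounted on /data:

# /etc/fstab entry: the 'sync' option makes all writes to this mount synchronous
/dev/xvdb1  /data  ext4  defaults,sync  0  2

# Or apply it to an already-mounted filesystem without editing fstab
mount -o remount,sync /data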
Original answer:
Try reducing the block size used by dd:
dd if=/dev/zero of=test2 bs=512 count=1024000
You can only write one sector at a time (512 bytes on older HDDs, 4096 on newer). If dd is pushing writes faster than the disk can accept them, the kernel will start caching the writes in memory. That's why your file cache is growing.
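As an aside, if you want to take the page cache out of the picture entirely while testing, GNU dd can do direct I/O; note that direct I/O usually wants a much larger block size to perform reasonably:

# oflag=direct bypasses the page cache, so the dirty cache never grows
dd if=/dev/zero of=test2 bs=1M count=500 oflag=direct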
I've now resolved this so I'm posting the solution in case others experience the same issue.
I neglected to mention that all of our web traffic goes over HTTPS, and that appears to be the cause. During a stall I used strace and pstack to see what one of the idle Apache processes was doing. It was stuck waiting on a mutex for the SSL session cache.
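For anyone who wants to reproduce the diagnosis, attaching to a hung worker looks roughly like this (the PID is hypothetical):

# Show the system call the process is blocked in
strace -p 12345

# Dump the userspace stack to see the mutex wait
pstack 12345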
Looking at the Apache config, I noticed we had SSLSessionCache enabled with a timeout of 5 minutes. Disabling it was the fix.
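For reference, the relevant mod_ssl directives look something like this (the shmcb path and size are distro-specific placeholders, not our exact config); commenting them out disables the cache:

# SSLSessionCache        shmcb:/var/run/apache2/ssl_scache(512000)
# SSLSessionCacheTimeout 300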
My guess is that the session cache was filling up, then Apache was waiting for older sessions to time out before continuing.
Best Answer
CLOSE_WAIT means the other side has closed the connection; the socket stays in that state until the local program closes its socket descriptor. There is no time-out for CLOSE_WAIT, so a process can be stuck with a socket in this state indefinitely. When you kill the process and its children, their descriptors are closed and the sockets go away. Run
lsof
and see if the children have the sockets open. If they do, then it looks like a bug in their code. As for FIN_WAIT2, that's when the local side has closed its end of the connection and is waiting for the other side's FIN to finish closing it. However, there's a system-wide time-out on this state (see
/proc/sys/net/ipv4/tcp_fin_timeout
), which defaults to 60 seconds, so nothing should be stuck in this phase for longer than a minute. But it seems that it's possible to code a program in such a way that a half-closed connection looks like an active one to the kernel, so the time-out never kicks in. Again, it would seem that you've found a bug.
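To see which sockets are sitting in these states and which processes own them, either of these should work (assuming reasonably modern lsof and ss):

# Sockets stuck in CLOSE_WAIT, with the owning process
lsof -iTCP -sTCP:CLOSE_WAIT

# The same query via ss, plus the FIN_WAIT2 side
ss -tnp state close-wait
ss -tnp state fin-wait-2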