Linux – Why is mongod not using all available RAM

disk-cache, linux, mongodb

We have a mongod instance running on a VM, and it doesn't seem to be using all available memory. It's page-faulting far more than usual, and the system's performance has degraded noticeably lately.

More specifically, if I look at mongod in htop, I see:

  • VIRT: 3471G
  • RES: 11.8G

The VM has ~60 GB of memory; currently ~4.6 GB is "used", and the remainder is in buffers or cache.

My understanding is that mongod mmaps the database files. (This is why VIRT is huge.) However, we're not clear on why the RES number isn't closer to 60 GB: as mongod needs data off disk, that data should be brought into the process's RSS, no? Mongo reports that it is page-faulting, so one would assume that the RSS would grow over time; ours is holding steady.
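If you want to see exactly where htop gets those two numbers, the kernel exposes them per process. A minimal sketch (Python), assuming you've found mongod's PID separately, e.g. with `pgrep mongod`; VmSize and VmRSS are the standard /proc/&lt;pid&gt;/status fields behind VIRT and RES:

```python
# Compare the kernel's view of mongod's virtual vs. resident size.
MONGOD_PID = 1234  # placeholder, substitute the real mongod PID

def proc_memory(pid):
    """Return the VmSize/VmRSS lines from /proc/<pid>/status (values in kB)."""
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmSize", "VmRSS")):
                key, value = line.split(":", 1)
                fields[key] = int(value.strip().split()[0])  # kB
    return fields

mem = proc_memory(MONGOD_PID)
print(f"VIRT (VmSize): {mem['VmSize'] / 1024 / 1024:.1f} GB")  # ~= htop's VIRT
print(f"RES  (VmRSS):  {mem['VmRSS'] / 1024 / 1024:.1f} GB")   # ~= htop's RES
```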

There is nothing else significant running on this machine. (It's the database server.) What's consuming the rest of the buffers and cache, and specifically, why is the RES size of mongod not expanding to fill available RAM?

Best Answer

This can be a long and involved process, but let me say this as a starting point: I (and many others I have worked with) have managed to get far closer to maximum resident memory usage. Exactly what that maximum is varies from system to system and depends on a lot of variables, but I would generally shoot for 60-80%; anything higher is a bonus.

The next thing to do is some reading. There has been plenty written about this topic, often from the other perspective (better memory efficiency, fitting more into RAM when it is full, and so on).

With all that out of the way, you hopefully have a decent idea of how to tune your system to get the most out of the available memory (usually, but not always, knocking readahead down and making sure NUMA is effectively disabled; see the quick check below), and are able to see where else memory pressure may be coming from. The next piece to understand is a little trickier, and involves how the MongoDB journal works, and how that in turn interacts with how the kernel tracks the memory usage of individual processes.
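As a rough way to check the usual suspects before digging deeper, here is a small sketch. The device name (sda) and mount layout are assumptions for a plain Linux VM; RAID/LVM setups and different distros will have different paths and their own readahead settings:

```python
from pathlib import Path

def read_sys(path):
    """Read a sysfs/procfs tunable, or return 'n/a' if it doesn't exist."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else "n/a"

# Readahead for the device backing the dbpath (sysfs reports this in kB).
print("read_ahead_kb:       ", read_sys("/sys/block/sda/queue/read_ahead_kb"))

# zone_reclaim_mode should generally be 0 on NUMA hardware running mongod.
print("zone_reclaim_mode:   ", read_sys("/proc/sys/vm/zone_reclaim_mode"))

# Transparent huge pages are another common source of odd memory behaviour.
print("transparent_hugepage:", read_sys("/sys/kernel/mm/transparent_hugepage/enabled"))
```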

This is covered in detail as part of a lengthy MongoDB Jira issue - SERVER-9415. What we discovered when investigating that issue is that the behavior of the journal when doing a mix of reads and writes could (not always, but reproducibly) drastically reduce the reported resident memory for the MongoDB process. The mechanics of this have been described in detail by Kristina Chodorow, and there are more details in the Jira issue as well.

So, what does all that mean?

It means that the reporting and interpretation of resident memory statistics is complex, particularly on a system that is also doing writes, and especially if that system has memory pressure outside the mongod process. In general, I recommend the following methodology:

  • Read in (with touch or manual pre-heating via a large query/explain) a large, known amount of data that should fit into memory (see the sketch after this list)
  • Run some queries, aggregations, etc. on that data set and verify that page faulting is minimal
  • If page faults are low, the data fits into memory and you have a reporting problem. You can repeat the tests with larger data sets until you find your actual limit.
  • If page faults are high, the data has been evicted, was never fully loaded, etc., and you have something to investigate (readahead, memory pressure, NUMA, and so on)
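Here is a minimal sketch of that pre-heat-then-measure loop with pymongo. The connection string, database, and collection names are placeholders, and the touch command applies to the MMAPv1 engine this answer is concerned with (it has since been removed), so a full collection scan is used as the fallback:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]                          # hypothetical database name

def server_status():
    return client.admin.command("serverStatus")

before = server_status()["extra_info"]["page_faults"]   # reported on Linux

# Pre-heat: pull a known collection into RAM via touch (MMAPv1-era command),
# or fall back to a full scan if touch is unavailable.
try:
    db.command("touch", "mydata", data=True, index=True)  # hypothetical collection
except Exception:
    for _ in db["mydata"].find():
        pass

# Now run the real queries/aggregations for the workload being tested ...
list(db["mydata"].find({"some_field": {"$gt": 0}}).limit(1000))

after = server_status()["extra_info"]["page_faults"]
resident_mb = server_status()["mem"]["resident"]

print(f"page faults during test: {after - before}")
print(f"mongod resident (MB):    {resident_mb}")
```

If the fault delta stays near zero while resident memory barely moves, you are most likely looking at a reporting quirk rather than genuine eviction.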

I generally recommend running MMS Monitoring (free) while testing, since it lets you track memory stats (including non-mapped memory), page faults, and more over time, along with mongostat (for sub-minute resolution), to get a decent picture of what is going on.
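If you just want something quick and dirty in the meantime, a small polling loop over serverStatus gives a rough, mongostat-like view. The connection string is an assumption, and the mapped figure is only reported by the MMAPv1 engine:

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
prev_faults = None

while True:
    status = client.admin.command("serverStatus")
    mem = status["mem"]
    faults = status["extra_info"]["page_faults"]
    delta = 0 if prev_faults is None else faults - prev_faults
    prev_faults = faults
    print(f"resident={mem['resident']}MB "
          f"mapped={mem.get('mapped', 'n/a')}MB "
          f"faults/s={delta}")
    time.sleep(1)
```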