MongoDB Replica-Set with Replication Lag on one node only

Tags: database-replication, lag, mongodb, replication

We are seeing strange behaviour in our MongoDB replica set, a setup of 3 nodes (all Xeon quad-core-class CPUs; 16GB of RAM in one node, 24GB in the other two).
The node with less RAM is a normal secondary with priority 0; the other two have priority 1. Recently we have seen a replication lag of about 60 seconds every 3 to 4 hours, which disappears by itself after 2-3 minutes (as reported by our Nagios checks).

We have almost no traffic on those machines, just some databases of about 0.3GB each and one of 5GB. One collection has about 65,000 entries, but it also has an id index.

The strange thing is that the 16GB secondary has no lag; only the secondary among the two larger machines lags. I just changed that node to be primary to see whether the old primary (now a secondary) shows the same behaviour.

Does anyone know what we can do or check? Because we have no clue.

I checked the load and processes on those machines, the network connectivity and routing, and the disk states – everything is fine.

Best Answer

A few quick checks:

  • Are you running on 2.0 or below? Replication got a major overhaul in 2.2
  • Do you have any capped collections? A missing index on _id in a capped collection can cause this kind of lag
  • You mention that the hosts are not too busy - if you have gaps in your new ops, the math used to calculate lag can falsely report lag when no ops are happening
  • How are you calculating the lag? I would definitely try to confirm any lag from the shell - last optime from the entries in rs.status() would be a good start
  • Double check the network side of things: latency spikes and/or intermittent packet loss could cause this and be transient enough to be hard to detect (for example, take a look at netstat --statistics before and after a lag spike and see if retransmits or errors are increasing)
  • If you are running 2.2, see if switching the host the lagging secondary is syncing from helps. The current sync source is (somewhat confusingly) revealed by the syncingTo field in rs.status(), and you change it using the rs.syncFrom() command.
  • If it's not there already, get the set into MMS and see if anything is spiking on/around the same time as the lag spike to point you in the right direction.
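
To make the "confirm the lag from the optimes" check concrete, here is a minimal sketch in Python. It assumes you already have the rs.status() output as a dictionary (with pymongo, something like client.admin.command("replSetGetStatus") should return that shape), and it only uses the name, stateStr and optimeDate fields of each member. The function name, the sample data and the host names are made up for illustration; this is not how your Nagios plugin necessarily does it.

```python
from datetime import datetime


def replication_lag(status):
    """Return {member name: lag in seconds} for every SECONDARY.

    Lag is measured against the PRIMARY's last applied optime, not
    against the wall clock, so an idle set does not look "behind".
    """
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY in replica set status")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-written sample shaped like rs.status() output (hypothetical hosts).
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2013, 1, 1, 12, 0, 0)},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2013, 1, 1, 11, 59, 0)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2013, 1, 1, 12, 0, 0)},
    ],
}
```

Calling replication_lag(sample_status) gives a per-secondary lag in seconds. If a check instead compares a secondary's optime with the current time, a set that has simply had no writes for a minute would look a minute behind, which is one possible explanation for a "lag" that shows up on a nearly idle set and clears itself.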

If, after all that, you still don't know what's causing this, then it may be beyond answering on serverfault in a reasonable way (would need to look at logs, stats etc.) - I'd recommend the mongodb-user Google group as the next step.
