There are two questions here, one for MS clustering, and another one for Mongo.
MS Clustering
The decision of where to put the public, heartbeat, inter-node communication, and quorum drive is significant. Cluster architecture also makes a difference; you would pick different quorum options if the two nodes are in adjacent racks than if they were in completely different datacenters.
Put the heartbeat on the same interface/subnet as the public interface
This theory holds that if you lose your public interface, you want the heartbeat to fail because this node is effectively down to users.
Put the heartbeat on its own private interface/subnet
This theory holds that something outside of the cluster is arbitrating who is doing what role, and unnecessary node-death is to be avoided.
Put the WFS (witness file share) on the heartbeat network
If the two nodes are in the same overall network (the same set of switches is supporting the non-public networks for both nodes) then putting the WFS on the heartbeat network doesn't introduce any new vulnerabilities.
If the two nodes are in different network fault domains (such as different datacenters), this is a bad idea. The heartbeat network provides the 'node majority' quorum option, and the WFS provides the 'File Share Majority' quorum option. You really want both options to be in separate fault domains.
Your revised diagram makes sense if both nodes are in the same data-center, though I myself would put the heartbeat on the public side.
MongoDB
MongoDB is a bit simpler. With an even number of nodes, you absolutely want an extra node to act as a tie-breaker. They're pretty clear about that. However, your diagram states:
Up to 12 replica members (7 can vote).
7 is an odd number. You don't require an Arbiter.
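For completeness, if you ever did drop to an even number of voters, adding an arbiter is done from the primary. A minimal sketch, assuming a hypothetical host ("mongo-arbiter") that is already running mongod with the same --replSet name and holds no data:

// Run on the PRIMARY. "mongo-arbiter" is a hypothetical host; an arbiter
// votes in elections but stores no data.
rs.addArb("mongo-arbiter:27017")

// Verify it shows up in the configuration with "arbiterOnly" : true
rs.conf()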
Unlike Microsoft clusters, Mongo's cluster voting doesn't rely on multiple network paths to break voting deadlocks. Because of this, separate arbitration and cluster-internal networks do not provide any meaningful increase in robustness. The only reason you'd want a separate arbitration network is if replication traffic were expected to be so heavy that election packets (the heartbeat, actually) would be delayed long enough to miss the 10-second timeout.
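That timeout lives in the replica set configuration. Roughly, this is how you would inspect or raise it; a sketch only, and raising it is rarely the right fix compared to fixing the network:

// Run on the PRIMARY. The heartbeat timeout defaults to 10 seconds.
cfg = rs.conf()
cfg.settings = cfg.settings || {}
cfg.settings.heartbeatTimeoutSecs = 15   // only if the network genuinely can't keep up
rs.reconfig(cfg)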
Mongo syncs instantly, so there is something wrong with your Replica set.
MongoDB replica sets are something that you need to get right from when they are first set up. If they aren't set up correctly, they can be difficult to fix.
Configuration of the replica set should (normally) be done from the master only. If your set isn't live yet, the best option may be to re-create it.
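By way of illustration, a fresh set is normally brought up from the node you intend to be primary, and all later changes go through the primary too. The hostnames below are just examples (borrowed from the status output further down):

// Run once, on the intended primary, to create the set.
rs.initiate({
    _id : "rs0",
    members : [
        { _id : 0, host : "mongo-master:27017" },
        { _id : 1, host : "mongo-slave3:27017" },
        { _id : 2, host : "mongo-slave4:27017" }
    ]
})

// Later membership changes, also from the primary:
cfg = rs.conf()
// ...edit cfg.members as needed...
rs.reconfig(cfg)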
Also, not sure what robomongo is, but you're probably better off using the native mongo client to find out what is going on.
The rs.status() command should give you output like this:
rs0:SECONDARY> rs.status()
{
    "set" : "rs0",
    "date" : ISODate("2014-03-10T10:42:27Z"),
    "myState" : 2,
    "syncingTo" : "mongo-master:27017",
    "members" : [
        {
            "_id" : 0,
            "name" : "mongo-master:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 3008469,
            "optime" : Timestamp(1394448146, 1),
            "optimeDate" : ISODate("2014-03-10T10:42:26Z"),
            "lastHeartbeat" : ISODate("2014-03-10T10:42:26Z"),
            "lastHeartbeatRecv" : ISODate("2014-03-10T10:42:26Z"),
            "pingMs" : 1
        },
        {
            "_id" : 3,
            "name" : "mongo-slave3:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 3012206,
            "optime" : Timestamp(1394448146, 1),
            "optimeDate" : ISODate("2014-03-10T10:42:26Z"),
            "self" : true
        },
        {
            "_id" : 4,
            "name" : "mongo-slave4:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 890533,
            "optime" : Timestamp(1394448146, 1),
            "optimeDate" : ISODate("2014-03-10T10:42:26Z"),
            "lastHeartbeat" : ISODate("2014-03-10T10:42:26Z"),
            "lastHeartbeatRecv" : ISODate("2014-03-10T10:42:26Z"),
            "pingMs" : 0,
            "syncingTo" : "mongo-master:27017"
        }
    ],
    "ok" : 1
}
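The optime / optimeDate fields above are what actually show replication lag, and the shell can summarise them for you. A quick sketch (the helper name is from the 2.x-era shell):

// From any member: prints how far each secondary is behind the primary.
rs.printSlaveReplicationInfo()

// Rough manual equivalent, using the same data as rs.status():
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  optime: " + m.optimeDate)
})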
Best Answer
A few quick checks:

- rs.status() would be a good start.
- netstat --statistics (before and after a lag spike, for example - see if retransmits or errors are increasing).
- Check the syncingTo field in rs.status(); if you need to change which member it syncs from, this is done using the rs.syncFrom() command (see the sketch after this list).

If, after all that, you still don't know what's causing this, then it may be beyond answering on serverfault in a reasonable way (it would need someone to look at logs, stats etc.) - I'd recommend the mongodb-user Google group as the next step.