Azure – mongodb keep looping while trying to sync data

azuredatabasedatabase-replicationmongodbreplication

I'm converting my my standalone mongodb into a replica set. I've added one more member (and I want to add two more members later, and to shutdown the primary server).

My primary mongodb is running 2.2.3, and the new replica member running the latest ver of mongodb, 2.6.4.

Both of the db's running on Ubuntu server 14.04, At microsoft Azure and They are running in the same Affinity Group. (Vm size is A2)

I've edited the ulimit of "nofile" and "nproc" to 65535, After seeing the MMMS-monitoring advice, BUt only on the secondaries, to avoid the reboot time of the machines, Is it necessary?

I got somewhere above 80m document on the primary database and it is running on live production. is it beacuse of this?

After few hours of syncing data, the TTL showed the following errors, and started to sync all over again. and it's keeps looping.

[rsSync] done building bottom layer, going to commit

[rsSync] old journal file will be removed:
/datadrive/data/journal/j._9

[rsSync] build index done. scanned 55381316 total records. 1348.97
secs

[conn221] serverStatus was very slow: { after basic: 0, after asserts:
0, after backgroundFlushing: 0, after connections: 0, after cursors:
0, after dur: 0, after extra_info: 0, after globalLock: 0, after
indexCounters: 0, after locks: 0, after network: 0, after opcounters:
0, after opcountersRepl: 0, after recordStats: 744214, after repl:
744214, at end: 744214 }

[conn221] command admin.$cmd command: serverStatus { serverStatus: 1 }
keyUpdates:0 numYields:0 locks(micros) r:31 reslen:3920 1243515ms

[conn228] serverStatus was very slow: { after basic: 0, after asserts:
0, after backgroundFlushing: 0, after connections: 0, after cursors:
0, after dur: 0, after extra_info: 0, after globalLock: 0, after
indexCounters: 0, after locks: 0, after network: 0, after opcounters:
0, after opcountersRepl: 0, after recordStats: 634932, after repl:
634932, at end: 634932 }

[conn228] command admin.$cmd command: serverStatus { serverStatus: 1 }
keyUpdates:0 numYields:0 locks(micros) r:33 reslen:3920 1073310ms

[conn235] serverStatus was very slow: { after basic: 0, after asserts:
0, after backgroundFlushing: 0, after connections: 0, after cursors:
0, after dur: 0, after extra_info: 0, after globalLock: 0, after
indexCounters: 0, after locks: 0, after network: 0, after opcounters:
0, after opcountersRepl: 0, after recordStats: 578551, after repl:
578551, at end: 578551 }

[conn235] command admin.$cmd command: serverStatus { serverStatus: 1 }
keyUpdates:0 numYields:0 locks(micros) r:28 reslen:3920 963376ms

[conn194] SocketException handling request, closing client connection:
9001 socket exception [SEND_ERROR] server [ServerIp:1250]

[conn252] SocketException handling request, closing client connection:
9001 socket exception [SEND_ERROR] server [ServerIp:1248]

[rsSync] Socket say send() errno:110 Connection timed out
ServerIp:27017

[rsSync] replSet initial sync exception: 9001 socket exception
[SEND_ERROR] server [Serverip:27017] 8 attempts remaining

[rsSync] replSet initial sync pending

[rsSync] replSet syncing to: [ServerAddress]:27017

[rsSync] replSet initial sync drop all databases

[rsSync] dropAllDatabasesExceptLocal 2

[rsSync] removeJournalFiles

[rsSync] replSet initial sync clone all databases

[rsSync] replSet initial sync cloning db: PkgsKeyValues

[FileAllocator] allocating new datafile
/datadrive/data/PkgsKeyValues.ns, filling with zeroes…

[FileAllocator] allocating new datafile
/datadrive/data/PkgsKeyValues.3, filling with zeroes…

[FileAllocator] done allocating datafile
/datadrive/data/PkgsKeyValues.3, size: 512MB, took 0.124 secs

Any ideas?

Best Answer

It's hard to say for sure based on such limited information, but that looks like something is killing the connection between the secondary and the primary while it is trying to sync. If it is happening repeatedly at approximately the same time, then it suggests that something in your network is enforcing a max connection time. If it is happening randomly it suggests that there is something flakey on the network itself (impossible to tell what that might be without significant troubleshooting).

It's also obvious from the snippet of logs that the secondary is under heavy load when this happens since it is taking multiple hours to run ServerStatus (usually sub 100ms when not under load) - now, it is building an index at the time, which is a blocking operation, so that may be a red herring if that is a large index. If that is not a large index, then it suggests the secondary is a little under provisioned in terms of resources.

If you can't solve the loop you can take other measures to get the secondary up and running like copying the data files, however unless you have snapshotting options that will involve stopping writes or taking down time for the duration of the copy.

Related Topic