Windows DFSR – Changed replicated directory permissions and now have a 350,000-file backlog after more than a week

dfs-r · replication · windows · windows-server-2008-r2

Question: Is there a way to make this 350,000-file backlog complete faster? For nearly every file, the only change was to its ACL. Some files have changed content, but that is not the common case in this situation.

This might be fixed. I'll edit this text to confirm success or failure once some time has passed and I can verify it. Toward the end of this question I have detailed the recent changes that might have fixed it.

We have a DFSR replication group with about 450,000 files that takes up 1.5TB of space. In this situation, there are two Windows Server 2008 R2 servers that are about 500 miles apart. There are other servers, but they aren't involved in this replication group. Server ALPHA is the main server and is the one used by most of the staff. Server BETA is the server in the remote office and is less busy.

Here is a graph of backlog for this replication group (PNG hosted on Google Drive) showing the slow sync progress.

I needed to remove a permission entry from the root directory of that replication group, which of course was inherited across most of the subfolders. I made this change on server ALPHA. Immediately afterward, DFSR had a 350,000-file backlog. It has been more than a week and the backlog is now at 267,000. The only thing that changed (initially) was that single permission.
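For reference, the backlog count in each direction can be pulled from the command line with dfsrdiag; the replication group and folder names below are placeholders, not our real ones:

    REM Backlog of changes waiting to go from ALPHA to BETA (names are placeholders).
    dfsrdiag backlog /rgname:"Data RG" /rfname:"Data" /smem:ALPHA /rmem:BETA

    REM Swap the members to confirm the BETA-to-ALPHA direction really is clean.
    dfsrdiag backlog /rgname:"Data RG" /rfname:"Data" /smem:BETA /rmem:ALPHA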

This is what happened (this is not the solution, just another explanation of what happened to cause this issue): http://blogs.technet.com/b/askds/archive/2012/04/14/saturday-mail-sack-because-it-turns-out-friday-night-was-alright-for-fighting.aspx#dfsr

Any changes that occur on server BETA are replicated to server ALPHA very quickly since there is no backlog in that direction. Any files changed on BETA do make it to ALPHA without trouble.

It's replicating 24/7 at full speed across a 50Mbps connection on one end and 100Mbps fiber on the other. The staging area is 100GB on each server. There is nothing interesting in the event logs for this replication group: in particular, no high watermark events and no connection errors. The one high watermark event that does show up is for an unrelated replication group that involves neither this replication nor the ALPHA/BETA server pair.
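If you want to double-check that from the command line rather than Event Viewer, wevtutil can dump the most recent warnings and errors from the DFS Replication log (a generic query, not specific to this replication group):

    REM Last 25 warning/error events from the DFS Replication log, newest first.
    wevtutil qe "DFS Replication" /c:25 /rd:true /f:text /q:"*[System[(Level=2 or Level=3)]]"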

ALPHA's view of the replication group:

Bandwidth Savings: 99.83% reduction (30.85 MB replicated instead of 18.1 GB)

I believe the 30.85MB/18.1GB figure covers the period since I last restarted the DFSR service on ALPHA and BETA. If so, it shows that even though replication is taking far longer than I believe it should, it isn't actually transferring the file contents across the wire.

Replicated folder: 1.46TB (actual size), 439,387 (files), 52,886 (folders)

Conflict and Deleted folder: 100.00GB (configured size), 34.01GB (actual size), 19,620 (files), 2,393 (folders)

Staging folder: 200.00GB (configured size), 92.54GB (actual size)
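Numbers like these also appear in a DFSR diagnostic (health) report, which can be generated from the command line if anyone wants to compare; the group name, reference member, and output path below are placeholders:

    REM Generate a DFSR health report with file counts, then open the resulting HTML file.
    dfsradmin health new /rgname:"Data RG" /refmemname:ALPHA /repname:C:\Temp\dfsr-health /fscount:true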

I got one high watermark error in the logs (May 14, 7pm), so I have upped the staging quota from 100GB to 200GB. I know the Microsoft-approved route is to increase it by 20%, but I'm not playing around on this one. We have plenty of disk space to spare on the staging disk arrays.
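For anyone who prefers the command line to the DFS Management GUI, I believe the staging quota can also be set per member through DFSR's WMI configuration class; a sketch (the folder name is a placeholder, the size is in MB, and it needs to be run on each member):

    REM Set this member's staging quota for one replicated folder to 200GB (204800 MB).
    wmic /namespace:\\root\microsoftdfs path DfsrReplicatedFolderConfig where "ReplicatedFolderName='Data'" set StagingSizeInMb=204800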

Disabling anti-virus on all servers did not help, though I thought it would help at least a little. For now I have re-enabled anti-virus but excluded the replication group's path from scanning, just to take that variable out of the equation.

Is there a way to get this to go faster? I would just make the same change on server BETA as well, but there are files that have changed on ALPHA and haven't yet replicated to BETA, and making the inherited permission change on BETA would push OLD files from BETA to ALPHA (because DFSR seems to ignore file timestamps when deciding which file wins a conflict). Having that happen would be rather bad.

The backlog is shrinking, but very, very slowly. It is moving forward, though. At this rate it will be weeks before it finishes. I'm contemplating just shoving a copy of the data set onto a 3TB drive and shipping it to the remote office. Is there a better way?
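If shipping a drive does turn out to be the answer, the usual trick is to preseed with robocopy so DFSR can match files by hash instead of restaging everything; a rough sketch, assuming the source and destination paths below are placeholders and that nothing touches the files after the copy:

    REM Preseed: copy the data plus all security info, skip the DfsrPrivate folder.
    robocopy D:\Data E:\Data /E /B /COPYALL /R:3 /W:5 /MT:32 /XD DfsrPrivate /LOG:C:\Temp\preseed.log

The /COPYALL part matters because the ACLs factor into the hashes DFSR compares; if they differ, the preseeded copies are treated as conflicting and get staged anyway.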

May 16, 4am US PT: What might have fixed the problem (assuming it's honestly fixed, anyway):

I made multiple changes to the DCs that should have been made a long time ago. The problem is that this network was inherited from someone else who probably inherited it from someone else, etc. I can't promise which change fixed the problem. Here they are in no particular order:

  • The DCs were not in the "Domain Controllers" OU. I've never seen a Windows domain that had its DCs elsewhere. I moved them back to where they belonged. They were previously in OUs segregated by the name of the city each office is in. (I have a feeling I've got some plumbing work to deal with now that I've moved them, but all seems okay at present…)
  • AVG Anti-Virus is running on all DCs and DFSR-participating servers. I excluded the replicated folders and the staging folders from active/on-access scanning. I don't think this fixed the problem and I'm likely to test this issue later on to see if undoing that change will interfere with the replication speed of DFSR. That's a challenge for another day.
  • dcdiag.exe complained of a DNS issue with regard to RODCs. I remedied that problem even though we have no RODCs on the domain at all. I doubt this fixed anything.
  • One of the _ldap._tcp.domain.GUID._msdcs.DOMAIN.NET SRV records was missing for one of the DCs (not one of the DFSR servers) and I remedied that. I don't think this helped either.
  • One of the times I rebooted server BETA, it complained of an unexpected shutdown of the DFSR database (event 2212) and then took hours to rebuild the database, reporting event 2214 when it was done. After that, replication was still extremely slow, but the rebuild might have helped unstick whatever was stuck.
  • One of the DCs didn't have 127.0.0.1 as a secondary DNS server in its interface configuration. I added it. This wasn't one of the DFSR servers, so that probably had nothing to do with it.
  • I followed the registry settings recommended for DFSR servers in the TechNet blog post "Tuning replication performance in DFSR". I used all of the "tested high performance" values, except that AsyncIoMaxBufferSizeBytes was set to 4194304, one notch lower than the high value (see the sketch after this list). This could have helped with the problem… or maybe not. It's difficult to tell when one changes too many variables at once.
  • dcdiag.exe complained about a problem communicating with the RPC service on BETA, but only after I had already made the above changes. This seemed the most likely culprit, but I did nothing specific to correct it. The VPN was running properly and the firewall wasn't blocking it. It's possible that one of the items above caused and then remedied the RPC issue, or it could be simple coincidence. I am not getting that error now, and replication is running smoothly at present.
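For reference, those tuning values live under the DFSR service's Parameters\Settings registry key (assuming I'm remembering the key path from that post correctly); only the value quoted above is shown concretely here, and the rest should be taken from the blog post itself:

    REM Set AsyncIoMaxBufferSizeBytes to 4194304 (one notch below the tested high value).
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\DFSR\Parameters\Settings" /v AsyncIoMaxBufferSizeBytes /t REG_DWORD /d 4194304 /f

    REM The settings are read at service start, so restart DFSR afterward.
    net stop dfsr && net start dfsr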

The moral of the story is: change one thing at a time or you'll never really know what fixed it. But I was desperate and was running out of time to fix it, so I just fired a bunch of bullets at the problem. If I ever pinpoint the fix, I'll report that here. Don't bank on me narrowing it down, though.

EDIT 5/21/2012:
I solved this by driving about seven hours to the remote office yesterday with a spare server (GAMMA). GAMMA is now acting as their primary local server while their usual server (BETA) catches up on the replication. Since I put it in place, replication has been running at about double the speed. While this suggests it could be a VPN-related issue, I'm less inclined to believe that, since all new updates replicating from ALPHA to GAMMA have been very quick and going well.

EDIT 5/22/2012:
It's at 12,000 right now and should be finished in a few hours. I'll post a nice graph of the progress from slow start to fast finish. The problem is that the only thing that really "fixed" it was adding the local server connection. I'm presently thinking the VPN may be part of the problem, and if that's the case, this question isn't quite answered yet. After I've had more time to watch how things replicate over the VPN and see whether anything fails, I'll debug and report the progress.

If something changes I'll update here.

Best Answer

Very strange problem, especially after reviewing the edit.

I would inspect the DFSR debug logs, which are located in %systemroot%\debug. By default there should be nine previous log files that have been GZ-archived, plus one that is currently being written to.

Open the current log in a text editor and search for the text "warning" or "error". You can check out this blog series for more detailed information on the debug logs: http://blogs.technet.com/b/askds/archive/2009/03/23/understanding-dfsr-debug-logging-part-1-logging-levels-log-format-guid-s.aspx
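A quick way to scan the live logs for those strings from the command line (the Dfsr*.log pattern matches the default log names; the GZ-archived ones would need to be decompressed first):

    REM Show every line mentioning "error" or "warn" in the current DFSR debug logs, with line numbers.
    findstr /i /n "error warn" %systemroot%\debug\Dfsr*.log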

Other questions/suggestions:

Is there anything out of place when looking at the Resource Monitor? Excess hard drive or CPU activity that is outside a baseline?

If possible I'd restart both the ALPHA and BETA servers. If that resolves your issue you may never know what the real problem was, but if it's critical that this gets resolved soon, it is worth a try.

Edit based on Question Update

You mentioned two entries related to an 850 MB file, as well as an error within the DFSR debug log.

Can you try changing the staging location to a different folder or drive on each server, in case the files currently being staged are corrupt or are blocking replication in some way?
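If it helps, I believe the staging path can be pointed at a different location per member through the same WMI configuration class DFSR exposes; a sketch (the folder name and new path are placeholders, run on the member being changed), though the Staging tab in DFS Management does the same thing if you'd rather not touch WMI:

    REM Point this member's staging folder for one replicated folder at a different drive.
    wmic /namespace:\\root\microsoftdfs path DfsrReplicatedFolderConfig where "ReplicatedFolderName='Data'" set StagingPath="E:\DfsrStaging\Data"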