Server 2012 R2 VSS freezing and causing DFSR and Win Backups to fail

dfs-rvsswindows-server-2012-r2

Symptom is that approx once a week DFSR (Distributed File System Replication) stops replicating and Windows Backup (wbengine.exe) stops responding. When I analyse their wait chains they're waiting on dfsrs.exe Thread: 4664 and Thread: 4680. dfsrs.exe is waiting on Thread: 4664.

I kill dfsrs.exe and wbengine starts responding. The logs say The backup failed as the creation of a shadow copy has timed out.

Restart dfsrs.exe and we're back in business.

This issue occurs on two new Server 2012 R2 machines (fully updated) with a DFS share to each other. Disabling windows backup on a machine seems to stop it from causing the issue (even though we have other software which creates multiple daily volume shadow copies). Both servers have very little non-microsoft software on them.

I took dump files of the processes in their stuck state, but I guess VSSVC.dmp is going to be the important one. I don't know how to use WinDbg well, but here's the results I got:

0:000> !analyze -hang
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

Probably caused by : dfsrs.exe ( dfsrs!RunTaskCallback+17 )

Followup: MachineOwner
-------------------------------------
0:000> !dml_proc
DbgId  PID    Image file name
0      8c8    C:\Windows\System32\dfsrs.exe
0:000> !dml_proc 0x0
DbgId  PID    Image file name
0      8c8    C:\Windows\System32\dfsrs.exe

Browse module list

Threads:
    DbgId  TID    Name (if available)
    0      8cc    "<No name>"
    1      8e8    "<No name>"
    2      1a70   "<No name>"
    3      1a94   "<No name>"
    4      1af0   "<No name>"
    5      306c   "<No name>"
    6      5d4    "<No name>"
    7      2648   "<No name>"
    8      2654   "<No name>"
    9      204c   "<No name>"
    a      2cd0   "<No name>"
    b      304c   "<No name>"
    c      784    "<No name>"
    d      eec    "<No name>"
    e      3220   "<No name>"
    f      298c   "<No name>"
    10     1238   "<No name>"
    11     2b80   "<No name>"
    12     1248   "<No name>"
    13     3464   "<No name>"
    14     2cf0   "<No name>"
    15     31ac   "<No name>"
    16     3100   "<No name>"
    17     308c   "<No name>"
    18     2a74   "<No name>"
    19     18e0   "<No name>"
-------------------
0:000> ~[0x10]s;kM
ntdll!NtWaitForSingleObject+0xa:
00007ffc`27e606fa c3              ret
 # Child-SP          RetAddr           Call Site
00 0000008e`4af1c348 00007ffc`25111118 ntdll!NtWaitForSingleObject+0xa
01 0000008e`4af1c350 00007ff7`0540754f KERNELBASE!WaitForSingleObjectEx+0x94
02 0000008e`4af1c3f0 00007ff7`0537d78c dfsrs!Task::ShutDown+0x283
03 0000008e`4af1c490 00007ff7`053e0656 dfsrs!UpdateDistributionTask::RemoveUpdateTask+0x2c
04 0000008e`4af1c540 00007ff7`053b845c dfsrs!UpdateManager::FinalizeUpdateManager+0x152
05 0000008e`4af1c5d0 00007ff7`053ab007 dfsrs!InConnection::InConnectionContentSetContext::FinalizeInConnectionContentSetContext+0x2e4
06 0000008e`4af1c740 00007ff7`053752e2 dfsrs!InConnection::DeleteContentSetFromInConnection+0x2a3
07 0000008e`4af1c870 00007ff7`0534f1b4 dfsrs!ReplicaSetManager::DeleteContentSetFromReplicaSetManager+0x1e2
08 0000008e`4af1c950 00007ff7`05338e89 dfsrs!ContentSetManager::InternalFinalizeContentSetManager+0x310
09 0000008e`4af1cbe0 00007ff7`053362c5 dfsrs!VolumeManager::FinalizeVolumeManager+0x265
0a 0000008e`4af1ccd0 00007ff7`05323b1a dfsrs!VolumeManager::ShutdownVolume+0x265
0b 0000008e`4af1ce20 00007ff7`05227034 dfsrs!FrsReplicator::FinalizeReplicator+0x182
0c 0000008e`4af1cef0 00007ff7`05239c0f dfsrs!FrsService::InternalShutdown+0x1f4
0d 0000008e`4af1d030 00007ffc`23896e83 dfsrs!VssWriter::OnPrepareSnapshot+0x12f
0e 0000008e`4af1d0a0 00007ffc`2389d145 vssapi!CVssWriterImpl::OnPrepareSnapshotGuard+0x2b
0f 0000008e`4af1d0d0 00007ffc`2389c470 vssapi!CVssWriterImpl::PrepareForSnapshotInternal+0xc71
10 0000008e`4af1e150 00007ffc`270e20f3 vssapi!CVssWriterImpl::PrepareForSnapshot+0x50
11 0000008e`4af1e1a0 00007ffc`270e6fad rpcrt4!Invoke+0x73
12 0000008e`4af1e200 00007ffc`2757d58a rpcrt4!NdrStubCall2+0x35e
13 0000008e`4af1e870 00007ffc`272522b3 combase!CStdStubBuffer_Invoke+0xa0
14 0000008e`4af1e8b0 00007ffc`275786ad oleaut32!CUnivStubWrapper::Invoke+0x53
15 0000008e`4af1e900 00007ffc`27404f5a combase!SyncStubInvoke+0x205
16 (Inline Function) --------`-------- combase!StubInvoke+0xc0
17 0000008e`4af1eaa0 00007ffc`2757951f combase!CCtxComChnl::ContextInvoke+0x27a
18 (Inline Function) --------`-------- combase!DefaultInvokeInApartment+0x51
19 0000008e`4af1ecb0 00007ffc`27578fb0 combase!AppInvoke+0x1af
1a 0000008e`4af1eda0 00007ffc`27579b35 combase!ComInvokeWithLockAndIPID+0x676
1b 0000008e`4af1efe0 00007ffc`270e2467 combase!ThreadInvoke+0x48a
1c 0000008e`4af1f0b0 00007ffc`270e22c0 rpcrt4!DispatchToStubInCNoAvrf+0x33
1d 0000008e`4af1f100 00007ffc`270eaa88 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x190
1e 0000008e`4af1f200 00007ffc`270e2d26 rpcrt4!LRPC_SCALL::DispatchRequest+0x4c9
1f 0000008e`4af1f310 00007ffc`270e2b78 rpcrt4!LRPC_SCALL::HandleRequest+0x291
20 0000008e`4af1f3c0 00007ffc`270e195d rpcrt4!LRPC_SASSOCIATION::HandleRequest+0x238
21 0000008e`4af1f450 00007ffc`270e175e rpcrt4!LRPC_ADDRESS::ProcessIO+0x444
22 0000008e`4af1f590 00007ffc`27e0af00 rpcrt4!LrpcIoComplete+0x144
23 0000008e`4af1f630 00007ffc`27e09238 ntdll!TppAlpcpExecuteCallback+0x210
24 0000008e`4af1f6a0 00007ffc`254d13d2 ntdll!TppWorkerThread+0x888
25 0000008e`4af1fa80 00007ffc`27de54e4 kernel32!BaseThreadInitThunk+0x22
26 0000008e`4af1fab0 00000000`00000000 ntdll!RtlUserThreadStart+0x34

Any ideas where to go from here? I didn't think to run process explorer while the service was frozen, but it'll freeze again next week if we want to try then.

Best Answer

OK, problem ended up being backup software (Syncovery) which had a memory leak and was causing VSS to get stuck even when the software itself completed its job. An update seems to have solved the issue.