“Peaky” CPU Usage on Domain Controllers

central-processing-unitcpu-usagedomain-controllerwindows-server-2008

We have two Windows Server 2008 SP2 (sadly not 2008 R2) Domain Controllers in a small 150 client domain that are exhibiting very "peaky" CPU usage. The Domain Controllers both exhibit the same behavior and are hosted on vSphere 5.5.0, 1331820. Every two or three seconds the CPU usage jumps up to 80-100% and then quickly drops, remains low for a second or two and then jumps up again.

DC3 Task Manager Performance

Looking at the historical performance data for the virtual machine indicates that this condition has been going on for at least a year but the frequency has increased since March.

DC3 Virtual Machine Performance

The offending process is SVChost.exe which is wrapping the DHCP Client (dhcpcsvc.dll), EventLog (wevtsvc.dll) and LMHOSTS (lmhsvc.dll) services. I'm certainly not a Windows internals expert but I could not seem to find anything especially amiss when viewing the process with Process Explorer other than it appears the EventLog is triggering a ton of RpcBindingUnbind calls.

DC3 Process Explorer for SVCHost.exe

At this point I'm out of coffee and ideas. How should I continue to troubleshoot this issue?

Best Answer

TL;DR: EventLog file was full. Overwriting entries is expensive and/or not implemented very well in Windows Server 2008.


At @pk. and @joeqwerty suggestion and after asking around, I decided that it seemed most likely that a forgotten monitoring implementation was scraping the event logs.

I installed Microsoft's Network Monitor on one of the Domain Controllers and started filtering for MSRPC using the ProtocolName == MSRPC filter. There was lots of traffic but it was all between our remote site's RODC and unfortunately did not use same destination port as the listening EventLog process. Darn! There goes that theory.

To simplify things and make it easier to run monitoring software I decided to unwrap the EventLog service from SVCHost. The following command and a reboot of the Domain Controller dedicates one SVCHost process to the EventLog service. This makes investigation a little easier since you do not have multiple services attached to that PID.

SC config EventLog Type= own

I then resorted to ProcMon and setup a filter to exclude everything that did not use that PID. I did not see tons of failed attempts by EventLog to open missing registry keys as indicated as a possible cause here (apparently crappy applications can register as a Event Sources in extremely poor ways). Predictably I saw lots of successful ReadFile entries of the Security Event Log (C:\Windows\System32\WinEvt\Logs\Security.evtx).

ReadFile Security.evtx

Here's a look at the Stack on one of those events: RpcBindingUnbind

You'll notice first the RPCBinding and then RPCBindingUnbind. There were a lot of these. Like thousands per second. Either the Security Log is really busy or something is not working right with the Security.evtx log.

In EventViewer the Security Log was only logging a between 50-100 events per minute which seemed appropriate for a domain of this size. Darn! There goes theory number two that we had some application with very verbose event auditing turned on left in a forgotten corner still dutifully chugging away. There were a still a lot (~250,000) of events recorded even though the rate of events being logged was low. Log size perhaps?

Security Logs - (Right Click) - Properties... and the maximum log size was set for 131,072 KB and log size was currently holding at 131,072 KB. The 'Overwrite events as needed' radio button was checked. I figured that constantly deleting and writing to the log file was probably hard work especially when it was so full so I opted to Clear the Log (I saved the old log just in case we need it for auditing later) and let the EventLog service create a new empty file. The result: CPU usage returned to a sane level around 5%.