When is sustained 100% CPU utilization not a worry?

cpu-usage process-priority virtual-machines windows-server-2003

Please help refine a discussion going on in our shop.

Consider the following scenario. There is a Microsoft VPC running several apps and services (Windows 2003 server). The server has two or three critical roles. Every so often, CPU utilization goes to 100% on a sustained basis. One of the culprits in this is a legacy application, for which the only real solution, at this time, is to restart the service. After this, CPU utilization returns to something reasonable (on average, 60-80%).

However, less often, when the server is at 100% CPU, another service appears to be using the lion's share: a security application that parses logs. Our operations team's impulse is to restart that as well when the CPU becomes pegged. Our security team points out that this is pointless, as this service is running at BelowNormal priority, so effectively it is not depriving any other process of CPU.

Security argues that the 100% CPU usage in those cases should really not be considered a critical condition. If a BelowNormal priority process is using most of the CPU, then there is actually no CPU deficit at all. Operations, on the other hand, is skeptical that 100% CPU utilization could really be a condition without adverse consequences, and doesn't want to ignore it.

Who is right? Is Security right that it's nothing to worry about, or Operations that we ought to do something?

Best Answer

In cases like this, you need to go beyond Task Manager and its % CPU usage figure, which does not tell you whether anything is actually being starved of CPU. The next step is to use Performance Monitor to watch the System\Processor Queue Length counter. It shows how many threads are ready to run but waiting for a processor, which is what affects performance. This is roughly similar to the load average reported by top or uptime on Unix.
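To make the idea concrete, here is a minimal Python sketch of how one might judge a series of Processor Queue Length samples once they have been collected (for example, exported from Performance Monitor). The function name `cpu_is_contended`, the threshold of 2 waiting threads per processor, and the "80% of samples" definition of sustained are all assumptions for illustration; the threshold is a commonly cited rule of thumb, not a hard limit, and should be tuned to your own baseline.

```python
def cpu_is_contended(queue_samples, cpu_count,
                     per_cpu_threshold=2.0, sustained_fraction=0.8):
    """Return True if the processor queue looks persistently backed up.

    queue_samples      -- Processor Queue Length readings taken at regular intervals
    cpu_count          -- number of logical processors on the machine
    per_cpu_threshold  -- assumed rule of thumb: waiting threads per processor
    sustained_fraction -- assumed share of samples that must exceed the limit
    """
    if not queue_samples:
        raise ValueError("need at least one sample")
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")

    # The counter is system-wide, so scale the limit by the processor count.
    limit = per_cpu_threshold * cpu_count

    # Count how many readings exceed the limit; a single spike is not a problem.
    over_limit = sum(1 for q in queue_samples if q > limit)
    return over_limit / len(queue_samples) >= sustained_fraction


# CPU at 100% but almost nothing waiting: likely a low-priority process soaking up idle time.
print(cpu_is_contended([0, 1, 0, 2, 1, 0], cpu_count=1))

# CPU at 100% with a queue that stays long: other work is probably being delayed.
print(cpu_is_contended([5, 6, 4, 7, 5, 6], cpu_count=1))
```

Under these assumptions the first case would not be flagged and the second would, which mirrors the distinction between Security's and Operations' positions: high utilization alone is not the signal, a persistently long queue is.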

This article has a good description of the performance metrics to look at when troubleshooting these issues. It was originally for NT4, but still applies to newer versions.

Here is a more recent article from the Windows Performance Team talking about how to hunt down performance issues with the CPU.