Server Performance – Impact of 100% CPU Usage on Parallel Tasks

cpu, server, windows

I am running 10 instances of the same executable, where each instance processes a different one-tenth chunk of the total data, on Windows Server 2012 R2 (Intel(R) Xeon(R) 2.4GHz, 2 processors, 64GB RAM).

My question is: if the CPU hits 100% usage, are the 10 instances still being processed in parallel? Or does the CPU start sequencing them instead of running them in parallel? Am I better off simply reducing the number of instances until CPU usage is below 100%?

Best Answer

My question is: if the CPU hits 100% usage, are the 10 instances still being processed in parallel?

Yes.

Or does the CPU start sequencing them instead of running them in parallel?

Your processor probably doesn't have 10 CPU cores, so yes, the operating system is also swapping the threads in and out in some sequence. But it does not suddenly run one thread to completion when the CPU reaches 100%. It uses swapping to give each thread a time slice, and it does this whether the CPU is at 100% or not.

Am I better off simply reducing the number of instances until CPU usage is below 100%?

No, you want 100% CPU usage, or else you're wasting cycles!

However, that is not even close to the whole story, and 100% CPU is not a great measure of how efficiently those cycles are being used.

You should probably run as many threads as your processor has CPU cores. (If you are running single-threaded processes, count each of those as a thread.) When you run more than that, the operating system swaps threads into and out of the CPU cores, balancing fairness against other policies so that each thread gets a time slice here and there.
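For instance, rather than hard-coding 10 instances, you can size a pool of worker processes to the machine's core count and feed all the chunks through it. This is only a minimal sketch in Python, with a stand-in for whatever work your executable actually does:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # Stand-in for the real per-chunk work; here just a CPU-bound sum.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))          # hypothetical data set
    n_chunks = 10                          # the 1/10 split from the question
    size = len(data) // n_chunks
    chunks = [data[i * size:(i + 1) * size] for i in range(n_chunks)]

    workers = os.cpu_count() or 1          # one worker per logical core
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, chunks))
    print(sum(results))
```

With a pool like this, the number of data chunks no longer has to match the number of simultaneously running processes.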

Swapping causes various kinds of overhead: direct overhead, as cycles are wasted doing the swapping instead of your work, and indirect overhead, which has to do with the operation of the caches, the highest-speed memories that sit directly on the processor die. Caches form one level of the memory hierarchy, which also includes main memory (your 64 GB of RAM) and, ultimately, the disk. Using the memory hierarchy well is the key to throughput.

As each thread runs for a while, it takes over part of the cache for its own purposes. In some sense the cache is warming up for that thread, and this is good.

When the operating system can, it will give consecutive time slices to the same thread on the same core, so that thread starts each slice with a warm cache.

Too many threads at once cause undue cache pollution as they are swapped in: when a thread is first swapped in, the cache is cold for it, and it has to warm up again before that thread reaches its best performance.
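If you want to see where this starts to hurt on your machine, a rough experiment (not from the original answer, and independent of your actual workload) is to run the same fixed amount of CPU-bound work with progressively more workers and time each run:

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor

def busy(n):
    # Purely CPU-bound loop so the timing reflects scheduling, not I/O.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    tasks = [500_000] * (cores * 8)        # a fixed total amount of work
    for workers in (cores, cores * 2, cores * 4):
        start = time.perf_counter()
        with ProcessPoolExecutor(max_workers=workers) as pool:
            list(pool.map(busy, tasks))
        elapsed = time.perf_counter() - start
        print(f"{workers:3d} workers: {elapsed:.2f}s")
```

Past the core count, the extra workers tend to add swapping overhead rather than throughput.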


Still, processing a large data set efficiently requires not just keeping the cores busy but also making good use of the memory hierarchy.

One approach to finding a good way to slice the work, which you can tweak and refine by timing, is something like this:

Divide the whole work into chunks roughly the size of the large processor cache (L3, perhaps).

Then divide each chunk into as many sections as the processor has cores (e.g. 2 or 4) and run those in parallel. As those sections complete, run the sections of the next chunk.
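Here is a minimal sketch of that slicing scheme in Python. The cache size, the per-item byte count, and the stand-in work function are all assumptions you would replace and tune by timing on your own machine:

```python
import os
from concurrent.futures import ProcessPoolExecutor

CACHE_BYTES = 8 * 1024 * 1024      # assumed L3 size; tune by measurement
ITEM_BYTES = 8                     # assumed size of one data item
ITEMS_PER_CHUNK = CACHE_BYTES // ITEM_BYTES

def process_section(section):
    # Stand-in for the real per-section work.
    return sum(section)

if __name__ == "__main__":
    data = list(range(10_000_000))         # hypothetical data set
    cores = os.cpu_count() or 1
    total = 0
    with ProcessPoolExecutor(max_workers=cores) as pool:
        # Outer loop: one cache-sized chunk at a time.
        for start in range(0, len(data), ITEMS_PER_CHUNK):
            chunk = data[start:start + ITEMS_PER_CHUNK]
            # Inner split: one section per core, run in parallel,
            # then move on to the next chunk.
            step = max(1, len(chunk) // cores)
            sections = [chunk[i:i + step] for i in range(0, len(chunk), step)]
            total += sum(pool.map(process_section, sections))
    print(total)
```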

A lot depends on the workload and the cache architecture. Smaller L1 caches are typically replicated, each dedicated to a particular CPU core; larger L2 and L3 caches may be shared by all cores, though possibly with restrictions so that one core doesn't starve the others.

It comes down to how your specific workload runs on the specific cache architecture of the machine in question. Subdividing the chunks by the L3 size may work well, but chunks of the L2 size may work better.

One application/workload may access its section over and over in a more or less random way, while another may only run serially through the section from start to finish. These access patterns use the memory hierarchy with very different efficiency, so breaking the data up one way may help one workload and hurt another.
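As a toy illustration of the difference (not from the original answer), this small script touches the same buffer once sequentially and once in random order; on most machines the random walk is noticeably slower because it defeats caching and prefetching:

```python
import random
import time

N = 5_000_000
data = list(range(N))                      # the buffer being processed

orders = {
    "sequential": list(range(N)),
    "random": random.sample(range(N), N),  # same indices, shuffled
}

for name, order in orders.items():
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    elapsed = time.perf_counter() - start
    print(f"{name:10s}: {elapsed:.2f}s (sum={total})")
```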

There are also better measurements you can take with profiling tools (on Windows, Performance Monitor or a CPU profiler) than the overall CPU percentage.
