C++ Multithreading – Parallelism Level with std::thread on Multi-Core CPUs

Tags: c++, multithreading, scalability

Suppose I have one C++ process, and I want this process to run eight threads in parallel.

And suppose that:

  • I have a computer with two (2) physical CPUs.
  • Each CPU has four (4) cores, so that's 4 × 2 = eight (8) cores total.
  • Each core allows only one (1) logical thread (no hyper-threading), so eight (8) logical threads total.
  • I am using a 64-bit (x86-64) operating system.

My question is this:

  • Can I run eight (8) threads in parallel on the above system using one process?
  • Does this depend on which x64 operating system I am using? If so, what's the difference between Ubuntu Linux x64 and Windows 10 x64?
  • Does this depend on which compiler I am using? If so, how do GCC, VC++, clang, and Intel compare in this regard?

Note: When I say "in parallel", I do mean on completely separate logical threads.

Best Answer

Yes, a single process with eight threads can run on eight cores concurrently. To the OS, the physical arrangement of those cores (the number of sockets) is largely irrelevant.

This is true of essentially every major OS I'm aware of (Windows, Linux, *BSD). Likewise, no C++ compiler that I know of imposes limitations in this regard.

The main thing you have to do to make it happen is ensure that your threads can actually execute independently of each other. Where there are inter-dependencies, you can end up with one thread (and therefore one core) waiting for another.
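As a minimal sketch (not production code, and assuming a C++11 compiler with &lt;thread&gt; support), here is a single process that spawns one worker per hardware thread, each working on a disjoint slice of the data so none waits on another. On the machine you describe, std::thread::hardware_concurrency() should report 8:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        // hardware_concurrency() may return 0 if it can't tell; fall back to 1.
        const unsigned n_threads =
            std::max(1u, std::thread::hardware_concurrency());

        std::vector<std::uint64_t> data(8000000, 1);
        std::vector<std::uint64_t> partial(n_threads, 0);
        std::vector<std::thread> workers;

        const std::size_t chunk = data.size() / n_threads;
        for (unsigned i = 0; i < n_threads; ++i) {
            // Each worker reads its own disjoint range and writes only its
            // own result slot, so the threads don't depend on one another.
            auto first = data.begin() + i * chunk;
            auto last  = (i + 1 == n_threads) ? data.end() : first + chunk;
            workers.emplace_back([first, last, &partial, i] {
                partial[i] = std::accumulate(first, last, std::uint64_t{0});
            });
        }
        for (auto& w : workers) w.join();

        std::cout << std::accumulate(partial.begin(), partial.end(),
                                     std::uint64_t{0}) << '\n';
    }

With eight hardware threads available and no shared mutable state, the OS is free to run all eight workers at once.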

Depending on the sort of things you're doing in the threads, it's often useful to create an extra thread or two (or sometimes more, especially if any might be I/O bound) so there's always a thread ready to run on any given core, even if one of the threads stalls waiting for input (or something on that order).
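As a rough sizing heuristic (a sketch only; pick_worker_count and the io_bound_hint parameter are illustrative names, not an established API):

    #include <algorithm>
    #include <thread>

    // Start from the hardware thread count and add a small allowance for
    // threads that may block on I/O, so a core isn't left idle while one
    // of them is stalled waiting for input.
    unsigned pick_worker_count(unsigned io_bound_hint) {
        const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
        return hw + std::min(io_bound_hint, hw);
    }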

I should add, however, that most current OSes do take sockets into account when they're deciding where to schedule a thread. You normally have a single cache shared between cores in a single socket, but each socket has its own separate cache, so you can gain some efficiency by taking sockets into account when scheduling the threads. This won't normally stop a thread from running if it's ready, but if two threads are both ready to run, it might show a preference for scheduling each on a particular socket if possible.
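If you want to experiment with placement yourself rather than leaving it to the scheduler, both OSes expose affinity APIs. Here is a Linux-only sketch (pin_to_cpu is an illustrative helper; pthread_setaffinity_np is a GNU extension, and the rough Windows analogue is SetThreadAffinityMask):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // Pin a std::thread to one logical CPU, e.g. to keep it on a chosen
    // socket so it stays near that socket's cache.
    bool pin_to_cpu(std::thread& t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(t.native_handle(),
                                      sizeof(set), &set) == 0;
    }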

Likewise, many multi-socket systems provide non-uniform memory access (NUMA). That is, each CPU is directly connected to part of the memory. It can use data stored in the memory attached to a different CPU, but doing so increases latency and may also reduce bandwidth. On such a system, it's generally preferable to schedule threads onto the socket that's directly attached to the memory holding the data they're processing.

To help optimize that, Windows and Linux both provide NUMA functions so you can control thread scheduling and memory allocation. This is an optional optimization, though: your program can use all the cores on all the sockets without it, but it may run faster if you optimize your access patterns.
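On the Linux side, for instance, here is a sketch using libnuma (assuming the library is installed; compile with -lnuma; the rough Windows analogues are GetNumaHighestNodeNumber and VirtualAllocExNuma):

    #include <numa.h>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::puts("no NUMA support on this system");
            return 0;
        }
        const size_t bytes = 1 << 20;
        // Allocate the buffer on node 0 and run the calling thread on the
        // same node, so the data stays local to the socket that touches it.
        void* buf = numa_alloc_onnode(bytes, 0);
        if (buf != nullptr) {
            numa_run_on_node(0);
            // ... process buf here ...
            numa_free(buf, bytes);
        }
    }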

Also note that if you use this badly, you can pessimize your program, so that it runs even slower than it would by default.