How efficient is locking an unlocked mutex? What is the cost of a mutex?


In a low-level language (C, C++, or whatever): I have the choice between having a bunch of mutexes (like what pthread gives me, or whatever the native system library provides) or a single one for an object.

How efficient is it to lock a mutex? That is, how many assembler instructions are likely involved, and how much time do they take (in the case where the mutex is unlocked)?

How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables and it doesn't really matter?

(I am not sure how much difference there is between hardware platforms. If there is any, I would also like to know about it. But mostly, I am interested in common hardware.)

The point is: by using many mutexes, each of which covers only a part of the object, instead of a single mutex for the whole object, I could avoid a lot of blocking. And I am wondering how far I should go with this. Should I try to avoid every possible block, no matter how much more complicated it gets and how many more mutexes it requires?
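To illustrate what I mean (the struct and field names are made up, just a sketch):

    // One coarse lock vs. several fine-grained ones.
    #include <mutex>

    struct CoarseObject {
        std::mutex m;       // one mutex guards the whole object
        int a, b;
    };

    struct FineObject {
        std::mutex ma, mb;  // one mutex per independent part:
        int a, b;           // threads touching only `a` never block on `b`
    };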


WebKit's blog post about locking (2016) is closely related to this question and explains the differences between a spinlock, adaptive lock, futex, etc.

Best Answer

I have the choice between having a bunch of mutexes or a single one for an object.

If you have many threads and accesses to the object happen often, then multiple locks will increase parallelism. This comes at the cost of maintainability, since more locking means more debugging of the locking.

How efficient is it to lock a mutex? That is, how many assembler instructions are likely involved, and how much time do they take (in the case where the mutex is unlocked)?

The precise assembler instructions are the least of a mutex's overhead; the memory/cache-coherency guarantees are the main cost. And the less often a particular lock is taken, the better.

A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not, and (2) a wait queue.

Changing the flag takes just a few instructions and is normally done without a system call. If the mutex is already locked, a syscall happens to add the calling thread to the wait queue and start the waiting. Unlocking is cheap if the wait queue is empty, but otherwise needs a syscall to wake up one of the waiting threads. (On some systems, cheap/fast syscalls are used to implement the mutexes; they become slow (normal) system calls only in case of contention.)
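For illustration, here is a minimal sketch of that two-part design on Linux, loosely following the futex-based mutex from Ulrich Drepper's "Futexes Are Tricky" paper. It is illustrative only; real implementations such as glibc's pthread_mutex_t handle many more cases (error handling, robustness, priority inheritance):

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    class FutexMutex {
        // 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
        std::atomic<int> state{0};

        static void futex(std::atomic<int>* addr, int op, int val) {
            // std::atomic<int> is layout-compatible with int on mainstream ABIs.
            syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
        }

    public:
        void lock() {
            int c = 0;
            // Fast path: one atomic compare-exchange, no system call.
            if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
                return;
            // Slow path: mark the mutex contended and sleep in the kernel's
            // wait queue until the holder wakes us.
            if (c != 2)
                c = state.exchange(2, std::memory_order_acquire);
            while (c != 0) {
                futex(&state, FUTEX_WAIT_PRIVATE, 2);  // sleep while state == 2
                c = state.exchange(2, std::memory_order_acquire);
            }
        }

        void unlock() {
            // Fast path: state was 1 (no waiters), just clear the flag.
            if (state.exchange(0, std::memory_order_release) == 2)
                futex(&state, FUTEX_WAKE_PRIVATE, 1);  // wake one waiter
        }
    };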

Locking an unlocked mutex is really cheap. Unlocking a mutex without contention is cheap too.
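If you want a rough number for your own machine, a trivial micro-benchmark like the following (assuming C++ and std::mutex) will show it; expect something on the order of tens of nanoseconds per uncontended lock/unlock pair on common x86-64 hardware:

    #include <chrono>
    #include <cstdio>
    #include <mutex>

    int main() {
        std::mutex m;
        constexpr int iters = 10'000'000;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) { m.lock(); m.unlock(); }
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.1f ns per uncontended lock/unlock pair\n", ns / iters);
    }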

How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables and it doesn't really matter?

You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.
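Each mutex is just a small piece of memory (typically a few tens of bytes; for example, sizeof(pthread_mutex_t) is 40 bytes on x86-64 Linux). A quick sketch to convince yourself there is no practical limit:

    #include <cstdio>
    #include <mutex>
    #include <vector>

    int main() {
        std::printf("sizeof(std::mutex) = %zu bytes\n", sizeof(std::mutex));
        std::vector<std::mutex> many(1'000'000);  // tens of MB, allocates fine
        many[123456].lock();                      // each element works normally
        many[123456].unlock();
    }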

Summary. User-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spell a nightmare for debugging. A simple table:

  1. Fewer locks mean more contention (slow syscalls, CPU stalls (*)) and less parallelism.
  2. Fewer locks mean fewer problems when debugging multi-threading issues.
  3. More locks mean less contention and higher parallelism.
  4. More locks mean more chances of running into undebuggable deadlocks.

A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.
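One common middle ground is "lock striping": instead of one mutex (maximum contention) or one per element (maximum memory and debugging burden), use a fixed pool of N mutexes and hash each key onto one of them. A minimal sketch (the class name and interface are made up):

    #include <array>
    #include <mutex>
    #include <unordered_map>

    template <typename K, typename V, size_t N = 16>
    class StripedMap {
        std::array<std::mutex, N> locks;
        std::array<std::unordered_map<K, V>, N> shards;

        size_t shard(const K& key) const { return std::hash<K>{}(key) % N; }

    public:
        void put(const K& key, const V& value) {
            size_t s = shard(key);
            std::lock_guard<std::mutex> g(locks[s]);  // contends only within shard
            shards[s][key] = value;
        }

        bool get(const K& key, V& out) {
            size_t s = shard(key);
            std::lock_guard<std::mutex> g(locks[s]);
            auto it = shards[s].find(key);
            if (it == shards[s].end()) return false;
            out = it->second;
            return true;
        }
    };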


(*) The problem with having fewer, very frequently locked mutexes is that if you have too much locking in your application, it causes too much inter-CPU/core traffic to flush the mutex memory from the data caches of other CPUs in order to guarantee cache coherency. The cache flushes are like light-weight interrupts and are handled by CPUs transparently, but they do introduce so-called stalls (search for "stall").

And the stalls are what makes the locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide inter-CPU/core traffic statistics, some do not.)

To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and so avoid the stalls. That is the reason why cheap user-space locking, not subject to system limits, exists.
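A related mitigation, if you do end up with many hot locks, is to keep each of them on its own cache line so that unrelated locks do not ping-pong the same line between cores (false sharing). A sketch using C++17's std::hardware_destructive_interference_size (on older compilers people simply hard-code 64):

    #include <mutex>
    #include <new>  // std::hardware_destructive_interference_size

    // Pad each lock to a full cache line so two hot locks never share one.
    struct alignas(std::hardware_destructive_interference_size) PaddedLock {
        std::mutex m;
    };

    PaddedLock locks[64];  // e.g. one stripe per bucket, no false sharing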