C++ – The cost of atomic counters and spinlocks on x86(_64)

atomic, c, c++11, memory-fences, multithreading

Preface

I recently came across some synchronization problems, which led me to spinlocks and atomic counters. Then I searched a bit more into how these work and found std::memory_order and memory barriers (mfence, lfence and sfence).

So now it seems that I should use acquire/release for the spinlocks and relaxed for the counters.

Some references

x86 MFENCE – Memory Fence
x86 LOCK – Assert LOCK# Signal

Question

What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with the default memory order (seq_cst) and with acquire/release/relaxed (in that order for those three operations)? What is the difference (which memory barriers go where) and the cost (how many CPU cycles)?

Purpose

I was just wondering how bad my old code really is (it does not specify a memory order, so seq_cst is used) and whether I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a good spinlock with acquire/release instead of mutexes in some places …or use something from the Boost library – I have avoided Boost so far).
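For illustration, a minimal sketch of what such an atomic_counter could look like (this interface is my own assumption; it wraps std::atomic<int> rather than deriving from it, which is simpler):

#include <atomic>

// Hypothetical atomic_counter sketch: every operation is relaxed,
// because the counter synchronizes nothing but itself and a slightly
// stale value on reads is acceptable.
class atomic_counter {
    std::atomic<int> value{0};
public:
    void operator++(int) { value.fetch_add(1, std::memory_order_relaxed); }
    void operator--(int) { value.fetch_sub(1, std::memory_order_relaxed); }
    int  get() const     { return value.load(std::memory_order_relaxed); }
};

Note that, as the listings below show, on x86 the increment compiles to lock addl either way; relaxed only pays off on weakly ordered architectures (e.g. ARM).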

My Knowledge

So far I understand that spinlocks protect more than just themselves (they protect some shared resource/memory as well), so there must be something that makes the memory view coherent for multiple threads/cores (that would be those acquire/release semantics and memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it; it is informative and can be a few cycles old, no problem). There is some LOCK prefix, and some instructions like xchg have it implicitly. Here my knowledge ends; I don't know how the caches and buses really work or what is behind them (but I know that modern CPUs can reorder instructions, execute them in parallel and use memory caches and some synchronization). Thank you for the explanation.

P.S.: I have an old 32-bit PC now and can only see lock addl and simple xchg, nothing else – all versions look the same (except unlock); memory_order makes no difference on my old PC (except for unlock, where release uses mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on unlock saves a few cycles, that's all.)

The Code:

#include <atomic>
using namespace std;

atomic_flag spinlock = ATOMIC_FLAG_INIT; // C++11 requires this initializer for a defined (clear) state
atomic<int> counter;

void inc1() {
    counter++;
}
void inc2() {
    counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
    while(spinlock.test_and_set()) ;
}
void lock2() {
    while(spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
    spinlock.clear();
}
void unlock2() {
    spinlock.clear(memory_order_release);
}

int main() {
    inc1();
    inc2();
    lock1();
    unlock1();
    lock2();
    unlock2();
}

g++ -std=c++11 -O1 -S (32-bit Cygwin, shortened output)

__Z4inc1v:
__Z4inc2v:
    lock addl   $1, _counter    ; both seq_cst and relaxed
    ret
__Z5lock1v:
__Z5lock2v:
    movl    $1, %edx
L5:
    movl    %edx, %eax
    xchgb   _spinlock, %al      ; both seq_cst and acquire
    testb   %al, %al
    jne L5
    rep ret
__Z7unlock1v:
    movl    $0, %eax
    xchgb   _spinlock, %al      ; seq_cst
    ret
__Z7unlock2v:
    movb    $0, _spinlock       ; release
    ret

UPDATE for x86_64: (note the mfence in unlock1)

_Z4inc1v:
_Z4inc2v:
    lock addl   $1, counter(%rip)   ; both seq_cst and relaxed
    ret
_Z5lock1v:
_Z5lock2v:
    movl    $1, %edx
.L5:
    movl    %edx, %eax
    xchgb   spinlock(%rip), %al     ; both seq_cst and acquire
    testb   %al, %al
    jne .L5
    ret
_Z7unlock1v:
    movb    $0, spinlock(%rip)
    mfence                          ; seq_cst
    ret
_Z7unlock2v:
    movb    $0, spinlock(%rip)      ; release
    ret

Best Answer

x86 has a mostly strong memory model: all the usual stores/loads implicitly have release/acquire semantics. The only exception is SSE non-temporal store operations, which require sfence to be ordered as usual. All read-modify-write (RMW) instructions with the LOCK prefix imply a full memory barrier, i.e. seq_cst.
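To make the non-temporal exception concrete, here is a hedged sketch (the publish function, buffer and flag are my own invention for illustration; _mm_stream_si32 and _mm_sfence are SSE2/SSE intrinsics):

#include <atomic>
#include <immintrin.h>

// Non-temporal (movnti) stores are not ordered like ordinary x86
// stores, so an sfence is required before the release store that
// publishes the data to other threads.
void publish(int* buf, int n, std::atomic<bool>& ready) {
    for (int i = 0; i < n; ++i)
        _mm_stream_si32(buf + i, i);  // non-temporal store, bypasses cache
    _mm_sfence();                     // order the NT stores first
    ready.store(true, std::memory_order_release);
}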

Thus on x86, we have

  • test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg, which implies the lock prefix). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with a release/acquire fence (at least you'd need a standalone memory barrier, mfence, anyway).
  • clear is coded with lock and (for the bit-wise case) or lock xchg, though more efficient implementations would use a plain write (mov) instead of a locked instruction – see the spinlock sketch after this list.
  • fetch_add is coded with lock add.
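Putting the first two points together, here is a minimal spinlock sketch (my own illustration, not a definitive implementation – it has no pause or backoff, which a production lock would want):

#include <atomic>

// Minimal spinlock sketch: test_and_set(acquire) compiles to xchg on
// x86; clear(release) compiles to a plain mov, with no locked
// instruction or mfence on the unlock path.
class spinlock_t {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            ;  // spin until the previous value was clear
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};

Since this satisfies the BasicLockable requirements, it works with std::lock_guard<spinlock_t>.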

Removing the lock prefix would not guarantee atomicity of RMW operations, so such operations cannot be interpreted strictly as having memory_order_relaxed in the C++ view. In practice, however, you might want to access an atomic variable via a faster non-atomic operation when it is safe (in the constructor, or under a lock).
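A sketch of that last point (inc_unshared is a hypothetical name): while a lock is held, or before the object is visible to other threads, the locked RMW can be replaced by a relaxed load/store pair, which is not an atomic increment but is safe here:

#include <atomic>

std::atomic<int> counter{0};

// Safe only while no other thread can access `counter` (e.g. under the
// spinlock): two relaxed accesses, which compile to plain movs on x86,
// instead of one lock add.
void inc_unshared() {
    counter.store(counter.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);
}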

In our experience, it does not really matter exactly which RMW atomic operation is performed; they all take almost the same number of cycles to execute (and an mfence costs about 0.5x of a locked operation). You can estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences) and the number of memory indirections (cache misses).
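To get a rough cycle count on a particular machine, something like the following can be used (a sketch assuming the GCC/Clang __rdtsc intrinsic; uncontended, single-threaded timing gives only a ballpark figure, since contention and cache misses dominate in real code):

#include <atomic>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc (GCC/Clang)

std::atomic<int> counter{0};

int main() {
    const int N = 1000000;
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);  // lock addl
    unsigned long long t1 = __rdtsc();
    std::printf("~%.1f cycles per lock add\n", double(t1 - t0) / N);
}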