Atomic operation cost

Tags: atomic, cpu-architecture, lock-free, performance

What is the cost of an atomic operation (compare-and-swap, or atomic add/decrement)? How many cycles does it consume? Will it pause other processors on SMP or NUMA, or will it block memory accesses?
Will it flush the reorder buffer in an out-of-order CPU?

What effect will it have on the cache?

I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium.

Best Answer

I have looked for actual data for the past few days and found nothing. However, I did some research that compares the cost of atomic ops with the cost of cache misses.

Before the Pentium Pro (as described in the doc), the cost of the x86 LOCK prefix (including lock cmpxchg for atomic CAS) was a memory access (like a cache miss), plus stopping memory operations by other processors, plus any contention with other processors trying to LOCK the bus. Since the Pentium Pro, however, for normal write-back cacheable memory (all memory an app deals with, unless you talk directly to hardware), only the relevant cache line is blocked instead of all memory operations (based on the link in @osgx's answer).

i.e. the core delays answering MESI share and RFO requests for the line until after the store part of the actual locked operation. This is called a "cache lock", and only affects that one cache line. Other cores can be loading / storing or even CASing other lines at the same time.

Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer. (At least for the normal use-case where you do a pure load before the actual CAS.)

Before going into too much detail, I'll say that a LOCKed operation costs one cache miss plus the possible contention with other processors on the same cacheline, while CAS plus the preceding load (which is almost always required, except on mutexes, where you always CAS 0 and 1) can cost two cache misses.

He explains that a load + CAS on a single location can actually cost two cache misses, like Load-Linked/Store-Conditional (see there for the latter). His explanation relies on knowledge of the MESI cache coherence protocol. It uses 4 states for a cacheline: M(odified), E(xclusive), S(hared), I(nvalid) (hence the name MESI), explained below where needed. The scenario is the following:

- The LOAD causes a cache miss: the relevant cacheline is loaded from memory in the Shared state (i.e. other processors are still allowed to keep that cacheline in their caches; no changes are allowed in this state). Possible cost: 1 cache miss. This is skipped if the cacheline is already in the Shared, Exclusive or Modified state, i.e. the data is already in this CPU's L1 cache.
- The program calculates the new values to store,
- and it runs an atomic CAS instruction.
  - It has to avoid concurrent modification, so it must remove copies of the cacheline from the caches of other CPUs, to move the cacheline to the Exclusive state. Possible cost: 1 cache miss. This is not needed if the line is already exclusively owned, i.e. in the Exclusive or Modified state. In both states, no other CPU holds the cacheline, but in the Exclusive state it has not been modified (yet).
  - After this communication, the variable is modified in our CPU's local cache, at which point it is globally visible to all other CPUs (because their caches are coherent with ours). It will eventually be written to main memory according to the usual algorithms.
  - Other processors trying to read or modify that variable will first have to obtain that cacheline in Shared or Exclusive mode, and to do so will contact this processor and receive the updated version of the cacheline.

A LOCKed operation, by contrast, costs at most one cache miss (because the cacheline is requested directly in the Exclusive state).

In all cases, a cacheline request can be stalled by other processors already modifying the data.
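The two patterns discussed above (a plain load followed by CAS, versus a single LOCKed read-modify-write) can be sketched in Java. This is only an illustration under assumptions: the class and method names are hypothetical, and whether `getAndAdd` becomes a single `lock xadd` depends on the JVM and CPU (HotSpot on x86 typically does this, but that is not guaranteed here). The sketch shows the shape of the two patterns and checks that both give the same total; it does not measure cache misses.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class; names are illustrative and not from the answer.
final class CasVersusAdd {
    // Pattern 1: plain load, compute, then CAS. Under contention the CAS can
    // fail, so the loop re-reads and retries. Returns the number of retries.
    static long addWithCasLoop(AtomicLong cell, long delta) {
        long retries = 0;
        while (true) {
            long seen = cell.get();                       // load (line may arrive Shared)
            if (cell.compareAndSet(seen, seen + delta)) { // needs exclusive ownership
                return retries;
            }
            retries++;
        }
    }

    // Pattern 2: one atomic read-modify-write; it cannot fail, so no retry loop.
    static void addDirect(AtomicLong cell, long delta) {
        cell.getAndAdd(delta);
    }

    // Runs `threads` workers that each add 1 to a shared cell `perThread` times.
    static long run(int threads, int perThread, boolean useCasLoop) {
        AtomicLong cell = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    if (useCasLoop) {
                        addWithCasLoop(cell, 1);
                    } else {
                        addDirect(cell, 1);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return cell.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000, true));  // prints 400000
        System.out.println(run(4, 100_000, false)); // prints 400000
    }
}
```

Both variants are correct; the difference the answer describes is in how many times the cacheline may have to be fetched per update.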

Related Solutions

iOS – What’s the difference between the atomic and nonatomic attributes

The last two are identical; "atomic" is the default behavior (~~note that it is not actually a keyword; it is specified only by the absence of nonatomic~~ -- atomic was added as a keyword in recent versions of llvm/clang).

Assuming that you are @synthesizing the method implementations, atomic vs. non-atomic changes the generated code. If you are writing your own setter/getters, atomic/nonatomic/retain/assign/copy are merely advisory. (Note: @synthesize is now the default behavior in recent versions of LLVM. There is also no need to declare instance variables; they will be synthesized automatically, too, and will have an _ prepended to their name to prevent accidental direct access).

With "atomic", the synthesized setter/getter will ensure that a whole value is always returned from the getter or set by the setter, regardless of setter activity on any other thread. That is, if thread A is in the middle of the getter while thread B calls the setter, an actual viable value -- an autoreleased object, most likely -- will be returned to the caller in A.

In nonatomic, no such guarantees are made. Thus, nonatomic is considerably faster than "atomic".

What "atomic" does not do is make any guarantees about thread safety. If thread A is calling the getter simultaneously with thread B and C calling the setter with different values, thread A may get any one of the three values returned -- the one prior to any setters being called or either of the values passed into the setters in B and C. Likewise, the object may end up with the value from B or C, no way to tell.

Ensuring data integrity -- one of the primary challenges of multi-threaded programming -- is achieved by other means.

Adding to this:

atomicity of a single property also cannot guarantee thread safety when multiple dependent properties are in play.

Consider:

 @property(atomic, copy) NSString *firstName;
 @property(atomic, copy) NSString *lastName;
 @property(readonly, atomic, copy) NSString *fullName;

In this case, thread A could be renaming the object by calling setFirstName: and then calling setLastName:. In the meantime, thread B may call fullName in between thread A's two calls and will receive the new first name coupled with the old last name.

To address this, you need a transactional model. I.e. some other kind of synchronization and/or exclusion that allows one to exclude access to fullName while the dependent properties are being updated.
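As a rough illustration of such a transactional model, here is a Java analogue (not Objective-C, and not the author's code) in which both fields are updated and read under one lock. The `Person` class, its `rename` method and the sample names are assumptions made up for this sketch.

```java
// Hypothetical Java analogue of the Objective-C properties above.
final class Person {
    private final Object lock = new Object();
    private String firstName;
    private String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    // Both fields change under one lock, so no reader can observe
    // the new first name together with the old last name.
    void rename(String newFirst, String newLast) {
        synchronized (lock) {
            firstName = newFirst;
            lastName = newLast;
        }
    }

    // Readers take the same lock, so they see either the old pair or the new pair.
    String fullName() {
        synchronized (lock) {
            return firstName + " " + lastName;
        }
    }

    public static void main(String[] args) {
        Person p = new Person("Ada", "Lovelace");
        p.rename("Grace", "Hopper");
        System.out.println(p.fullName()); // prints "Grace Hopper"
    }
}
```

The key design choice is that the rename is one operation; exposing separate atomic setters for each field would reopen the window described above.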

Java – the difference between atomic / volatile / synchronized

You are specifically asking about how they internally work, so here you are:

No synchronization

private int counter;

public int getNextUniqueIndex() {
  return counter++; 
}

It basically reads the value from memory, increments it and puts it back to memory. This works in a single thread, but nowadays, in the era of multi-core, multi-CPU, multi-level caches, it won't work correctly. First of all it introduces a race condition (several threads can read the value at the same time), but also visibility problems. The value might only be stored in "local" CPU memory (some cache) and not be visible to other CPUs/cores (and thus threads). This is why many refer to a thread's local copy of a variable. It is very unsafe. Consider this popular but broken thread-stopping code:

private boolean stopped;

public void run() {
    while(!stopped) {
        //do some work
    }
}

public void pleaseStop() {
    stopped = true;
}

Add volatile to the stopped variable and it works fine: if any other thread modifies stopped via the pleaseStop() method, the working thread's while(!stopped) loop is guaranteed to see that change on its next check. BTW this is not a good way to interrupt a thread either, see: How to stop a thread that is running forever without any use and Stopping a specific java thread.
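A minimal runnable sketch of the volatile variant follows. The class name `StoppableWorker`, the helper `stopsWithin` and the timing values are assumptions added for illustration; only the `stopped` flag, `run()` and `pleaseStop()` come from the snippet above.

```java
// Hypothetical wrapper around the stop-flag pattern, with the flag made volatile.
final class StoppableWorker implements Runnable {
    // volatile: the write in pleaseStop() is visible to the loop in run()
    private volatile boolean stopped;
    private long iterations; // touched only by the worker thread

    @Override
    public void run() {
        while (!stopped) {
            iterations++; // stand-in for "do some work"
        }
    }

    void pleaseStop() {
        stopped = true;
    }

    // Starts a worker, asks it to stop, and reports whether it exited in time.
    static boolean stopsWithin(long timeoutMillis) {
        StoppableWorker worker = new StoppableWorker();
        Thread thread = new Thread(worker);
        thread.setDaemon(true); // do not keep the JVM alive if something goes wrong
        thread.start();
        try {
            Thread.sleep(50);   // let the loop spin for a moment
            worker.pleaseStop();
            thread.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !thread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(stopsWithin(2_000)); // prints true
    }
}
```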

`AtomicInteger`

private AtomicInteger counter = new AtomicInteger();

public int getNextUniqueIndex() {
  return counter.getAndIncrement();
}

The AtomicInteger class uses low-level CAS (compare-and-swap) CPU operations (no synchronization needed!). CAS lets you modify a particular variable only if its present value equals an expected value, and it reports whether the update succeeded. So when you execute getAndIncrement() it actually runs in a loop (simplified real implementation):

int current;
do {
  current = get();
} while(!compareAndSet(current, current + 1));
return current;

So basically: read; try to store incremented value; if not successful (the value is no longer equal to current), read and try again. The compareAndSet() is implemented in native code (assembly).
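The same read / compute / compareAndSet loop works for updates that have no dedicated method. As a sketch, here is a counter that only increments while it is below a limit. `BoundedCounter` and `tryIncrement` are hypothetical names, not JDK APIs; only `AtomicInteger.get()` and `compareAndSet()` are real library calls.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the JDK.
final class BoundedCounter {
    private final AtomicInteger value = new AtomicInteger();
    private final int limit;

    BoundedCounter(int limit) {
        this.limit = limit;
    }

    // Increments only while the counter is below the limit.
    // Returns true if this call performed the increment.
    boolean tryIncrement() {
        while (true) {
            int current = value.get();
            if (current >= limit) {
                return false; // limit reached, no CAS attempted
            }
            if (value.compareAndSet(current, current + 1)) {
                return true;  // our update won
            }
            // Another thread changed the value in between: re-read and retry.
        }
    }

    int get() {
        return value.get();
    }

    public static void main(String[] args) {
        BoundedCounter c = new BoundedCounter(3);
        for (int i = 0; i < 5; i++) {
            System.out.println(c.tryIncrement()); // true, true, true, false, false
        }
        System.out.println(c.get()); // prints 3
    }
}
```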

`volatile` without synchronization

private volatile int counter;

public int getNextUniqueIndex() {
  return counter++; 
}

This code is not correct. It fixes the visibility issue (volatile makes sure other threads can see changes made to counter) but it still has a race condition. This has been explained multiple times: pre/post-incrementation is not atomic.

The only side effect of volatile is "flushing" caches so that all other parties see the freshest version of the data. This is too strict in most situations; that is why volatile is not default.

`volatile` without synchronization (2)

volatile int i = 0;
void incIBy5() {
  i += 5;
}

The same problem as above, but even worse because i is not private. The race condition is still present. Why is it a problem? If, say, two threads run this code simultaneously, i might end up increased by 5 or by 10. However, you are guaranteed to see the change.
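The "+5 instead of +10" outcome can be reproduced deterministically by replaying the bad interleaving by hand in a single thread. This is a sketch: the class name and the `tempA`/`tempB` variables are made up to stand for the two threads' private copies.

```java
// Hypothetical single-threaded replay of the racy interleaving of i += 5.
final class LostUpdateReplay {
    static volatile int i = 0;

    // i += 5 is really "read i, add 5, write i". Here both reads happen
    // before either write, which is exactly what the race allows.
    static int replayBadInterleaving() {
        i = 0;
        int tempA = i;  // thread A reads 0
        int tempB = i;  // thread B reads 0
        i = tempA + 5;  // thread A writes 5
        i = tempB + 5;  // thread B writes 5 again: A's update is lost
        return i;
    }

    public static void main(String[] args) {
        System.out.println(replayBadInterleaving()); // prints 5, not 10
    }
}
```

Note that volatile does nothing to prevent this: every read and write above is individually visible, yet the combined update is still lost.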

Multiple independent `synchronized`

void incIBy5() {
  int temp;
  synchronized(i) { temp = i }
  synchronized(i) { i = temp + 5 }
}

Surprise, this code is incorrect as well. In fact, it is completely wrong. First of all you are synchronizing on i, which is about to be changed (moreover, i is a primitive, so I guess you are synchronizing on a temporary Integer created via autoboxing...) Completely flawed. You could also write:

synchronized(new Object()) {
  //thread-safe, SRSLy?
}

No two threads can enter the same synchronized block with the same lock. In this case (and similarly in your code) the lock object changes upon every execution, so synchronized effectively has no effect.
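One way to see this is to let two threads try to meet at a barrier while each is inside its synchronized block. In this sketch (class name, method name and the one-second timeout are all assumptions), the meeting succeeds when every thread locks a fresh object and fails when they share one lock, because the second thread cannot enter until the first has left.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: can two threads be inside the synchronized block together?
final class FreshLockDemo {
    static boolean bothInsideAtOnce(boolean sharedLock) {
        Object shared = new Object();
        CyclicBarrier meetInside = new CyclicBarrier(2);
        boolean[] met = new boolean[2];
        Thread[] threads = new Thread[2];
        for (int t = 0; t < 2; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                // Either one lock for everybody, or a brand-new lock per thread.
                Object lock = sharedLock ? shared : new Object();
                synchronized (lock) {
                    try {
                        meetInside.await(1, TimeUnit.SECONDS);
                        met[id] = true;  // the other thread was inside too
                    } catch (Exception e) {
                        met[id] = false; // timed out or barrier broken
                    }
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return met[0] && met[1];
    }

    public static void main(String[] args) {
        System.out.println(bothInsideAtOnce(false)); // fresh locks: no exclusion
        System.out.println(bothInsideAtOnce(true));  // shared lock: mutual exclusion
    }
}
```

With fresh locks the first call is expected to report true, and with the shared lock the second is expected to report false after the timeout expires.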

Even if you had used a final variable (or this) for synchronization, the code would still be incorrect. Two threads can first read i into temp under the lock, one after the other (both ending up with the same value in their local temp), then the first assigns a new value to i (say, from 1 to 6) and the other one does the same thing (from 1 to 6).

The synchronization must span from reading to assigning a value. Your first synchronization has no effect (reading an int is atomic) and the second as well. In my opinion, these are the correct forms:

synchronized void incIBy5() {
  i += 5;
}

void incIBy5() {
  synchronized(this) {
    i += 5;
  }
}

void incIBy5() {
  synchronized(this) {
    int temp = i;
    i = temp + 5;
  }
}
