On Intel an un-contended volatile read is quite cheap. If we consider the following simple case:
public static long l;
public static void run() {
if (l == -1)
System.exit(-1);
if (l == -2)
System.exit(-1);
}
Using Java 7's ability to print assembly code the run method looks something like:
# {method} 'run2' '()V' in 'Test2'
# [sp+0x10] (sp of caller)
0xb396ce80: mov %eax,-0x3000(%esp)
0xb396ce87: push %ebp
0xb396ce88: sub $0x8,%esp ;*synchronization entry
; - Test2::run2@-1 (line 33)
0xb396ce8e: mov $0xffffffff,%ecx
0xb396ce93: mov $0xffffffff,%ebx
0xb396ce98: mov $0x6fa2b2f0,%esi ; {oop('Test2')}
0xb396ce9d: mov 0x150(%esi),%ebp
0xb396cea3: mov 0x154(%esi),%edi ;*getstatic l
; - Test2::run@0 (line 33)
0xb396cea9: cmp %ecx,%ebp
0xb396ceab: jne 0xb396ceaf
0xb396cead: cmp %ebx,%edi
0xb396ceaf: je 0xb396cece ;*getstatic l
; - Test2::run@14 (line 37)
0xb396ceb1: mov $0xfffffffe,%ecx
0xb396ceb6: mov $0xffffffff,%ebx
0xb396cebb: cmp %ecx,%ebp
0xb396cebd: jne 0xb396cec1
0xb396cebf: cmp %ebx,%edi
0xb396cec1: je 0xb396ceeb ;*return
; - Test2::run@28 (line 40)
0xb396cec3: add $0x8,%esp
0xb396cec6: pop %ebp
0xb396cec7: test %eax,0xb7732000 ; {poll_return}
;... lines removed
If you look at the 2 references to getstatic, the first involves a load from memory, the second skips the load as the value is reused from the register(s) it is already loaded into (long is 64 bit and on my 32 bit laptop it uses 2 registers).
If we make the l variable volatile the resulting assembly is different.
# {method} 'run2' '()V' in 'Test2'
# [sp+0x10] (sp of caller)
0xb3ab9340: mov %eax,-0x3000(%esp)
0xb3ab9347: push %ebp
0xb3ab9348: sub $0x8,%esp ;*synchronization entry
; - Test2::run2@-1 (line 32)
0xb3ab934e: mov $0xffffffff,%ecx
0xb3ab9353: mov $0xffffffff,%ebx
0xb3ab9358: mov $0x150,%ebp
0xb3ab935d: movsd 0x6fb7b2f0(%ebp),%xmm0 ; {oop('Test2')}
0xb3ab9365: movd %xmm0,%eax
0xb3ab9369: psrlq $0x20,%xmm0
0xb3ab936e: movd %xmm0,%edx ;*getstatic l
; - Test2::run@0 (line 32)
0xb3ab9372: cmp %ecx,%eax
0xb3ab9374: jne 0xb3ab9378
0xb3ab9376: cmp %ebx,%edx
0xb3ab9378: je 0xb3ab93ac
0xb3ab937a: mov $0xfffffffe,%ecx
0xb3ab937f: mov $0xffffffff,%ebx
0xb3ab9384: movsd 0x6fb7b2f0(%ebp),%xmm0 ; {oop('Test2')}
0xb3ab938c: movd %xmm0,%ebp
0xb3ab9390: psrlq $0x20,%xmm0
0xb3ab9395: movd %xmm0,%edi ;*getstatic l
; - Test2::run@14 (line 36)
0xb3ab9399: cmp %ecx,%ebp
0xb3ab939b: jne 0xb3ab939f
0xb3ab939d: cmp %ebx,%edi
0xb3ab939f: je 0xb3ab93ba ;*return
;... lines removed
In this case both of the getstatic references to the variable l involves a load from memory, i.e. the value can not be kept in a register across multiple volatile reads. To ensure that there is an atomic read the value is read from main memory into an MMX register movsd 0x6fb7b2f0(%ebp),%xmm0
making the read operation a single instruction (from the previous example we saw that 64bit value would normally require two 32bit reads on a 32bit system).
So the overall cost of a volatile read will roughly equivalent of a memory load and can be as cheap as a L1 cache access. However if another core is writing to the volatile variable, the cache-line will be invalidated requiring a main memory or perhaps an L3 cache access. The actual cost will depend heavily on the CPU architecture. Even between Intel and AMD the cache coherency protocols are different.
First, you have to learn to think like a Language Lawyer.
The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.
The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.
Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.
The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.
Consider the following example, where a pair of global variables are accessed concurrently by two threads:
Global
int x, y;
Thread 1 Thread 2
x = 17; cout << y << " ";
y = 37; cout << x << endl;
What might Thread 2 output?
Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".
Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.
But with C++11, you can write this:
Global
atomic<int> x, y;
Thread 1 Thread 2
x.store(17); cout << y.load() << " ";
y.store(37); cout << x.load() << endl;
Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0
(if it runs before Thread 1), 37 17
(if it runs after Thread 1), or 0 17
(if it runs after Thread 1 assigns to x but before it assigns to y).
What it cannot print is 37 0
, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.
Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0
as output from this program, then you can write this:
Global
atomic<int> x, y;
Thread 1 Thread 2
x.store(17,memory_order_relaxed); cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed); cout << x.load(memory_order_relaxed) << endl;
The more modern the CPU, the more likely this is to be faster than the previous example.
Finally, if you just need to keep particular loads and stores in order, you can write:
Global
atomic<int> x, y;
Thread 1 Thread 2
x.store(17,memory_order_release); cout << y.load(memory_order_acquire) << " ";
y.store(37,memory_order_release); cout << x.load(memory_order_acquire) << endl;
This takes us back to the ordered loads and stores – so 37 0
is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)
Of course, if the only outputs you want to see are 0 0
or 37 17
, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).
So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.
Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.
For more on this stuff, see this blog post.
Best Answer
There are read barriers and write barriers; acquire barriers and release barriers. And more (io vs memory, etc).
The barriers are not there to control "latest" value or "freshness" of the values. They are there to control the relative ordering of memory accesses.
Write barriers control the order of writes. Because writes to memory are slow (compared to the speed of the CPU), there is usually a write-request queue where writes are posted before they 'really happen'. Although they are queued in order, while inside the queue the writes may be reordered. (So maybe 'queue' isn't the best name...) Unless you use write barriers to prevent the reordering.
Read barriers control the order of reads. Because of speculative execution (CPU looks ahead and loads from memory early) and because of the existence of the write buffer (the CPU will read a value from the write buffer instead of memory if it is there - ie the CPU thinks it just wrote X = 5, then why read it back, just see that it is still waiting to become 5 in the write buffer) reads may happen out of order.
This is true regardless of what the compiler tries to do with respect to the order of the generated code. ie 'volatile' in C++ won't help here, because it only tells the compiler to output code to re-read the value from "memory", it does NOT tell the CPU how/where to read it from (ie "memory" is many things at the CPU level).
So read/write barriers put up blocks to prevent reordering in the read/write queues (the read isn't usually so much of a queue, but the reordering effects are the same).
What kinds of blocks? - acquire and/or release blocks.
Acquire - eg read-acquire(x) will add the read of x into the read-queue and flush the queue (not really flush the queue, but add a marker saying don't reorder anything before this read, which is as if the queue was flushed). So later (in code order) reads can be reordered, but not before the read of x.
Release - eg write-release(x, 5) will flush (or marker) the queue first, then add the write-request to the write-queue. So earlier writes won't become reordered to happen after x = 5, but note that later writes can be reordered before x = 5.
Note that I paired the read with acquire and write with release because this is typical, but different combinations are possible.
Acquire and Release are considered 'half-barriers' or 'half-fences' because they only stop the reordering from going one way.
A full barrier (or full fence) applies both an acquire and a release - ie no reordering.
Typically for lockfree programming, or C# or java 'volatile', what you want/need is read-acquire and write-release.
ie
So, first of all, this is a bad way to program threads. Locks would be safer. But just to illustrate barriers...
After threadA() is done writing foo, it needs to write foo->ready LAST, really last, else other threads might see foo->ready early and get the wrong values of x/y/z. So we use a
write_release
on foo->ready, which, as mentioned above, effectively 'flushes' the write queue (ensuring x,y,z are committed) then adds the ready=true request to the queue. And then adds the bar=13 request. Note that since we just used a release barrier (not a full) bar=13 may get written before ready. But we don't care! ie we are assuming bar is not changing shared data.Now threadB() needs to know that when we say 'ready' we really mean ready. So we do a
read_acquire(foo->ready)
. This read is added to the read queue, THEN the queue is flushed. Note thatw = some_global
may also still be in the queue. So foo->ready may be read beforesome_global
. But again, we don't care, as it is not part of the important data that we are being so careful about. What we do care about is foo->x/y/z. So they are added to the read queue after the acquire flush/marker, guaranteeing that they are read only after reading foo->ready.Note also, that this is typically the exact same barriers used for locking and unlocking a mutex/CriticalSection/etc. (ie acquire on lock(), release on unlock() ).
So,
I'm pretty sure this (ie acquire/release) is exactly what MS docs say happens for read/writes of 'volatile' variables in C# (and optionally for MS C++, but this is non-standard). See http://msdn.microsoft.com/en-us/library/aa645755(VS.71).aspx including "A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it..."
I think java is the same, although I'm not as familiar. I suspect it is exactly the same, because you just don't typically need more guarantees than read-acquire/write-release.
In your question you were on the right track when thinking that it is really all about relative order - you just had the orderings backwards (ie "the values that are read are at least as up-to-date as the reads before the barrier? " - no, reads before the barrier are unimportant, its reads AFTER the barrier that are guaranteed to be AFTER, vice versa for writes).
And please note, as mentioned, reordering happens on both reads and writes, so only using a barrier on one thread and not the other WILL NOT WORK. ie a write-release isn't enough without the read-acquire. ie even if you write it in the right order, it could be read in the wrong order if you didn't use the read barriers to go with the write barriers.
And lastly, note that lock-free programming and CPU memory architectures can be actually much more complicated than that, but sticking with acquire/release will get you pretty far.