There are two basic ways to multi-thread in Java. Each logical task you create with these methods can run on a separate core when one is needed and available.
Method one: define a Thread object (or wrap a Runnable in one -- Thread's constructor can take a Runnable) and start it running with the Thread.start() method. It will execute on whatever core the OS gives it -- generally the least loaded one.
Tutorial: Defining and Starting Threads
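As a minimal sketch of method one (class and message text are just placeholders), a Runnable handed to a Thread and started:

```java
// Method one: hand a Runnable to a Thread and start it.
public class StartThread {
    public static void main(String[] args) throws InterruptedException {
        Runnable work = new Runnable() {
            public void run() {
                // Runs concurrently, on whatever core the OS picks.
                System.out.println("hello from " + Thread.currentThread().getName());
            }
        };
        Thread t = new Thread(work);
        t.start();   // begin concurrent execution
        t.join();    // wait for it to finish
    }
}
```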
Method two: define objects implementing the Runnable (if they don't return values) or Callable (if they do) interface, which contain your processing code. Pass these as tasks to an ExecutorService from the java.util.concurrent package. The java.util.concurrent.Executors class has a bunch of methods to create standard, useful kinds of ExecutorServices. Link to Executors tutorial.
From personal experience, the Executors fixed & cached thread pools are very good, although you'll want to tweak thread counts. Runtime.getRuntime().availableProcessors() can be used at run-time to count available cores. You'll need to shut down thread pools when your application is done, otherwise the application won't exit because the ThreadPool threads stay running.
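A small sketch of that lifecycle (class name and the toy task are made up): size a fixed pool to the core count, submit a Callable, then shut the pool down so the JVM can exit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PoolLifecycle {
    public static void main(String[] args) throws Exception {
        // One thread per available core is a common starting point.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        Future<Integer> f = pool.submit(new Callable<Integer>() {
            public Integer call() { return 6 * 7; }
        });
        System.out.println(f.get()); // prints 42

        // Without shutdown(), the pool's non-daemon threads keep the JVM alive.
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```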
Getting good multicore performance is sometimes tricky, and full of gotchas:
- Disk I/O slows down a LOT when run in parallel. Only one thread should do disk read/write at a time.
- Synchronization of objects provides safety to multi-threaded operations, but slows down work.
- If tasks are too trivial (small work bits that execute fast), the overhead of managing them in an ExecutorService costs more than you gain from multiple cores.
- Creating new Thread objects is slow. The ExecutorServices will try to re-use existing threads where possible.
- All sorts of crazy stuff can happen when multiple threads work on something. Keep your system simple and try to make tasks logically distinct and non-interacting.
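The granularity gotcha above can be sketched like this (class name and batch size are made up; tune the batch size for your workload): instead of submitting one task per tiny work item, batch items so each task does enough work to outweigh the scheduling overhead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Batching {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;

        // Submitting 100,000 one-element tasks would mostly measure overhead;
        // 10 batches of 10,000 gives each task real work to do.
        int batch = 10_000;
        List<Future<Long>> parts = new ArrayList<>();
        for (int start = 0; start < data.length; start += batch) {
            final int lo = start;
            final int hi = Math.min(start + batch, data.length);
            parts.add(exec.submit(() -> {
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> f : parts) total += f.get();
        System.out.println(total); // prints 100000
        exec.shutdown();
    }
}
```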
One other problem: controlling work is hard! A good practice is to have one manager thread that creates and submits tasks, plus a handful of worker threads with work queues -- which is exactly what an ExecutorService gives you.
I'm just touching on key points here -- multithreaded programming is considered one of the hardest programming subjects by many experts. It's non-intuitive, complex, and the abstractions are often weak.
Edit -- Example using ExecutorService:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public abstract class TaskThreader {

    // One task: runs the three abstract steps over its input.
    class DoStuff implements Callable<Object> {
        Object in;

        public DoStuff(Object input) {
            in = input;
        }

        public Object call() {
            in = doStep1(in);
            in = doStep2(in);
            in = doStep3(in);
            return in;
        }
    }

    public abstract Object doStep1(Object input);
    public abstract Object doStep2(Object input);
    public abstract Object doStep3(Object input);

    // Instance method: the inner DoStuff tasks need an enclosing instance
    // to reach the abstract doStep methods.
    public List<Object> runAll(List<Object> inputs) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Callable<Object>> tasks = new ArrayList<Callable<Object>>();
        for (Object input : inputs) {
            tasks.add(new DoStuff(input));
        }
        List<Future<Object>> futures = exec.invokeAll(tasks);
        exec.shutdown();
        List<Object> results = new ArrayList<Object>();
        for (Future<Object> f : futures) {
            results.add(f.get());
        }
        return results;
    }
}
x86 has a mostly strong memory model: all the usual stores/loads have release/acquire semantics implicitly. The only exception is SSE non-temporal store operations, which require an sfence to be ordered as usual. All read-modify-write (RMW) instructions with the LOCK prefix imply a full memory barrier, i.e. seq_cst.
Thus on x86, test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg, which implies the lock prefix). Other spin-lock implementations can use instructions like lock inc (or lock dec) if they need e.g. fairness. It is not possible to implement try_lock with a release/acquire fence (at least you'd need a standalone memory barrier, mfence, anyway).
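The same test-and-set idea can be sketched in Java (class name and loop counts are made up for illustration): AtomicBoolean.getAndSet typically compiles down to a lock xchg on x86, and the unlocking set(false) to a plain store with release semantics.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLockDemo {
    // Test-and-set spin lock: getAndSet(true) is the atomic RMW.
    static final AtomicBoolean locked = new AtomicBoolean(false);
    static int counter = 0; // protected by the spin lock

    static void lock()   { while (locked.getAndSet(true)) { /* spin */ } }
    static void unlock() { locked.set(false); }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock();
                counter++;
                unlock();
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start(); b.start();
        a.join();  b.join();
        System.out.println(counter); // prints 200000
    }
}
```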
clear is coded with lock and (for bit-wise) or lock xchg, though more efficient implementations would use a plain write (mov) instead of a locked instruction.
fetch_add is coded with lock add (or lock xadd when the previous value is needed).
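In Java terms (class name invented for the sketch), fetch_add corresponds to AtomicInteger.getAndAdd, which returns the previous value and is typically JIT-compiled to a single lock xadd on x86:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FetchAddDemo {
    public static void main(String[] args) {
        AtomicInteger v = new AtomicInteger(10);
        int old = v.getAndAdd(5);  // atomic fetch_add: returns the old value
        System.out.println(old + " " + v.get()); // prints "10 15"
    }
}
```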
Removing the lock prefix will not guarantee atomicity for RMW operations, so such operations cannot be interpreted strictly as having memory_order_relaxed in the C++ view. However, in practice you might want to access an atomic variable via a faster non-atomic operation when it is safe (e.g. in a constructor, or under a lock).
In our experience, it does not really matter which RMW atomic operation is performed; they all take almost the same number of cycles to execute (and mfence costs about half of a lock-prefixed operation). You can estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).
Best Answer
It's still not safe to assume your hardware will support unchecked updates.
If you're coding in something low-level (C/C++), use macros to wrap the basic operations. Then, if you're SURE a particular hardware configuration will work natively, you can always #define those operations to be trivial, just as if you didn't protect yourself.
But generally it's better to be right than fast.