Can you explain why multiple threads need locks on a single-core CPU?

multithreading

Assume these threads run on a single-core CPU. A CPU can only run one instruction per cycle, so even though the threads share the CPU, the computer ensures that only one instruction executes at a time. Doesn't that make locks unnecessary for multithreading?

Best Answer

This is best illustrated with an example.

Suppose we have a simple task that we want to perform multiple times in parallel, and we want to keep track globally of the number of times that the task has been performed, for example, counting hits on a web page.

When each thread gets to the point at which it's incrementing the count, its execution will look like this:

  1. Read the number of hits from memory into a processor register.
  2. Increment that number.
  3. Write that number back to memory.

Remember that any thread can be suspended at any point in this process. So if thread A performs step 1 and is then suspended, followed by thread B performing all three steps, then when thread A resumes, its restored register will hold a stale number of hits: it will happily increment that old number and write the incremented value back, overwriting thread B's update.

In addition, any number of other threads could have run during the time that thread A was suspended, so the count thread A writes at the end might be well below the correct count.

For that reason, it's necessary to ensure that once a thread performs step 1, it completes step 3 before any other thread is allowed to perform step 1. This can be accomplished by having every thread acquire a single lock before it begins the process and release the lock only after the process is complete, so that this "critical section" of code cannot be incorrectly interleaved and produce a wrong count.
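In Java, for example, a minimal sketch of that lock-based counter might look like this (the class and method names here are illustrative, not from any particular library):

    // A minimal sketch (names are illustrative): the synchronized
    // methods act as the single lock, so the read-increment-write
    // critical section can never be interleaved between threads.
    class HitCounter {
        private int hits = 0;

        synchronized void recordHit() {
            hits++; // safe: only one thread at a time can be inside this method
        }

        synchronized int getHits() {
            return hits;
        }
    }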

But what if the operation were atomic?

Yes, in the land of magical unicorns and rainbows, where the increment operation is atomic, locking would not be necessary for the above example.

It's important to realize, however, that we spend very little time in the world of magical unicorns and rainbows. In almost every programming language, the increment operation is broken down into the above three steps. That's because, even if the processor supports an atomic increment operation, that operation is significantly more expensive: it has to read from memory, modify the number, and write it back to memory...and usually the underlying atomic operation can fail, meaning the simple three-step sequence has to be replaced with a loop (as we'll see below).

Since, even in multithreaded code, many variables are kept local to a single thread, programs are much more efficient if compilers assume each variable is local to a single thread and leave it to programmers to protect the state that is shared between threads, especially since atomic operations are usually not enough to solve threading issues, as we'll see later.

Volatile variables

If we want to avoid locks for this particular problem, we first have to realize that the steps depicted in our first example aren't actually what happens in modern compiled code. Because compilers assume only one thread is modifying the variable, each thread will keep its own cached copy of the variable in a register until that register is needed for something else. As long as a thread has its cached copy, it assumes it doesn't need to go back to memory and read the value again (which would be expensive), and it won't write the variable back to memory as long as the value is kept in a register.

We can get back to the situation we described in the first example (with all the same threading problems we identified above) by marking the variable as volatile, which tells the compiler that this variable may be modified by other threads, and so must be read from or written to memory every time it's accessed or modified.

So a variable marked as volatile will not take us to the land of atomic increment operations; it only gets us as close as we thought we were already.
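To make this concrete, here is a sketch of what such a volatile counter might look like in Java (again, the names are illustrative). Note that the increment is still three separate steps, so all the threading problems from the first example remain:

    // A minimal sketch (names are illustrative): volatile forces every
    // access to go to main memory, but the increment itself is still
    // read, increment, write: three separate, interleavable steps.
    class VolatileHitCounter {
        private volatile int hits = 0;

        void recordHit() {
            hits++; // NOT atomic, even though hits is volatile
        }

        int getHits() {
            return hits;
        }
    }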

Making the increment atomic

Once we're using a volatile variable, we can make our increment operation atomic by using a low-level conditional set operation that most modern CPUs support (often called compare-and-set or compare-and-swap). This approach is taken, for example, in Java's AtomicInteger class:

    /**
     * Atomically increments by one the current value.
     *
     * @return the updated value
     */
    public final int incrementAndGet() {
        for (;;) {
            int current = get();
            int next = current + 1;
            if (compareAndSet(current, next))
                return next;
        }
    }

The above loop repeatedly performs the following steps, until step 3 succeeds:

  1. Read the value of a volatile variable directly from memory.
  2. Increment that value.
  3. Change the value (in main memory) if and only if its current value in main memory is the same as the value we initially read, using a special atomic operation.

If step 3 fails (because the value was changed by a different thread after step 1), it again reads the variable directly from main memory and tries again.

While the compare-and-swap operation is expensive, it's slightly better than using locking in this case, because if a thread is suspended after step 1, other threads that reach step 1 do not have to block and wait for the first thread, which can prevent costly context switching. When the first thread resumes, it will fail in its first attempt to write the variable, but will be able to continue by re-reading the variable, which again is likely less expensive than the context switch that would have been necessary with locking.

So, we can get to the land of atomic increments (or other operations on a single variable) without using actual locks, via compare and swap.
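For completeness, a hypothetical usage sketch (the class name is illustrative): AtomicInteger replaces the lock entirely for this single-variable case.

    import java.util.concurrent.atomic.AtomicInteger;

    // A minimal sketch (names are illustrative): the counter delegates
    // to AtomicInteger, whose incrementAndGet uses the compare-and-swap
    // loop shown above instead of a lock.
    class AtomicHitCounter {
        private final AtomicInteger hits = new AtomicInteger(0);

        void recordHit() {
            hits.incrementAndGet(); // atomic, lock-free increment
        }

        int getHits() {
            return hits.get();
        }
    }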

So when is locking strictly necessary?

If you need to modify more than one variable in a single atomic operation, then locking will be necessary; you won't find a special processor instruction for that.
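As an illustration (the names are invented for this example), consider moving a unit of work from one counter to another: the update touches two variables, so no single compare-and-swap can cover both, and a lock is required.

    // A minimal sketch (names are illustrative): both fields must change
    // together, so the whole pair of updates is guarded by one lock.
    class PairedCounters {
        private int pending = 10;
        private int completed = 0;

        synchronized void completeOne() {
            pending--;   // both updates happen under the same lock,
            completed++; // so other threads never see a half-done transfer
        }
    }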

As long as you're working on a single variable, however, and you're prepared for whatever work you've done to fail, forcing you to re-read the variable and start over again, compare-and-swap will be good enough.

Let's consider an example where each thread first adds 2 to the variable X, and then multiplies X by 2.

If X is initially one, and two threads run, we expect the result to be (((1 + 2) * 2) + 2) * 2 = 16.

However, if the threads interleave, we could, even with all operations being atomic, instead have both additions occur first, and the multiplications come after, resulting in (1 + 2 + 2) * 2 * 2 = 20.

This happens because addition and multiplication don't commute with each other: the result depends on the order in which the two operations are applied.

So, the operations themselves being atomic is not enough; we must make the combination of operations atomic.

We can do that either by using locking to serialize the process, or by using one local variable to store the value of X when we started our calculation and a second local variable for the intermediate steps, then using compare-and-swap to set the new value only if the current value of X is still the same as the original value. If the compare-and-swap fails, we start over by reading X and performing the calculations again.
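A sketch of that lock-free version in Java (names are illustrative), using AtomicInteger's compareAndSet:

    import java.util.concurrent.atomic.AtomicInteger;

    // A minimal sketch (names are illustrative): retry the whole
    // calculation until no other thread has changed X in the meantime.
    class Calculator {
        private final AtomicInteger x = new AtomicInteger(1);

        void addTwoThenDouble() {
            for (;;) {
                int original = x.get();          // the value of X when we started
                int result = (original + 2) * 2; // the intermediate calculation
                if (x.compareAndSet(original, result))
                    return; // X was unchanged, so our update succeeded
                // otherwise another thread changed X: re-read and recalculate
            }
        }
    }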

There are several trade-offs involved: as the calculations become longer, it becomes much more likely that the running thread will be suspended, and the value will be modified by another thread before we resume, meaning failures become much more likely, leading to wasted processor time. In the extreme case of large numbers of threads with very long running calculations, we might have 100 threads read the variable and be engaged in calculations, in which case only the first to finish will succeed in writing the new value, the other 99 will still complete their calculations, but discover upon completion that they can't update the value...at which point they'll each read the value and start the calculation over. We'd likely have the remaining 99 threads repeat the same problem, wasting vast quantities of processor time.

Full serialization of the critical section via locks would be much better in that situation: 99 threads would suspend when they didn't get the lock, and we'd run each thread in order of arrival at the locking point.

If serialization is not critical (as in our incrementing case), and the calculations that would be lost if updating the number fails are minimal, there may be a significant advantage to be gained from using the compare-and-swap operation, because that operation is less expensive than locking.