Python – Early attempt to remove the Python GIL resulted in bad performance: why?

python

This post from Python's creator, Guido van Rossum, mentions an early attempt to remove the GIL from Python:

This has been tried before, with disappointing results, which is why
I'm reluctant to put much effort into it myself. In 1999 Greg Stein
(with Mark Hammond?) produced a fork of Python (1.5 I believe) that
removed the GIL, replacing it with fine-grained locks on all mutable
data structures. He also submitted patches that removed many of the
reliances on global mutable data structures, which I accepted.
However, after benchmarking, it was shown that even on the platform
with the fastest locking primitive (Windows at the time) it slowed
down single-threaded execution nearly two-fold, meaning that on two
CPUs, you could get just a little more work done without the GIL than
on a single CPU with the GIL. This wasn't enough, and Greg's patch
disappeared into oblivion. (See Greg's writeup on the performance.)

I can hardly argue with the actual results, but I really wonder why this happened. Presumably, the main reason removing the GIL from CPython is so difficult is its reference-counting memory management system. A typical Python program calls Py_INCREF and Py_DECREF thousands or millions of times, which would make them a key contention point if we were to wrap locks around them.
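To make that contention point concrete, here is a minimal sketch in C (since the machinery in question lives in the C runtime) of the naive "wrap a lock around every refcount update" approach. The struct layout, lock, and function name are simplified assumptions for illustration, not CPython's actual code:

    /* Hypothetical sketch only, not CPython source. A single global mutex
     * guards every refcount update, so every INCREF/DECREF in every thread
     * funnels through the same lock. */
    #include <pthread.h>

    typedef struct {
        long ob_refcnt;      /* same field name CPython uses for its count */
        /* ... type pointer and object data would follow ... */
    } object_t;

    static pthread_mutex_t refcnt_lock = PTHREAD_MUTEX_INITIALIZER;

    static inline void incref_locked(object_t *op)
    {
        pthread_mutex_lock(&refcnt_lock);   /* contended on every call */
        op->ob_refcnt++;
        pthread_mutex_unlock(&refcnt_lock);
    }

With Py_INCREF and Py_DECREF called millions of times, that single lock becomes one of the hottest points in the program.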

But I don't understand why adding atomic primitives would slow down a single-threaded program. Suppose we modified CPython so that the refcount variable in each Python object was an atomic primitive, and then did an atomic increment (a fetch-and-add instruction) whenever we needed to increment the reference count. This would make Python's reference counting thread-safe, and it shouldn't carry any performance penalty for a single-threaded application, because there would be no lock contention.
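A minimal sketch of that atomic-refcount idea, using C11 atomics; again, the names and layout are invented for illustration rather than taken from CPython:

    /* Hypothetical sketch only. A C11 atomic counter stands in for
     * CPython's ob_refcnt; fetch-and-add makes the update thread-safe
     * without any mutex and therefore without lock contention. */
    #include <stdatomic.h>

    typedef struct {
        atomic_long ob_refcnt;
    } object_t;

    static inline void incref_atomic(object_t *op)
    {
        /* atomic increment: safe even if many threads INCREF at once */
        atomic_fetch_add_explicit(&op->ob_refcnt, 1, memory_order_relaxed);
    }

    static inline int decref_atomic(object_t *op)
    {
        /* returns 1 when the count drops to zero and the object can be freed */
        return atomic_fetch_sub_explicit(&op->ob_refcnt, 1,
                                         memory_order_acq_rel) == 1;
    }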

But alas, many people who are smarter than me have tried and failed, so obviously I'm missing something here. What is wrong with the way I'm looking at this problem?

Best Answer

I am unfamiliar with the Greg Stein Python fork, so discount this comparison as a speculative historical analogy if you wish. But this was exactly the historical experience of many infrastructure codebases moving from single- to multi-threaded implementations.

Essentially every Unix implementation I studied in the 1990s--AIX, DEC OSF/1, DG/UX, DYNIX, HP-UX, IRIX, Solaris, SVR4, and SVR4 MP--went through exactly this kind of "we put in finer-grained locking--now it's slower!!" problem. The DBMSs I followed--DB2, Ingres, Informix, Oracle, and Sybase--all went through it too.

I have heard "these changes won't slow us down when we're running single-threaded" a million times. It never works out that way. The simple act of conditionally checking "are we running multithreaded, or not?" adds real overhead, especially on highly pipelined CPUs. The atomic operations and occasional spin-locks added to ensure the integrity of shared data structures have to be invoked very often, and they are slow. First-generation lock/synchronization primitives were especially slow. Most implementation teams eventually added several classes of primitives, in various "strengths," depending on how much interlock protection was needed in different places. Then they realized that where they had initially slapped down locking primitives was not really the right place, so they had to profile, design around the bottlenecks they found, and systematically roto-till the code. Some of these sticking points eventually got OS or hardware acceleration, but that whole evolution took 3-5 years, bare minimum. Meanwhile, the MP or MT versions were limping, performance-wise.
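As a rough illustration of that "are we running multithreaded?" check, here is a sketch in C; the flag, lock, and function names are invented for this example rather than drawn from any particular codebase:

    /* Illustrative sketch only. The extra branch and the conditional lock
     * acquisition creep into formerly single-threaded hot paths, and the
     * branch is paid for even when only one thread is running. */
    #include <pthread.h>
    #include <stdbool.h>

    static bool mt_mode = false;                  /* set once threads start */
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    void table_insert(void *table, void *item)
    {
        if (mt_mode)                              /* checked on every call */
            pthread_mutex_lock(&table_lock);

        /* ... the original single-threaded insert logic would go here ... */
        (void)table; (void)item;

        if (mt_mode)
            pthread_mutex_unlock(&table_lock);
    }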

Otherwise-sophisticated development teams have argued that such slowdowns are basically a persistent, intractable fact of life. IBM, for example, refused to SMP-enable AIX for at least five years after the competition, adamant that single-threaded was simply better. Sybase used some of the same arguments. The only reason some of these teams eventually came around was that single-threaded performance could no longer be reasonably improved at the CPU level; they were forced to either go MP/MT or accept having an increasingly uncompetitive product.

Active concurrency is HARD. And it's deceptive. Everyone rushes into it thinking "this won't be so bad." Then they hit the quicksand, and have to plod through. I've seen this happen with at least a dozen name-brand, well-funded, smart teams. Generally it seemed to take at least five years after choosing to multi-thread to "get back to where they should be, performance-wise" with MP/MT products; most were still meaningfully improving MP/MT efficiency/scalability even ten years after making the shift.

So my speculation is that, absent GvR's endorsement and support, no one has taken on the long trudge for Python and its GIL. Even if someone were to do so today, it'd be the Python 4.x timeframe before you could say "Wow! We're really over the MT hump!"

Perhaps there is some magic that separates Python and its runtime from all the other stateful infrastructure software--all the language runtimes, operating systems, transaction monitors, and database managers that have gone before. But if so, it's unique or nearly so. Everyone else removing a GIL-equivalent has taken five-plus years of hard, committed effort and investment to get from MT-not to MT-hot.