Python – What are the differences between the threading and multiprocessing modules

multiprocessingmultithreadingparallel-processingprocesspython

I am learning how to use the threading and the multiprocessing modules in Python to run certain operations in parallel and speed up my code.

I am finding this hard (maybe because I don't have any theoretical background about it) to understand what the difference is between a threading.Thread() object and a multiprocessing.Process() one.

Also, it is not entirely clear to me how to instantiate a queue of jobs and having only 4 (for example) of them running in parallel, while the other wait for resources to free before being executed.

I find the examples in the documentation clear, but not very exhaustive; as soon as I try to complicate things a bit, I receive a lot of weird errors (like a method that can't be pickled, and so on).

So, when should I use the threading and multiprocessing modules?

Can you link me to some resources that explain the concepts behind these two modules and how to use them properly for complex tasks?

Best Answer

What Giulio Franco says is true for multithreading vs. multiprocessing in general.

However, Python^* has an added issue: There's a Global Interpreter Lock that prevents two threads in the same process from running Python code at the same time. This means that if you have 8 cores, and change your code to use 8 threads, it won't be able to use 800% CPU and run 8x faster; it'll use the same 100% CPU and run at the same speed. (In reality, it'll run a little slower, because there's extra overhead from threading, even if you don't have any shared data, but ignore that for now.)

There are exceptions to this. If your code's heavy computation doesn't actually happen in Python, but in some library with custom C code that does proper GIL handling, like a numpy app, you will get the expected performance benefit from threading. The same is true if the heavy computation is done by some subprocess that you run and wait on.

More importantly, there are cases where this doesn't matter. For example, a network server spends most of its time reading packets off the network, and a GUI app spends most of its time waiting for user events. One reason to use threads in a network server or GUI app is to allow you to do long-running "background tasks" without stopping the main thread from continuing to service network packets or GUI events. And that works just fine with Python threads. (In technical terms, this means Python threads give you concurrency, even though they don't give you core-parallelism.)

But if you're writing a CPU-bound program in pure Python, using more threads is generally not helpful.

Using separate processes has no such problems with the GIL, because each process has its own separate GIL. Of course you still have all the same tradeoffs between threads and processes as in any other languages—it's more difficult and more expensive to share data between processes than between threads, it can be costly to run a huge number of processes or to create and destroy them frequently, etc. But the GIL weighs heavily on the balance toward processes, in a way that isn't true for, say, C or Java. So, you will find yourself using multiprocessing a lot more often in Python than you would in C or Java.

Meanwhile, Python's "batteries included" philosophy brings some good news: It's very easy to write code that can be switched back and forth between threads and processes with a one-liner change.

If you design your code in terms of self-contained "jobs" that don't share anything with other jobs (or the main program) except input and output, you can use the concurrent.futures library to write your code around a thread pool like this:

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.submit(job, argument)
    executor.map(some_function, collection_of_independent_things)
    # ...

You can even get the results of those jobs and pass them on to further jobs, wait for things in order of execution or in order of completion, etc.; read the section on Future objects for details.

Now, if it turns out that your program is constantly using 100% CPU, and adding more threads just makes it slower, then you're running into the GIL problem, so you need to switch to processes. All you have to do is change that first line:

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:

The only real caveat is that your jobs' arguments and return values have to be pickleable (and not take too much time or memory to pickle) to be usable cross-process. Usually this isn't a problem, but sometimes it is.

But what if your jobs can't be self-contained? If you can design your code in terms of jobs that pass messages from one to another, it's still pretty easy. You may have to use threading.Thread or multiprocessing.Process instead of relying on pools. And you will have to create queue.Queue or multiprocessing.Queue objects explicitly. (There are plenty of other options—pipes, sockets, files with flocks, … but the point is, you have to do something manually if the automatic magic of an Executor is insufficient.)

But what if you can't even rely on message passing? What if you need two jobs to both mutate the same structure, and see each others' changes? In that case, you will need to do manual synchronization (locks, semaphores, conditions, etc.) and, if you want to use processes, explicit shared-memory objects to boot. This is when multithreading (or multiprocessing) gets difficult. If you can avoid it, great; if you can't, you will need to read more than someone can put into an SO answer.

From a comment, you wanted to know what's different between threads and processes in Python. Really, if you read Giulio Franco's answer and mine and all of our links, that should cover everything… but a summary would definitely be useful, so here goes:

Threads share data by default; processes do not.
As a consequence of (1), sending data between processes generally requires pickling and unpickling it.^**
As another consequence of (1), directly sharing data between processes generally requires putting it into low-level formats like Value, Array, and ctypes types.
Processes are not subject to the GIL.
On some platforms (mainly Windows), processes are much more expensive to create and destroy.
There are some extra restrictions on processes, some of which are different on different platforms. See Programming guidelines for details.
The threading module doesn't have some of the features of the multiprocessing module. (You can use multiprocessing.dummy to get most of the missing API on top of threads, or you can use higher-level modules like concurrent.futures and not worry about it.)

_{* It's not actually Python, the language, that has this issue, but CPython, the "standard" implementation of that language. Some other implementations don't have a GIL, like Jython.}

_{** If you're using the fork start method for multiprocessing—which you can on most non-Windows platforms—each child process gets any resources the parent had when the child was started, which can be another way to pass data to children.}

Classes as objects

Before understanding metaclasses, you need to master classes in Python. And Python has a very peculiar idea of what classes are, borrowed from the Smalltalk language.

In most languages, classes are just pieces of code that describe how to produce an object. That's kinda true in Python too:

>>> class ObjectCreator(object):
...       pass
...

>>> my_object = ObjectCreator()
>>> print(my_object)
<__main__.ObjectCreator object at 0x8974f2c>

But classes are more than that in Python. Classes are objects too.

Yes, objects.

As soon as you use the keyword class, Python executes it and creates an object. The instruction

>>> class ObjectCreator(object):
...       pass
...

creates in memory an object with the name ObjectCreator.

This object (the class) is itself capable of creating objects (the instances), and this is why it's a class.

But still, it's an object, and therefore:

you can assign it to a variable
you can copy it
you can add attributes to it
you can pass it as a function parameter

e.g.:

>>> print(ObjectCreator) # you can print a class because it's an object
<class '__main__.ObjectCreator'>
>>> def echo(o):
...       print(o)
...
>>> echo(ObjectCreator) # you can pass a class as a parameter
<class '__main__.ObjectCreator'>
>>> print(hasattr(ObjectCreator, 'new_attribute'))
False
>>> ObjectCreator.new_attribute = 'foo' # you can add attributes to a class
>>> print(hasattr(ObjectCreator, 'new_attribute'))
True
>>> print(ObjectCreator.new_attribute)
foo
>>> ObjectCreatorMirror = ObjectCreator # you can assign a class to a variable
>>> print(ObjectCreatorMirror.new_attribute)
foo
>>> print(ObjectCreatorMirror())
<__main__.ObjectCreator object at 0x8997b4c>

Creating classes dynamically

Since classes are objects, you can create them on the fly, like any object.

First, you can create a class in a function using class:

>>> def choose_class(name):
...     if name == 'foo':
...         class Foo(object):
...             pass
...         return Foo # return the class, not an instance
...     else:
...         class Bar(object):
...             pass
...         return Bar
...
>>> MyClass = choose_class('foo')
>>> print(MyClass) # the function returns a class, not an instance
<class '__main__.Foo'>
>>> print(MyClass()) # you can create an object from this class
<__main__.Foo object at 0x89c6d4c>

But it's not so dynamic, since you still have to write the whole class yourself.

Since classes are objects, they must be generated by something.

When you use the class keyword, Python creates this object automatically. But as with most things in Python, it gives you a way to do it manually.

Remember the function type? The good old function that lets you know what type an object is:

>>> print(type(1))
<type 'int'>
>>> print(type("1"))
<type 'str'>
>>> print(type(ObjectCreator))
<type 'type'>
>>> print(type(ObjectCreator()))
<class '__main__.ObjectCreator'>

Well, type has a completely different ability, it can also create classes on the fly. type can take the description of a class as parameters, and return a class.

(I know, it's silly that the same function can have two completely different uses according to the parameters you pass to it. It's an issue due to backward compatibility in Python)

type works this way:

type(name, bases, attrs)

Where:

name: name of the class
bases: tuple of the parent class (for inheritance, can be empty)
attrs: dictionary containing attributes names and values

e.g.:

>>> class MyShinyClass(object):
...       pass

can be created manually this way:

>>> MyShinyClass = type('MyShinyClass', (), {}) # returns a class object
>>> print(MyShinyClass)
<class '__main__.MyShinyClass'>
>>> print(MyShinyClass()) # create an instance with the class
<__main__.MyShinyClass object at 0x8997cec>

You'll notice that we use MyShinyClass as the name of the class and as the variable to hold the class reference. They can be different, but there is no reason to complicate things.

type accepts a dictionary to define the attributes of the class. So:

>>> class Foo(object):
...       bar = True

Can be translated to:

>>> Foo = type('Foo', (), {'bar':True})

And used as a normal class:

>>> print(Foo)
<class '__main__.Foo'>
>>> print(Foo.bar)
True
>>> f = Foo()
>>> print(f)
<__main__.Foo object at 0x8a9b84c>
>>> print(f.bar)
True

And of course, you can inherit from it, so:

>>>   class FooChild(Foo):
...         pass

would be:

>>> FooChild = type('FooChild', (Foo,), {})
>>> print(FooChild)
<class '__main__.FooChild'>
>>> print(FooChild.bar) # bar is inherited from Foo
True

Eventually, you'll want to add methods to your class. Just define a function with the proper signature and assign it as an attribute.

>>> def echo_bar(self):
...       print(self.bar)
...
>>> FooChild = type('FooChild', (Foo,), {'echo_bar': echo_bar})
>>> hasattr(Foo, 'echo_bar')
False
>>> hasattr(FooChild, 'echo_bar')
True
>>> my_foo = FooChild()
>>> my_foo.echo_bar()
True

And you can add even more methods after you dynamically create the class, just like adding methods to a normally created class object.

>>> def echo_bar_more(self):
...       print('yet another method')
...
>>> FooChild.echo_bar_more = echo_bar_more
>>> hasattr(FooChild, 'echo_bar_more')
True

You see where we are going: in Python, classes are objects, and you can create a class on the fly, dynamically.

This is what Python does when you use the keyword class, and it does so by using a metaclass.

What are metaclasses (finally)

Metaclasses are the 'stuff' that creates classes.

You define classes in order to create objects, right?

But we learned that Python classes are objects.

Well, metaclasses are what create these objects. They are the classes' classes, you can picture them this way:

MyClass = MetaClass()
my_object = MyClass()

You've seen that type lets you do something like this:

MyClass = type('MyClass', (), {})

It's because the function type is in fact a metaclass. type is the metaclass Python uses to create all classes behind the scenes.

Now you wonder "why the heck is it written in lowercase, and not Type?"

Well, I guess it's a matter of consistency with str, the class that creates strings objects, and int the class that creates integer objects. type is just the class that creates class objects.

You see that by checking the __class__ attribute.

Everything, and I mean everything, is an object in Python. That includes integers, strings, functions and classes. All of them are objects. And all of them have been created from a class:

>>> age = 35
>>> age.__class__
<type 'int'>
>>> name = 'bob'
>>> name.__class__
<type 'str'>
>>> def foo(): pass
>>> foo.__class__
<type 'function'>
>>> class Bar(object): pass
>>> b = Bar()
>>> b.__class__
<class '__main__.Bar'>

Now, what is the __class__ of any __class__ ?

>>> age.__class__.__class__
<type 'type'>
>>> name.__class__.__class__
<type 'type'>
>>> foo.__class__.__class__
<type 'type'>
>>> b.__class__.__class__
<type 'type'>

So, a metaclass is just the stuff that creates class objects.

You can call it a 'class factory' if you wish.

type is the built-in metaclass Python uses, but of course, you can create your own metaclass.

The `metaclass` attribute

In Python 2, you can add a __metaclass__ attribute when you write a class (see next section for the Python 3 syntax):

class Foo(object):
    __metaclass__ = something...
    [...]

If you do so, Python will use the metaclass to create the class Foo.

Careful, it's tricky.

You write class Foo(object) first, but the class object Foo is not created in memory yet.

Python will look for __metaclass__ in the class definition. If it finds it, it will use it to create the object class Foo. If it doesn't, it will use type to create the class.

Read that several times.

When you do:

class Foo(Bar):
    pass

Python does the following:

Is there a __metaclass__ attribute in Foo?

If yes, create in-memory a class object (I said a class object, stay with me here), with the name Foo by using what is in __metaclass__.

If Python can't find __metaclass__, it will look for a __metaclass__ at the MODULE level, and try to do the same (but only for classes that don't inherit anything, basically old-style classes).

Then if it can't find any __metaclass__ at all, it will use the Bar's (the first parent) own metaclass (which might be the default type) to create the class object.

Be careful here that the __metaclass__ attribute will not be inherited, the metaclass of the parent (Bar.__class__) will be. If Bar used a __metaclass__ attribute that created Bar with type() (and not type.__new__()), the subclasses will not inherit that behavior.

Now the big question is, what can you put in __metaclass__?

The answer is something that can create a class.

And what can create a class? type, or anything that subclasses or uses it.

Metaclasses in Python 3

The syntax to set the metaclass has been changed in Python 3:

class Foo(object, metaclass=something):
    ...

i.e. the __metaclass__ attribute is no longer used, in favor of a keyword argument in the list of base classes.

The behavior of metaclasses however stays largely the same.

One thing added to metaclasses in Python 3 is that you can also pass attributes as keyword-arguments into a metaclass, like so:

class Foo(object, metaclass=something, kwarg1=value1, kwarg2=value2):
    ...

Read the section below for how Python handles this.

Custom metaclasses

The main purpose of a metaclass is to change the class automatically, when it's created.

You usually do this for APIs, where you want to create classes matching the current context.

Imagine a stupid example, where you decide that all classes in your module should have their attributes written in uppercase. There are several ways to do this, but one way is to set __metaclass__ at the module level.

This way, all classes of this module will be created using this metaclass, and we just have to tell the metaclass to turn all attributes to uppercase.

Luckily, __metaclass__ can actually be any callable, it doesn't need to be a formal class (I know, something with 'class' in its name doesn't need to be a class, go figure... but it's helpful).

So we will start with a simple example, by using a function.

# the metaclass will automatically get passed the same argument
# that you usually pass to `type`
def upper_attr(future_class_name, future_class_parents, future_class_attrs):
    """
      Return a class object, with the list of its attribute turned
      into uppercase.
    """
    # pick up any attribute that doesn't start with '__' and uppercase it
    uppercase_attrs = {
        attr if attr.startswith("__") else attr.upper(): v
        for attr, v in future_class_attrs.items()
    }

    # let `type` do the class creation
    return type(future_class_name, future_class_parents, uppercase_attrs)

__metaclass__ = upper_attr # this will affect all classes in the module

class Foo(): # global __metaclass__ won't work with "object" though
    # but we can define __metaclass__ here instead to affect only this class
    # and this will work with "object" children
    bar = 'bip'

Let's check:

>>> hasattr(Foo, 'bar')
False
>>> hasattr(Foo, 'BAR')
True
>>> Foo.BAR
'bip'

Now, let's do exactly the same, but using a real class for a metaclass:

# remember that `type` is actually a class like `str` and `int`
# so you can inherit from it
class UpperAttrMetaclass(type):
    # __new__ is the method called before __init__
    # it's the method that creates the object and returns it
    # while __init__ just initializes the object passed as parameter
    # you rarely use __new__, except when you want to control how the object
    # is created.
    # here the created object is the class, and we want to customize it
    # so we override __new__
    # you can do some stuff in __init__ too if you wish
    # some advanced use involves overriding __call__ as well, but we won't
    # see this
    def __new__(upperattr_metaclass, future_class_name,
                future_class_parents, future_class_attrs):
        uppercase_attrs = {
            attr if attr.startswith("__") else attr.upper(): v
            for attr, v in future_class_attrs.items()
        }
        return type(future_class_name, future_class_parents, uppercase_attrs)

Let's rewrite the above, but with shorter and more realistic variable names now that we know what they mean:

class UpperAttrMetaclass(type):
    def __new__(cls, clsname, bases, attrs):
        uppercase_attrs = {
            attr if attr.startswith("__") else attr.upper(): v
            for attr, v in attrs.items()
        }
        return type(clsname, bases, uppercase_attrs)

You may have noticed the extra argument cls. There is nothing special about it: __new__ always receives the class it's defined in, as the first parameter. Just like you have self for ordinary methods which receive the instance as the first parameter, or the defining class for class methods.

But this is not proper OOP. We are calling type directly and we aren't overriding or calling the parent's __new__. Let's do that instead:

class UpperAttrMetaclass(type):
    def __new__(cls, clsname, bases, attrs):
        uppercase_attrs = {
            attr if attr.startswith("__") else attr.upper(): v
            for attr, v in attrs.items()
        }
        return type.__new__(cls, clsname, bases, uppercase_attrs)

We can make it even cleaner by using super, which will ease inheritance (because yes, you can have metaclasses, inheriting from metaclasses, inheriting from type):

class UpperAttrMetaclass(type):
    def __new__(cls, clsname, bases, attrs):
        uppercase_attrs = {
            attr if attr.startswith("__") else attr.upper(): v
            for attr, v in attrs.items()
        }
        return super(UpperAttrMetaclass, cls).__new__(
            cls, clsname, bases, uppercase_attrs)

Oh, and in Python 3 if you do this call with keyword arguments, like this:

class Foo(object, metaclass=MyMetaclass, kwarg1=value1):
    ...

It translates to this in the metaclass to use it:

class MyMetaclass(type):
    def __new__(cls, clsname, bases, dct, kwargs1=default):
        ...

That's it. There is really nothing more about metaclasses.

The reason behind the complexity of the code using metaclasses is not because of metaclasses, it's because you usually use metaclasses to do twisted stuff relying on introspection, manipulating inheritance, vars such as __dict__, etc.

Indeed, metaclasses are especially useful to do black magic, and therefore complicated stuff. But by themselves, they are simple:

intercept a class creation
modify the class
return the modified class

Why would you use metaclasses classes instead of functions?

Since __metaclass__ can accept any callable, why would you use a class since it's obviously more complicated?

There are several reasons to do so:

The intention is clear. When you read UpperAttrMetaclass(type), you know what's going to follow
You can use OOP. Metaclass can inherit from metaclass, override parent methods. Metaclasses can even use metaclasses.
Subclasses of a class will be instances of its metaclass if you specified a metaclass-class, but not with a metaclass-function.
You can structure your code better. You never use metaclasses for something as trivial as the above example. It's usually for something complicated. Having the ability to make several methods and group them in one class is very useful to make the code easier to read.
You can hook on __new__, __init__ and __call__. Which will allow you to do different stuff, Even if usually you can do it all in __new__, some people are just more comfortable using __init__.
These are called metaclasses, damn it! It must mean something!

Why would you use metaclasses?

Now the big question. Why would you use some obscure error-prone feature?

Well, usually you don't:

Metaclasses are deeper magic that 99% of users should never worry about it. If you wonder whether you need them, you don't (the people who actually need them to know with certainty that they need them and don't need an explanation about why).

Python Guru Tim Peters

The main use case for a metaclass is creating an API. A typical example of this is the Django ORM. It allows you to define something like this:

class Person(models.Model):
    name = models.CharField(max_length=30)
    age = models.IntegerField()

But if you do this:

person = Person(name='bob', age='35')
print(person.age)

It won't return an IntegerField object. It will return an int, and can even take it directly from the database.

This is possible because models.Model defines __metaclass__ and it uses some magic that will turn the Person you just defined with simple statements into a complex hook to a database field.

Django makes something complex look simple by exposing a simple API and using metaclasses, recreating code from this API to do the real job behind the scenes.

The last word

First, you know that classes are objects that can create instances.

Well, in fact, classes are themselves instances. Of metaclasses.

>>> class Foo(object): pass
>>> id(Foo)
142630324

Everything is an object in Python, and they are all either instance of classes or instances of metaclasses.

Except for type.

type is actually its own metaclass. This is not something you could reproduce in pure Python, and is done by cheating a little bit at the implementation level.

Secondly, metaclasses are complicated. You may not want to use them for very simple class alterations. You can change classes by using two different techniques:

monkey patching
class decorators

99% of the time you need class alteration, you are better off using these.

But 98% of the time, you don't need class alteration at all.

Python – Difference between staticmethod and classmethod

Maybe a bit of example code will help: Notice the difference in the call signatures of foo, class_foo and static_foo:

class A(object):
    def foo(self, x):
        print(f"executing foo({self}, {x})")

    @classmethod
    def class_foo(cls, x):
        print(f"executing class_foo({cls}, {x})")

    @staticmethod
    def static_foo(x):
        print(f"executing static_foo({x})")

a = A()

Below is the usual way an object instance calls a method. The object instance, a, is implicitly passed as the first argument.

a.foo(1)
# executing foo(<__main__.A object at 0xb7dbef0c>, 1)

With classmethods, the class of the object instance is implicitly passed as the first argument instead of self.

a.class_foo(1)
# executing class_foo(<class '__main__.A'>, 1)

You can also call class_foo using the class. In fact, if you define something to be a classmethod, it is probably because you intend to call it from the class rather than from a class instance. A.foo(1) would have raised a TypeError, but A.class_foo(1) works just fine:

A.class_foo(1)
# executing class_foo(<class '__main__.A'>, 1)

One use people have found for class methods is to create inheritable alternative constructors.

With staticmethods, neither self (the object instance) nor cls (the class) is implicitly passed as the first argument. They behave like plain functions except that you can call them from an instance or the class:

a.static_foo(1)
# executing static_foo(1)

A.static_foo('hi')
# executing static_foo(hi)

Staticmethods are used to group functions which have some logical connection with a class to the class.

foo is just a function, but when you call a.foo you don't just get the function, you get a "partially applied" version of the function with the object instance a bound as the first argument to the function. foo expects 2 arguments, while a.foo only expects 1 argument.

a is bound to foo. That is what is meant by the term "bound" below:

print(a.foo)
# <bound method A.foo of <__main__.A object at 0xb7d52f0c>>

With a.class_foo, a is not bound to class_foo, rather the class A is bound to class_foo.

print(a.class_foo)
# <bound method type.class_foo of <class '__main__.A'>>

Here, with a staticmethod, even though it is a method, a.static_foo just returns a good 'ole function with no arguments bound. static_foo expects 1 argument, and a.static_foo expects 1 argument too.

print(a.static_foo)
# <function static_foo at 0xb7d479cc>

And of course the same thing happens when you call static_foo with the class A instead.

print(A.static_foo)
# <function static_foo at 0xb7d479cc>