How expensive are kernel context switches compared to userspace context switches

kernelmultithreading

According to C10k and this paper, throughput of 1-thread-per-connection servers degrade as more and more clients connect and more and more threads are created. According to those two sources, this is because the more threads exist, the more time is spent on context switching compared to actual work done by those threads. Evented servers don't seem to suffer as much from performance degredation at high connection counts.

However, evented servers also do context switches between clients, they just do it in userspace.

Why are these userspace context switches faster than kernel thread context switches?
What exactly does a kernel context switch do that's so much more expensive?
How expensive is a kernel context switch exactly? How much time does it take?
Does kernel context switching time depend on the number of threads?

I'm mostly interested in how the Linux kernel handles context switching but information about other OSes is welcome too.

Best Answer

Why are these userspace context switches faster than kernel thread context switches?

Because the CPU does not need to switch to kernel mode and back to user mode.

What exactly does a kernel context switch do that's so much more expensive?

Mostly the switch to kernel mode. IIRC, the page tables are the same in kernel mode and user mode in Linux, so at least there is no TLB invalidation penalty.

How expensive is a kernel context switch exactly? How much time does it take?

Needs to be measured and can vary from machine to machine. I guess that a typical desktop/server machine these days can do a few hundred thousands of context switches per second, probably a few million.

Does kernel context switching time depend on the number of threads?

Depends on how the kernel scheduler handles this. AFAIK, in Linux it is pretty efficient, even with large thread counts, but more threads means more memory usage means more cache pressure and thus likely lower performance. I also expect some overhead involved in the handling of thousands of sockets.

Related Solutions

Linux – Context switches much slower in new linux kernels

The solution to the bad thread wake up performance problem in recent kernels has to do with the switch to the intel_idle cpuidle driver from acpi_idle, the driver used in older kernels. Sadly, the intel_idle driver ignores the user's BIOS configuration for the C-states and dances to its own tune. In other words, even if you completely disable all C states in your PC's (or server's) BIOS, this driver will still force them on during periods of brief inactivity, which are almost always happening unless an all core consuming synthetic benchmark (e.g., stress) is running. You can monitor C state transitions, along with other useful information related to processor frequencies, using the wonderful Google i7z tool on most compatible hardware.

To see which cpuidle driver is currently active in your setup, just cat the current_driver file in the cpuidle section of /sys/devices/system/cpu as follows:

cat /sys/devices/system/cpu/cpuidle/current_driver

If you want your modern Linux OS to have the lowest context switch latency possible, add the following kernel boot parameters to disable all of these power saving features:

On Ubuntu 12.04, you can do this by adding them to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub and then running update-grub. The boot parameters to add are:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

Here are the gory details about what the three boot options do:

Setting intel_idle.max_cstate to zero will either revert your cpuidle driver to acpi_idle (at least per the documentation of the option), or disable it completely. On my box it is completely disabled (i.e., displaying the current_driver file in /sys/devices/system/cpu/cpuidle produces an output of none). In this case the second boot option, processor.max_cstate=0 is unnecessary. However, the documentation states that setting max_cstate to zero for the intel_idle driver should revert the OS to the acpi_idle driver. Therefore, I put in the second boot option just in case.

The processor.max_cstate option sets the maximum C state for the acpi_idle driver to zero, hopefully disabling it as well. I do not have a system that I can test this on, because intel_idle.max_cstate=0 completely knocks out the cpuidle driver on all of the hardware available to me. However, if your installation does revert you from intel_idle to acpi_idle with just the first boot option, please let me know if the second option, processor.max_cstate did what it was documented to do in the comments so that I can update this answer.

Finally, the last of the three parameters, idle=poll is a real power hog. It will disable C1/C1E, which will remove the final remaining bit of latency at the expense of a lot more power consumption, so use that one only when it's really necessary. For most this will be overkill, since the C1* latency is not all that large. Using my test application running on the hardware I described in the original question, the latency went from 9 us to 3 us. This is certainly a significant reduction for highly latency sensitive applications (e.g., financial trading, high precision telemetry/tracking, high freq. data acquisition, etc...), but may not be worth the incurred electrical power hit for the vast majority of desktop apps. The only way to know for sure is to profile your application's improvement in performance vs. the actual increase in power consumption/heat of your hardware and weigh the tradeoffs.

Update:

After additional testing with various idle=* parameters, I have discovered that setting idle to mwait if supported by your hardware is a much better idea. It seems that the use of the MWAIT/MONITOR instructions allows the CPU to enter C1E without any noticeable latency being added to the thread wake up time. With idle=mwait, you will get cooler CPU temperatures (as compared to idle=poll), less power use and still retain the excellent low latencies of a polling idle loop. Therefore, my updated recommended set of boot parameters for low CPU thread wake up latency based on these findings is:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait

The use of idle=mwait instead of idle=poll may also help with the initiation of Turbo Boost (by helping the CPU stay below its TDP [Thermal Design Power]) and hyperthreading (for which MWAIT is the ideal mechanism for not consuming an entire physical core while at the same time avoiding the higher C states). This has yet to be proven in testing, however, which I will continue to do.

Update 2:

The mwait idle option has been removed from newer 3.x kernels (thanks to user ck_ for the update). That leaves us with two options:

idle=halt - Should work as well as mwait, but test to be sure that this is the case with your hardware. The HLT instruction is almost equivalent to an MWAIT with state hint 0. The problem lies in the fact that an interrupt is required to get out of a HLT state, while a memory write (or interrupt) can be used to get out of the MWAIT state. Depending on what the Linux Kernel uses in its idle loop, this can make MWAIT potentially more efficient. So, as I said test/profile and see if it meets your latency needs...

and

idle=poll - The highest performance option, at the expense of power and heat.

C++ – How are user-level threads scheduled/created, and how are kernel level threads created

This is prefaced by the top comments.

The documentation you're reading is generic [not linux specific] and a bit outdated. And, more to the point, it is using different terminology. That is, I believe, the source of the confusion. So, read on ...

What it calls a "user-level" thread is what I'm calling an [outdated] LWP thread. What it calls a "kernel-level" thread is what is called a native thread in linux. Under linux, what is called a "kernel" thread is something else altogether [See below].

using pthreads create threads in the userspace, and the kernel is not aware about this and view it as a single process only, unaware of how many threads are inside.

This was how userspace threads were done prior to the NPTL (native posix threads library). This is also what SunOS/Solaris called an LWP lightweight process.

There was one process that multiplexed itself and created threads. IIRC, it was called the thread master process [or some such]. The kernel was not aware of this. The kernel didn't yet understand or provide support for threads.

But, because, these "lightweight" threads were switched by code in the userspace based thread master (aka "lightweight process scheduler") [just a special user program/process], they were very slow to switch context.

Also, before the advent of "native" threads, you might have 10 processes. Each process gets 10% of the CPU. If one of the processes was an LWP that had 10 threads, these threads had to share that 10% and, thus, got only 1% of the CPU each.

All this was replaced by the "native" threads that the kernel's scheduler is aware of. This changeover was done 10-15 years ago.

Now, with the above example, we have 20 threads/processes that each get 5% of the CPU. And, the context switch is much faster.

It is still possible to have an LWP system under a native thread, but, now, that is a design choice, rather than a necessity.

Further, LWP works great if each thread "cooperates". That is, each thread loop periodically makes an explicit call to a "context switch" function. It is voluntarily relinquishing the process slot so another LWP can run.

However, the pre-NPTL implementation in glibc also had to [forcibly] preempt LWP threads (i.e. implement timeslicing). I can't remember the exact mechanism used, but, here's an example. The thread master had to set an alarm, go to sleep, wake up and then send the active thread a signal. The signal handler would effect the context switch. This was messy, ugly, and somewhat unreliable.

Joachim mentioned pthread_create function creates a kernel thread

That is [technically] incorrect to call it a kernel thread. pthread_create creates a native thread. This is run in userspace and vies for timeslices on an equal footing with processes. Once created there is little difference between a thread and a process.

The primary difference is that a process has its own unique address space. A thread, however, is a process that shares its address space with other process/threads that are part of the same thread group.

If it doesn't create a kernel level thread, then how are kernel threads created from userspace programs?

Kernel threads are not userspace threads, NPTL, native, or otherwise. They are created by the kernel via the kernel_thread function. They run as part of the kernel and are not associated with any userspace program/process/thread. They have full access to the machine. Devices, MMU, etc. Kernel threads run in the highest privilege level: ring 0. They also run in the kernel's address space and not the address space of any user process/thread.

A userspace program/process may not create a kernel thread. Remember, it creates a native thread using pthread_create, which invokes the clone syscall to do so.

Threads are useful to do things, even for the kernel. So, it runs some of its code in various threads. You can see these threads by doing ps ax. Look and you'll see kthreadd, ksoftirqd, kworker, rcu_sched, rcu_bh, watchdog, migration, etc. These are kernel threads and not programs/processes.

UPDATE:

You mentioned that kernel doesn't know about user threads.

Remember that, as mentioned above, there are two "eras".

(1) Before the kernel got thread support (circa 2004?). This used the thread master (which, here, I'll call the LWP scheduler). The kernel just had the fork syscall.

(2) All kernels after that which do understand threads. There is no thread master, but, we have pthreads and the clone syscall. Now, fork is implemented as clone. clone is similar to fork but takes some arguments. Notably, a flags argument and a child_stack argument.

More on this below ...

then, how is it possible for user level threads to have individual stacks?

There is nothing "magic" about a processor stack. I'll confine discussion [mostly] to x86, but this would be applicable to any architecture, even those that don't even have a stack register (e.g. 1970's era IBM mainframes, such as the IBM System 370)

Under x86, the stack pointer is %rsp. The x86 has push and pop instructions. We use these to save and restore things: push %rcx and [later] pop %rcx.

But, suppose the x86 did not have %rsp or push/pop instructions? Could we still have a stack? Sure, by convention. We [as programmers] agree that (e.g.) %rbx is the stack pointer.

In that case, a "push" of %rcx would be [using AT&T assembler]:

subq    $8,%rbx
movq    %rcx,0(%rbx)

And, a "pop" of %rcx would be:

movq    0(%rbx),%rcx
addq    $8,%rbx

To make it easier, I'm going to switch to C "pseudo code". Here are the above push/pop in pseudo code:

// push %ecx
    %rbx -= 8;
    0(%rbx) = %ecx;

// pop %ecx
    %ecx = 0(%rbx);
    %rbx += 8;

To create a thread, the LWP scheduler had to create a stack area using malloc. It then had to save this pointer in a per-thread struct, and then kick off the child LWP. The actual code is a bit tricky, assume we have an (e.g.) LWP_create function that is similar to pthread_create:

typedef void * (*LWP_func)(void *);

// per-thread control
typedef struct tsk tsk_t;
struct tsk {
    tsk_t *tsk_next;                    //
    tsk_t *tsk_prev;                    //
    void *tsk_stack;                    // stack base
    u64 tsk_regsave[16];
};

// list of tasks
typedef struct tsklist tsklist_t;
struct tsklist {
    tsk_t *tsk_next;                    //
    tsk_t *tsk_prev;                    //
};

tsklist_t tsklist;                      // list of tasks

tsk_t *tskcur;                          // current thread

// LWP_switch -- switch from one task to another
void
LWP_switch(tsk_t *to)
{

    // NOTE: we use (i.e.) burn register values as we do our work. in a real
    // implementation, we'd have to push/pop these in a special way. so, just
    // pretend that we do that ...

    // save all registers into tskcur->tsk_regsave
    tskcur->tsk_regsave[RAX] = %rax;
    // ...

    tskcur = to;

    // restore most registers from tskcur->tsk_regsave
    %rax = tskcur->tsk_regsave[RAX];
    // ...

    // set stack pointer to new task's stack
    %rsp = tskcur->tsk_regsave[RSP];

    // set resume address for task
    push(%rsp,tskcur->tsk_regsave[RIP]);

    // issue "ret" instruction
    ret();
}

// LWP_create -- start a new LWP
tsk_t *
LWP_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack
    tsknew->tsk_stack = malloc(0x100000)
    tsknew->tsk_regsave[RSP] = tsknew->tsk_stack;

    // give task its argument
    tsknew->tsk_regsave[RDI] = arg;

    // switch to new task
    LWP_switch(tsknew);

    return tsknew;
}

// LWP_destroy -- destroy an LWP
void
LWP_destroy(tsk_t *tsk)
{

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}

With a kernel that understands threads, we use pthread_create and clone, but we still have to create the new thread's stack. The kernel does not create/assign a stack for a new thread. The clone syscall accepts a child_stack argument. Thus, pthread_create must allocate a stack for the new thread and pass that to clone:

// pthread_create -- start a new native thread
tsk_t *
pthread_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack
    tsknew->tsk_stack = malloc(0x100000)

    // start up thread
    clone(start_routine,tsknew->tsk_stack,CLONE_THREAD,arg);

    return tsknew;
}

// pthread_join -- destroy an LWP
void
pthread_join(tsk_t *tsk)
{

    // wait for thread to die ...

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}

Only a process or main thread is assigned its initial stack by the kernel, usually at a high memory address. So, if the process does not use threads, normally, it just uses that pre-assigned stack.

But, if a thread is created, either an LWP or a native one, the starting process/thread must pre-allocate the area for the proposed thread with malloc. Side note: Using malloc is the normal way, but the thread creator could just have a large pool of global memory: char stack_area[MAXTASK][0x100000]; if it wished to do it that way.

If we had an ordinary program that does not use threads [of any type], it may wish to "override" the default stack it has been given.

That process could decide to use malloc and the above assembler trickery to create a much larger stack if it were doing a hugely recursive function.

See my answer here: What is the difference between user defined stack and built in stack in use of memory?

Best Answer

Related Solutions

Linux – Context switches much slower in new linux kernels

C++ – How are user-level threads scheduled/created, and how are kernel level threads created

Related Topic