How do the schedule() and switch_to() functions in the Linux kernel actually work?

context-switch, linux-kernel

I'm trying to understand how the scheduling process in the Linux kernel actually works. My question is not about the scheduling algorithm. It's about how the functions schedule() and switch_to() work.

I'll try to explain. I saw that:

When a process runs out of time-slice, the flag need_resched is set by scheduler_tick(). The kernel checks the flag, sees that it is set, and calls schedule() (pertinent to question 1) to switch to a new process. This flag is a message that schedule should be invoked as soon as possible because another process deserves to run.
Upon returning to user-space or returning from an interrupt, the need_resched flag is checked. If it is set, the kernel invokes the scheduler before continuing.
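
To make the two observations above concrete, here is a minimal user-space sketch in C. It is not kernel code: `toy_task`, `toy_scheduler_tick()`, `toy_return_path()` and `toy_ticks_until_resched()` are hypothetical names invented for illustration. The only point it tries to show is the split of responsibilities: the tick handler merely sets a flag, and the actual call to schedule() happens later, at the return-path check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for a task; NOT task_struct. */
struct toy_task {
    int  time_slice;    /* remaining timer ticks */
    bool need_resched;  /* stands in for the need_resched flag */
};

/* Models the role of scheduler_tick(): charge one tick and, when the
 * slice is used up, set the flag. It never switches tasks itself. */
static void toy_scheduler_tick(struct toy_task *t)
{
    if (t->time_slice > 0 && --t->time_slice == 0)
        t->need_resched = true;
}

/* Models the check made when returning from an interrupt or to user
 * space. Returns true if schedule() would be invoked at this point. */
static bool toy_return_path(struct toy_task *t)
{
    if (!t->need_resched)
        return false;
    t->need_resched = false;  /* assumption: the flag is cleared once the scheduler runs */
    return true;
}

/* Counts how many timer ticks pass before the return path reschedules. */
static int toy_ticks_until_resched(int slice)
{
    struct toy_task t = { slice, false };
    int ticks = 0;

    assert(slice > 0);  /* a zero slice would never set the flag in this model */
    do {
        toy_scheduler_tick(&t);  /* timer interrupt fires */
        ticks++;
    } while (!toy_return_path(&t));  /* checked on the way out of the interrupt */
    return ticks;
}
```

With a slice of 3 ticks, `toy_ticks_until_resched(3)` returns 3: the flag is set during the third tick and acted on at the very next return-path check.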

Looking into the kernel source (linux-2.6.10, the version that the book "Linux Kernel Development, second edition" is based on), I also saw that some code can call the schedule() function voluntarily, giving another process the right to run.
I saw that the function switch_to() is the one that actually does the context switch. I looked into some architecture-dependent code, trying to understand what switch_to() was actually doing.

That behavior raised some questions that I could not find the answers to:

1. When switch_to() finishes, what is the currently running process? The process that called schedule()? Or the next process, the one that was picked to run?
2. When schedule() gets called by an interrupt, does the selected process start to run when the interrupt handling finishes (after some kind of RTE), or before that?
3. If the schedule() function cannot be called from an interrupt, when is the need_resched flag set?
4. When the timer interrupt handler is working, what stack is being used?

I don't know if I made myself clear. If not, I hope I can clarify after some answers (or questions).
I have already looked at several sources trying to understand that process. I have the book "Linux Kernel Development, second edition", and I'm using it too.
I know a bit about the MIPS and H8300 architectures, if that helps to explain.

Best Answer

1. After calling switch_to(), the kernel stack is switched to that of the task named in next. Changing the address space, etc., is handled in e.g. context_switch().
2. schedule() cannot be called in atomic context, including from an interrupt (see the check in schedule_debug()). If a reschedule is needed, the TIF_NEED_RESCHED task flag is set, which is checked in the interrupt return path.
3. See 2.
4. I believe that, with the default 8K stacks, interrupts are handled on whatever kernel stack is currently executing. If 4K stacks are used, I believe there's a separate interrupt stack (automatically loaded thanks to some x86 magic), but I'm not completely certain on that point.
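
As a rough illustration of the second answer above, here is a toy model in C of "schedule() refuses to run in atomic context, so the interrupt only sets a flag". Everything prefixed `toy_` is a hypothetical name made up for this sketch, and the single counter is only a stand-in for the kernel's preempt count; the real check is more involved.

```c
#include <assert.h>
#include <stdbool.h>

static int  toy_preempt_count;  /* nonzero means "atomic context" in this model */
static bool toy_need_resched;   /* stands in for TIF_NEED_RESCHED */

static void toy_irq_enter(void) { toy_preempt_count++; }
static void toy_irq_exit(void)  { toy_preempt_count--; }

/* Models schedule() with its debug check. Returns false where the real
 * kernel would complain about "scheduling while atomic". */
static bool toy_schedule(void)
{
    if (toy_preempt_count != 0)
        return false;          /* refused: we are in atomic context */
    toy_need_resched = false;  /* a task switch would happen here */
    return true;
}

/* Models a timer interrupt that decides a reschedule is needed.
 * Returns true if schedule() succeeded from INSIDE the handler
 * (which, in this model, it never does). */
static bool toy_timer_interrupt(void)
{
    bool scheduled_inside;

    toy_irq_enter();
    toy_need_resched = true;            /* only record the request */
    scheduled_inside = toy_schedule();  /* illegal here, so this fails */
    toy_irq_exit();
    return scheduled_inside;
}

/* Models the interrupt return path: now that the handler is done,
 * act on the pending request. Returns true if a reschedule happened. */
static bool toy_irq_return_path(void)
{
    return toy_need_resched && toy_schedule();
}
```

Calling `toy_timer_interrupt()` returns false and leaves the flag set; a following `toy_irq_return_path()` returns true and clears it. That ordering is the answer to questions 2 and 3 in miniature.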

To be a bit more detailed, here's a practical example:

1. An interrupt occurs. The CPU switches to an interrupt trampoline routine, which pushes the interrupt number onto the stack, then jumps to common_interrupt.
2. common_interrupt calls do_IRQ, which disables preemption and then handles the IRQ.
3. At some point, a decision is made to switch tasks. This may be from the timer interrupt, or from a wakeup call. In either case, set_tsk_need_resched is invoked, setting the TIF_NEED_RESCHED task flag.
4. Eventually, the CPU returns from do_IRQ in the original interrupt, and proceeds to the IRQ exit path. If this IRQ was invoked from within the kernel, it checks whether TIF_NEED_RESCHED is set, and if so calls preempt_schedule_irq, which briefly enables interrupts while performing a schedule().
5. If the IRQ was invoked from userspace, we first check whether there's anything that needs doing prior to returning. If so, we go to retint_careful, which checks for a pending reschedule (directly invoking schedule() if needed) and for pending signals, then goes back for another round at retint_check until no more important flags are set.
6. Finally, we restore GS and return from the interrupt handler.
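
The exit-path logic in the walk-through above can be sketched in plain C as follows. This is only a model: the real code is assembly, the `TOY_` bit values and `toy_` function names are hypothetical, and the `preempt_count == 0` condition on the kernel path is something I am adding from my understanding of preemptible kernels rather than something stated above.

```c
#include <assert.h>

#define TOY_NEED_RESCHED (1u << 0)  /* hypothetical bit positions */
#define TOY_SIGPENDING   (1u << 1)

struct toy_counts {
    int schedules;  /* how many times schedule() would have run */
    int signals;    /* how many times signal delivery would have run */
};

/* IRQ arrived while in kernel mode: preempt only if a reschedule is
 * pending and (assumption) preemption is not currently disabled. */
static void toy_exit_to_kernel(unsigned *flags, int preempt_count,
                               struct toy_counts *c)
{
    if ((*flags & TOY_NEED_RESCHED) && preempt_count == 0) {
        *flags &= ~TOY_NEED_RESCHED;
        c->schedules++;  /* stands in for preempt_schedule_irq() */
    }
}

/* IRQ arrived while in user mode: keep looping until no work flag is
 * left, in the spirit of retint_check / retint_careful. */
static void toy_exit_to_user(unsigned *flags, struct toy_counts *c)
{
    while (*flags & (TOY_NEED_RESCHED | TOY_SIGPENDING)) {
        if (*flags & TOY_NEED_RESCHED) {
            *flags &= ~TOY_NEED_RESCHED;
            c->schedules++;  /* stands in for schedule() */
            continue;        /* go back and re-check the flags */
        }
        *flags &= ~TOY_SIGPENDING;
        c->signals++;        /* stands in for signal delivery */
    }
}
```

On the user path both kinds of pending work are drained before returning; on the kernel path only a reschedule is considered, and pending signals wait until the task next heads back to user space.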

As for switch_to(), what it does (on x86-32) is:

1. Save the current values of EIP (instruction pointer) and ESP (stack pointer) for when we return to this task at some point later.
2. Switch the value of current_task. At this point, current now points to the new task.
3. Switch to the new stack, then push the EIP saved by the task we're switching to onto the stack. Later, a return will be performed, using this EIP as the return address; this is how it jumps back to the old code that previously called switch_to().
4. Call __switch_to(). At this point, current points to the new task, and we're on the new task's stack, but various other CPU state hasn't been updated. __switch_to() handles switching the state of things like the FPU, segment descriptors, debug registers, etc.
5. Upon return from __switch_to(), the return address that switch_to() manually pushed onto the stack is returned to, placing execution back where it was prior to the switch_to() in the new task. Execution has now fully resumed on the switched-to task.

x86-64 is very similar, but has to do slightly more saving/restoration of state due to the different ABI.
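
The effect described above, that switch_to() "returns" into a different task than the one that called it, can be reproduced in user space. The sketch below is an analogy, not the kernel's mechanism: it uses the POSIX `getcontext()`/`makecontext()`/`swapcontext()` calls (assuming a glibc-based Linux system, where they are available) to save one stack pointer and instruction pointer and load another. All `toy_` names are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define TOY_STACK_SIZE (64 * 1024)

static ucontext_t toy_ctx_main, toy_ctx_a, toy_ctx_b;
static char toy_stack_a[TOY_STACK_SIZE], toy_stack_b[TOY_STACK_SIZE];
static const char *toy_current = "main";  /* plays the role of "current" */
static char toy_trace[64];                /* records the order of execution */

static void toy_log(const char *s)
{
    strncat(toy_trace, s, sizeof toy_trace - strlen(toy_trace) - 1);
}

/* Update "current", then save prev's registers and load next's.
 * The call returns only when some other task later switches back. */
static void toy_switch_to(ucontext_t *prev, ucontext_t *next, const char *name)
{
    toy_current = name;
    swapcontext(prev, next);
}

static void toy_task_a(void)
{
    toy_log("A1 ");
    toy_switch_to(&toy_ctx_a, &toy_ctx_b, "B");
    assert(strcmp(toy_current, "A") == 0);  /* B switched back to us */
    toy_log("A2 ");
    toy_switch_to(&toy_ctx_a, &toy_ctx_b, "B");
}

static void toy_task_b(void)
{
    toy_log("B1 ");
    toy_switch_to(&toy_ctx_b, &toy_ctx_a, "A");
    assert(strcmp(toy_current, "B") == 0);  /* A switched back to us */
    toy_log("B2 ");
    /* Returning ends this context; uc_link resumes toy_ctx_main. */
}

static void toy_make_task(ucontext_t *ctx, char *stack, void (*fn)(void))
{
    getcontext(ctx);
    ctx->uc_stack.ss_sp = stack;
    ctx->uc_stack.ss_size = TOY_STACK_SIZE;
    ctx->uc_link = &toy_ctx_main;
    makecontext(ctx, fn, 0);
}

/* Runs the two tasks and returns the recorded execution order. */
static const char *toy_run_demo(void)
{
    toy_trace[0] = '\0';
    toy_make_task(&toy_ctx_a, toy_stack_a, toy_task_a);
    toy_make_task(&toy_ctx_b, toy_stack_b, toy_task_b);
    toy_switch_to(&toy_ctx_main, &toy_ctx_a, "A");
    toy_current = "main";
    return toy_trace;
}
```

Each task resumes exactly after its own earlier `toy_switch_to()` call, on its own stack, with `toy_current` already naming it. That mirrors the answer to question 1: once the switch completes, the running task is the one that was picked, continuing from wherever it last switched away.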
