Not every virtual (linear) address has to be mapped to anything. If code accesses an unmapped page, a page fault is raised.
A physical page can be mapped to several virtual addresses simultaneously.
The 4 GB of virtual memory is split into two sections: 0x00000000 .. 0xbfffffff is process virtual memory, and 0xc0000000 .. 0xffffffff is kernel virtual memory.
- How can the kernel map 896 MB from only 512 MB ?
It maps up to 896 MB. So, if you have only 512 MB, only 512 MB will be mapped.
If your physical memory occupies 0x00000000 to 0x20000000 (512 MB), it will be mapped for direct kernel access at virtual addresses 0xC0000000 to 0xE0000000 (a linear mapping).
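To make the linear mapping concrete, here is a minimal user-space sketch of the arithmetic, assuming the classic x86-32 PAGE_OFFSET of 0xC0000000. It mirrors what the kernel's __va()/__pa() macros do for directly mapped (lowmem) pages, but is purely illustrative:

```c
/* Sketch of the x86-32 direct ("linear") kernel mapping, assuming the
 * classic PAGE_OFFSET of 0xC0000000. Mirrors __va()/__pa() for lowmem. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000UL

static uintptr_t phys_to_virt(uintptr_t phys) { return phys + PAGE_OFFSET; }
static uintptr_t virt_to_phys(uintptr_t virt) { return virt - PAGE_OFFSET; }

int main(void)
{
    /* With 512 MB of RAM, physical 0x00000000..0x1FFFFFFF is mapped
     * linearly to virtual 0xC0000000..0xDFFFFFFF. */
    printf("phys 0x10000000 -> virt 0x%lx\n",
           (unsigned long)phys_to_virt(0x10000000UL));
    printf("virt 0xD0000000 -> phys 0x%lx\n",
           (unsigned long)virt_to_phys(0xD0000000UL));
    return 0;
}
```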
- What about user mode processes in this situation?
Physical memory for user processes is mapped (not sequentially, but with an arbitrary page-to-page mapping) to virtual addresses 0x0 .. 0xc0000000. For pages below 896 MB this is a second mapping, and the pages are taken from the free page lists.
- Where are user mode processes in phys RAM?
Anywhere.
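You can actually watch this arbitrary page-to-frame placement from user space through /proc/self/pagemap. Below is a minimal sketch that prints which physical frame (PFN) backs one of its own pages; note that since Linux 4.0 the PFN field reads as zero without CAP_SYS_ADMIN, so run it as root to see real frame numbers:

```c
/* Read /proc/self/pagemap to see which physical frame backs a virtual page. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    char *buf = malloc(psz);
    *(volatile char *)buf = 1;          /* touch the page so it gets mapped */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t entry;
    off_t off = ((uintptr_t)buf / psz) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) {
        perror("pread");
        return 1;
    }

    if (entry & (1ULL << 63))           /* bit 63: page present */
        printf("virt %p -> PFN 0x%llx\n", (void *)buf,
               (unsigned long long)(entry & ((1ULL << 55) - 1)));
    else
        printf("page not present\n");

    close(fd);
    return 0;
}
```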
- Every article explains only the situation when you've installed 4 GB of memory and the…
No. Every article explains how the 4 GB of virtual address space is mapped. The size of virtual memory is always 4 GB (for a 32-bit machine without memory extensions like PAE/PSE/etc. on x86).
As stated in section 8.1.3, Memory Zones, of the book Linux Kernel Development by Robert Love (I use the third edition), there are several zones of physical memory:
- ZONE_DMA - Contains page frames of memory below 16 MB
- ZONE_NORMAL - Contains page frames of memory at and above 16 MB and below 896 MB
- ZONE_HIGHMEM - Contains page frames of memory at and above 896 MB
So, if you have 512 MB, your ZONE_HIGHMEM will be empty, and ZONE_NORMAL will have 496 MB of physical memory mapped (512 MB minus the 16 MB of ZONE_DMA).
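As a toy illustration of those boundaries, here is a short sketch that classifies a physical address into the classic x86-32 zones; the 16 MB and 896 MB cut-offs are the ones from the list above, whereas a real kernel computes its zone limits at boot:

```c
/* Classify a physical address into the classic x86-32 memory zones. */
#include <stdio.h>

static const char *zone_of(unsigned long long phys)
{
    if (phys < 16ULL << 20)   return "ZONE_DMA";      /* below 16 MB   */
    if (phys < 896ULL << 20)  return "ZONE_NORMAL";   /* 16..896 MB    */
    return "ZONE_HIGHMEM";                            /* 896 MB and up */
}

int main(void)
{
    printf("%s\n", zone_of(8ULL << 20));     /* ZONE_DMA     */
    printf("%s\n", zone_of(512ULL << 20));   /* ZONE_NORMAL  */
    printf("%s\n", zone_of(1024ULL << 20));  /* ZONE_HIGHMEM */
    return 0;
}
```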
Also, take a look at section 2.5.5.2, Final kernel Page Table when RAM size is less than 896 MB, of the book. It is about the case when you have less than 896 MB of memory.
Also, for ARM there is a description of the virtual memory layout: http://www.mjmwired.net/kernel/Documentation/arm/memory.txt
The line PAGE_OFFSET high_memory-1 (line 63 of that file) is the direct-mapped part of memory.
- After calling switch_to(), the kernel stack is switched to that of the task named in next. Changing the address space, etc., is handled in e.g. context_switch().
- schedule() cannot be called in atomic context, including from an interrupt (see the check in schedule_debug()). If a reschedule is needed, the TIF_NEED_RESCHED task flag is set, which is checked in the interrupt return path.
- See 2.
- I believe that, with the default 8K stacks, interrupts are handled on whatever kernel stack is currently executing. If 4K stacks are used, I believe there's a separate interrupt stack (loaded automatically thanks to some x86 magic), but I'm not completely certain on that point.
To be a bit more detailed, here's a practical example (a simplified C sketch of the reschedule decision follows the list):
- An interrupt occurs. The CPU switches to an interrupt trampoline routine, which pushes the interrupt number onto the stack, then jmps to common_interrupt
- common_interrupt calls do_IRQ, which disables preemption then handles the IRQ
- At some point, a decision is made to switch tasks. This may come from the timer interrupt or from a wakeup call. In either case, set_tsk_need_resched is invoked, setting the TIF_NEED_RESCHED task flag.
- Eventually, the CPU returns from do_IRQ in the original interrupt and proceeds to the IRQ exit path. If this IRQ was invoked from within the kernel, it checks whether TIF_NEED_RESCHED is set, and if so calls preempt_schedule_irq, which briefly enables interrupts while performing a schedule().
- If the IRQ was invoked from userspace, we first check whether there's anything that needs doing prior to returning. If so, we go to retint_careful, which checks for a pending reschedule (directly invoking schedule() if needed) as well as for pending signals, then goes back for another round at retint_check until there are no more important flags set.
- Finally, we restore GS and return from the interrupt handler.
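Here is the simplified sketch promised above: a toy model, in plain C, of the flag-then-check mechanism this list describes. The names (need_resched, irq_exit_path, returning_to_user) are illustrative stand-ins, not the kernel's actual symbols; the real logic lives in assembly and in preempt_schedule_irq()/schedule():

```c
/* Toy model of "interrupt sets a flag; the exit path acts on it". */
#include <stdbool.h>
#include <stdio.h>

static volatile bool need_resched;       /* models TIF_NEED_RESCHED */

static void timer_interrupt(void)
{
    /* the handler itself never calls schedule(); it only sets the flag */
    need_resched = true;
}

static void irq_exit_path(bool returning_to_user)
{
    if (!need_resched)
        return;                          /* nothing to do, plain return */
    need_resched = false;
    if (returning_to_user)
        printf("return-to-user path: call schedule(), check signals\n");
    else
        printf("in-kernel path: call preempt_schedule_irq()\n");
}

int main(void)
{
    timer_interrupt();
    irq_exit_path(false);                /* IRQ arrived while in the kernel */
    timer_interrupt();
    irq_exit_path(true);                 /* IRQ arrived while in userspace */
    return 0;
}
```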
As for switch_to(): what switch_to() (on x86-32) does is:
- Save the current values of EIP (instruction pointer) and ESP (stack pointer) for when we return to this task at some point later.
- Switch the value of current_task. At this point, current now points to the new task.
- Switch to the new stack, then push the EIP saved by the task we're switching to onto the stack. Later, a return will be performed using this EIP as the return address; this is how it jumps back to the old code that previously called switch_to().
- Call __switch_to(). At this point, current points to the new task and we're on the new task's stack, but various other CPU state hasn't been updated yet. __switch_to() handles switching the state of things like the FPU, segment descriptors, debug registers, etc.
- Upon return from __switch_to(), the return address that switch_to() manually pushed onto the stack is returned to, placing execution back where it was prior to the switch_to() in the new task. Execution has now fully resumed on the switched-to task. (A user-space analogue using ucontext is sketched after this list.)
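As promised, a user-space analogue of the save-EIP/ESP-and-resume idea, using <ucontext.h>: swapcontext() saves the current instruction and stack pointers and resumes another context, much as switch_to() returns through the next task's saved EIP. This is an illustration only, not the kernel mechanism:

```c
/* User-space analogue of switch_to()'s save-and-resume trick. */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;

static void task_fn(void)
{
    printf("in 'task': running on its own stack\n");
    /* 'yield' back: save our IP/SP into task_ctx, restore main_ctx */
    swapcontext(&task_ctx, &main_ctx);
    printf("in 'task': resumed exactly where we left off\n");
}

int main(void)
{
    char *stack = malloc(64 * 1024);

    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = 64 * 1024;
    task_ctx.uc_link = &main_ctx;          /* where to go when task_fn returns */
    makecontext(&task_ctx, task_fn, 0);

    printf("in main: switching to task\n");
    swapcontext(&main_ctx, &task_ctx);     /* like switch_to(main, task) */
    printf("in main: task yielded, switching again\n");
    swapcontext(&main_ctx, &task_ctx);     /* resume task after its yield */
    printf("in main: task finished\n");

    free(stack);
    return 0;
}
```

Each swapcontext() call resumes the other side exactly after its own last save point, which is the same trick switch_to() plays with the pushed EIP.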
x86-64 is very similar, but has to do slightly more saving/restoration of state due to the different ABI.
Best Answer
ARM SMP systems support two types of interrupts: SPIs (shared peripheral interrupts) and PPIs (private peripheral interrupts). A PPI is a per-CPU interrupt source. A special SMP case of the PPI is the SGI (software-generated interrupt); this is a CPU-to-CPU interrupt used to signal from one CPU to another in the SMP world (called an IPI).Note1
A PPI timer can be used to allow each CPU to do 'tickless scheduling'; that is, timer interrupts are scheduled via knowledge of future time events (google timing wheel, look at the NO_HZ documentation, etc.). The current Linux kernel doesn't use this specific PPI timer for scheduling; it is only used as a delay-loop time source. Instead, the global PPI timer is used. This timer can interrupt each CPU selectively, but its register set is global to all CPUs. A particular CPU may schedule an interrupt for itself, with the time base being global.

The complication is that tasks must be migrated from one CPU to another in order to balance work among CPUs. Also, the Linux kernel's core code/scheduler is written for many CPUs (and architectures), which may not have these per-CPU interrupt sources. A definitive answer may depend on your kernel version and the scheduler used (or, more generally, on the kernel configuration). Generally, a busy CPU will do the migration; other CPUs may wake on a timer tick just to see whether a task in their set should run (maybe a migrated process). If NO_HZ is in effect, some CPUs may not wake at all; they will get an IPI in the case of migration.

In any case, there is nothing ARM-specific in the CPU scheduling besides the clock source. It is possible for an ARM SMP system not to have a global PPI timer. In that case, every CPU may wake to service an interrupt, but the majority may go back to sleep immediately. This could happen on any system due to a bad timer/interrupt-controller design or a bad system configuration. However, even in these cases, the code would not call into the scheduler except where needed.
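To see the IPI path from the kernel side, here is a hedged sketch of a minimal module that asks another CPU to run a function via the generic smp_call_function_single() API; on a GIC-based ARM SMP system that request goes out as an SGI. The module itself and its target CPU choice are illustrative, not taken from any real driver:

```c
/* Minimal, illustrative module: run a function on another CPU via IPI. */
#include <linux/module.h>
#include <linux/smp.h>

static void remote_fn(void *info)
{
    /* runs in IPI context on the target CPU */
    pr_info("ipi-demo: running on CPU %d\n", smp_processor_id());
}

static int __init ipi_demo_init(void)
{
    int target = 1;                      /* illustrative target CPU */

    if (num_online_cpus() < 2)
        return -ENODEV;                  /* need a second CPU to signal */

    /* send an IPI to 'target' and wait for remote_fn to finish there */
    smp_call_function_single(target, remote_fn, NULL, 1);
    return 0;
}

static void __exit ipi_demo_exit(void) { }

module_init(ipi_demo_init);
module_exit(ipi_demo_exit);
MODULE_LICENSE("GPL");
```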
See: Linux Scheduler on SMP (which may be a duplicate, although the answer is not great IMO), IBM's Completely Fair Scheduler article, and O'Reilly's Linux Kernel scheduler chapter.
Note1: This is actually GIC (generic interrupt controller) terminology. However, most ARM SMP systems use this interrupt controller. It is bundled with Cortex-A CPUs and came as an external soft component for some ARMv6 systems. It is possible for an ARM SMP system to use another controller, but that is probably extremely rare or non-existent.
Edit: There are two ARM on-chip timers; these are useful because every Cortex-A has them, as opposed to SoC-vendor timers. One of them is used instead of a 'counting loop' for a delay; this works better in the presence of interrupts. I don't think it is critical to understanding SMP scheduling; you may ignore that comment and just know that that source file is not used for scheduling. It was the first one I looked at. If you find it really distracting, I will remove that information.
See this paper on timing wheels; it is about IP/networking, but the concept behind NO_HZ is similar: don't interrupt every 10 ms just to increment ticks. In the NO_HZ case, each CPU can set a future wake-up time based on what sort of requests drivers and sub-systems have given. E.g., if schedule_work() needs to run in 175 ms, the timer is set to that value for the CPU and we don't wake up 17 times (if the system tick is 10 ms) but just increment ticks by 17 when we do. Some CPUs may need a timeout to evict the current process and run another for multi-tasking as well, so the scheduler itself may set a timer.
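As a concrete version of the 175 ms example, the sketch below arms delayed work with schedule_delayed_work(); under NO_HZ, the CPU owning that timer can sleep until roughly the deadline instead of taking 17 useless ticks. A minimal, illustrative module:

```c
/* Minimal, illustrative module: arm a ~175 ms delayed-work timer. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void my_work_fn(struct work_struct *w)
{
    pr_info("delayed work ran at jiffies=%lu\n", jiffies);
}

static DECLARE_DELAYED_WORK(my_work, my_work_fn);

static int __init nohz_demo_init(void)
{
    /* set a timer ~175 ms out; with NO_HZ the CPU can sleep until then */
    schedule_delayed_work(&my_work, msecs_to_jiffies(175));
    return 0;
}

static void __exit nohz_demo_exit(void)
{
    cancel_delayed_work_sync(&my_work);
}

module_init(nohz_demo_init);
module_exit(nohz_demo_exit);
MODULE_LICENSE("GPL");
```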