How do system calls work

compiler-constructioninterruptoperating systemprocesssystem-calls

I understand that a user can own a process and each process has an address space (which contains valid memory locations, this process can reference). I know that a process can call a system call and pass parameters to it, just like any other library function. This seems to suggest that all system calls are in a process address space by sharing memory, etc. But perhaps, this is only an illusion created by the fact that in high level programming language, system calls look like any other function, when a process calls it.

But, now let me take a step deeper and analyze more closely on what happens under the hood. How does compiler compile a system call? It perhaps pushes the system call name and parameters supplied by the process in a stack and then put the assembly instruction say "TRAP" or something — basically the assembly instruction to call a software interrupt.

This TRAP assembly instruction is executed by hardware by first toggling the mode bit from user to kernel and then setting the code pointer to say beginning of interrupt service routines. From this point on, the ISR executes in kernel mode, which picks up the parameters from the stack (this is possible, because kernel has access to any memory location, even the ones owned by user processes) and executes the system call and in the end relinquishes the CPU, which again toggles the mode bit and the user process starts from where it left off.

Is my understanding correct?

Attached is rough diagram of my understanding:
enter image description here

Best Answer

Your understanding is pretty close; the trick is that most compilers will never write system calls, because the functions that programs call (e.g. getpid(2), chdir(2), etc.) are actually provided by the standard C library. The standard C library contains the code for the system call, whether it is called via INT 0x80 or SYSENTER. It'd be a strange program that makes system calls without a library doing the work. (Even though perl provides a syscall() function that can directly make system calls! Crazy, right?)

Next, the memory. The operating system kernel sometimes has easy address-space access to the user process memory. Of course, protection modes are different, and user-supplied data must be copied into the kernel's protected address space to prevent modification of user-supplied data while the system call is in flight:

static int do_getname(const char __user *filename, char *page)
{
    int retval;
    unsigned long len = PATH_MAX;

    if (!segment_eq(get_fs(), KERNEL_DS)) {
        if ((unsigned long) filename >= TASK_SIZE)
            return -EFAULT;
        if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
            len = TASK_SIZE - (unsigned long) filename;
    }

    retval = strncpy_from_user(page, filename, len);
    if (retval > 0) {
        if (retval < len)
            return 0;
        return -ENAMETOOLONG;
    } else if (!retval)
        retval = -ENOENT;
    return retval;
}

This, while it isn't a system call itself, is a helper function called by system call functions that copies filenames into the kernel's address space. It checks to make sure that the entire filename resides within the user's data range, calls a function that copies the string in from user space, and performs some sanity checks before the returning.

get_fs() and similar functions are remnants from Linux's x86-roots. The functions have working implementations for all architectures, but the names remain archaic.

All the extra work with segments is because the kernel and userspace might share some portion of the available address space. On a 32-bit platform (where the numbers are easy to comprehend), the kernel will typically have one gigabyte of virtual address space, and user processes will typically have three gigabytes of virtual address space.

When a process calls into the kernel, the kernel will 'fix up' the page table permissions to allow it access to the whole range, and gets the benefit of pre-filled TLB entries for user-provided memory. Great success. But when the kernel must context switch back to userspace, it has to flush the TLB to remove the cached privileges on kernel address space pages.

But the trick is, one gigabyte of virtual address space is not sufficient for all kernel data structures on huge machines. Maintaining the metadata of cached filesystems and block device drivers, networking stacks, and the memory mappings for all the processes on the system, can take a huge amount of data.

So different 'splits' are available: two gigs for user, two gigs for kernel, one gig for user, three gigs for kernel, etc. As the space for the kernel goes up, the space for user processes goes down. So there is a 4:4 memory split that gives four gigabytes to the user process, four gigabytes to the kernel, and the kernel must fiddle with segment descriptors to be able to access user memory. The TLB is flushed entering and exiting system calls, which is a pretty significant speed penalty. But it lets the kernel maintain significantly larger data structures.

The much larger page tables and address ranges of 64 bit platforms probably makes all the preceding look quaint. I sure hope so, anyway.

Related Solutions

Python – How to safely create a nested directory in Python

On Python ≥ 3.5, use pathlib.Path.mkdir:

from pathlib import Path
Path("/my/directory").mkdir(parents=True, exist_ok=True)

For older versions of Python, I see two answers with good qualities, each with a small flaw, so I will give my take on it:

Try os.path.exists, and consider os.makedirs for the creation.

import os
if not os.path.exists(directory):
    os.makedirs(directory)

As noted in comments and elsewhere, there's a race condition – if the directory is created between the os.path.exists and the os.makedirs calls, the os.makedirs will fail with an OSError. Unfortunately, blanket-catching OSError and continuing is not foolproof, as it will ignore a failure to create the directory due to other factors, such as insufficient permissions, full disk, etc.

One option would be to trap the OSError and examine the embedded error code (see Is there a cross-platform way of getting information from Python’s OSError):

import os, errno

try:
    os.makedirs(directory)
except OSError as e:
    if e.errno != errno.EEXIST:
        raise

Alternatively, there could be a second os.path.exists, but suppose another created the directory after the first check, then removed it before the second one – we could still be fooled.

Depending on the application, the danger of concurrent operations may be more or less than the danger posed by other factors such as file permissions. The developer would have to know more about the particular application being developed and its expected environment before choosing an implementation.

Modern versions of Python improve this code quite a bit, both by exposing FileExistsError (in 3.3+)...

try:
    os.makedirs("path/to/directory")
except FileExistsError:
    # directory already exists
    pass

...and by allowing a keyword argument to os.makedirs called exist_ok (in 3.2+).

os.makedirs("path/to/directory", exist_ok=True)  # succeeds even if directory exists.

Private bytes, virtual bytes, working set

The short answer to this question is that none of these values are a reliable indicator of how much memory an executable is actually using, and none of them are really appropriate for debugging a memory leak.

Private Bytes refer to the amount of memory that the process executable has asked for - not necessarily the amount it is actually using. They are "private" because they (usually) exclude memory-mapped files (i.e. shared DLLs). But - here's the catch - they don't necessarily exclude memory allocated by those files. There is no way to tell whether a change in private bytes was due to the executable itself, or due to a linked library. Private bytes are also not exclusively physical memory; they can be paged to disk or in the standby page list (i.e. no longer in use, but not paged yet either).

Working Set refers to the total physical memory (RAM) used by the process. However, unlike private bytes, this also includes memory-mapped files and various other resources, so it's an even less accurate measurement than the private bytes. This is the same value that gets reported in Task Manager's "Mem Usage" and has been the source of endless amounts of confusion in recent years. Memory in the Working Set is "physical" in the sense that it can be addressed without a page fault; however, the standby page list is also still physically in memory but not reported in the Working Set, and this is why you might see the "Mem Usage" suddenly drop when you minimize an application.

Virtual Bytes are the total virtual address space occupied by the entire process. This is like the working set, in the sense that it includes memory-mapped files (shared DLLs), but it also includes data in the standby list and data that has already been paged out and is sitting in a pagefile on disk somewhere. The total virtual bytes used by every process on a system under heavy load will add up to significantly more memory than the machine actually has.

So the relationships are:

Private Bytes are what your app has actually allocated, but include pagefile usage;
Working Set is the non-paged Private Bytes plus memory-mapped files;
Virtual Bytes are the Working Set plus paged Private Bytes and standby list.

There's another problem here; just as shared libraries can allocate memory inside your application module, leading to potential false positives reported in your app's Private Bytes, your application may also end up allocating memory inside the shared modules, leading to false negatives. That means it's actually possible for your application to have a memory leak that never manifests itself in the Private Bytes at all. Unlikely, but possible.

Private Bytes are a reasonable approximation of the amount of memory your executable is using and can be used to help narrow down a list of potential candidates for a memory leak; if you see the number growing and growing constantly and endlessly, you would want to check that process for a leak. This cannot, however, prove that there is or is not a leak.

One of the most effective tools for detecting/correcting memory leaks in Windows is actually Visual Studio (link goes to page on using VS for memory leaks, not the product page). Rational Purify is another possibility. Microsoft also has a more general best practices document on this subject. There are more tools listed in this previous question.

I hope this clears a few things up! Tracking down memory leaks is one of the most difficult things to do in debugging. Good luck.

Best Answer

Related Solutions

Python – How to safely create a nested directory in Python

Private bytes, virtual bytes, working set

Related Topic