Why isn't there a generic batching syscall in Linux/BSD?

bsd, c, linux-development

Background:

System call overhead is much larger than function call overhead (estimates range from 20-100x) mostly due to context switching from user space to kernel space and back. It is common to inline functions to save function call overhead and function calls are much cheaper than syscalls. It stands to reason that developers would want to avoid some of the system call overhead by taking care of as much in-kernel operation in one syscall as possible.

Problem:

This has created a lot of (superfluous?) system calls like sendmmsg(), recvmmsg() as well as the chdir, open, lseek and/or symlink combinations like: openat, mkdirat, mknodat, fchownat, futimesat, newfstatat, unlinkat, fchdir, ftruncate, fchmod, renameat, linkat, symlinkat, readlinkat, fchmodat, faccessat, lsetxattr, fsetxattr, execveat, lgetxattr, llistxattr, lremovexattr, fremovexattr, flistxattr, fgetxattr, pread, pwrite etc…

Now Linux has added copy_file_range(), which apparently combines the read, lseek and write syscalls. It's only a matter of time before this becomes fcopy_file_range(), lcopy_file_range(), copy_file_rangeat(), fcopy_file_rangeat() and lcopy_file_rangeat()…but since there are 2 files involved instead of X more calls, it could become X^2 more. OK, Linus and the various BSD developers wouldn't let it go that far, but my point is that if there were a batching syscall, all (most?) of these could be implemented in user space, reducing kernel complexity without adding much, if any, overhead on the libc side.

Many complex solutions have been proposed that include some form of special syscall thread that batch-processes non-blocking syscalls; however, these methods add significant complexity to both the kernel and user space, in much the same way as libxcb vs. libX11 (the asynchronous calls require a lot more setup).

Solution?:

A generic batching syscall. This would alleviate the largest cost (multiple mode switches) without the complexities associated with having a specialized kernel thread (though that functionality could be added later).

There is basically already a good basis for a prototype in the socketcall() syscall. Just extend it from taking an array of arguments to instead take an array of returns, a pointer to arrays of arguments (which include the syscall number), the number of syscalls and a flags argument… something like:

batch(void *returns, void *args, long ncalls, long flags);

One major difference would be that the arguments would probably all need to be pointers, for simplicity, so that the results of prior syscalls could be used by subsequent syscalls (for instance, the file descriptor from open() for use in read()/write()).

Some possible advantages:

less user space -> kernel space -> user space switching
possible compiler switch -fcombine-syscalls to try to batch automagically
optional flag for asynchronous operation (return fd to watch immediately)
ability to implement future combined syscall functions in userspace

Question:

Is it feasible to implement a batching syscall?

Am I missing some obvious gotchas?
Am I overestimating the benefits?

Is it worthwhile for me to bother implementing a batching syscall (I don't work at Intel, Google or Red Hat)?

I have patched my own kernel before, but dread dealing with the LKML.
History has shown that even if something is widely useful to "normal" users (non-corporate end users without git write access), it may never get accepted upstream (unionfs, aufs, cryptodev, tuxonice, etc…)


Best Answer

I tried this on x86_64

Patch against 94836ecf1e7378b64d37624fbb81fe48fbd4c772: (also here https://github.com/pskocik/linux/tree/supersyscall )

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..8df2e98eb403 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330    common  pkey_alloc      sys_pkey_alloc
 331    common  pkey_free       sys_pkey_free
 332    common  statx           sys_statx
+333    common  supersyscall            sys_supersyscall

 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 980c3c9b06f8..c61c14e3ff4e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -905,5 +905,20 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
 asmlinkage long sys_pkey_free(int pkey);
 asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
              unsigned mask, struct statx __user *buffer);
-
 #endif
+
+struct supersyscall_args {
+    unsigned call_nr;
+    long     args[6];
+};
+#define SUPERSYSCALL__abort_on_failure    0
+#define SUPERSYSCALL__continue_on_failure 1
+/*#define SUPERSYSCALL__lock_something    2?*/
+
+
+asmlinkage 
+long 
+sys_supersyscall(long* Rets, 
+                 struct supersyscall_args *Args, 
+                 int Nargs, 
+                 int Flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a076cf1a3a23..56184b84530f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -732,9 +732,11 @@ __SYSCALL(__NR_pkey_alloc,    sys_pkey_alloc)
 __SYSCALL(__NR_pkey_free,     sys_pkey_free)
 #define __NR_statx 291
 __SYSCALL(__NR_statx,     sys_statx)
+#define __NR_supersyscall 292
+__SYSCALL(__NR_supersyscall,     sys_supersyscall)

 #undef __NR_syscalls
-#define __NR_syscalls 292
+#define __NR_syscalls (__NR_supersyscall+1)

 /*
  * All syscalls below here should go away really,
diff --git a/init/Kconfig b/init/Kconfig
index a92f27da4a27..25f30bf0ebbb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2184,4 +2184,9 @@ config ASN1
      inform it as to what tags are to be expected in a stream and what
      functions to call on what tags.

+config SUPERSYSCALL
+     bool
+     help
+        System call for batching other system calls
+
 source "kernel/Kconfig.locks"
diff --git a/kernel/Makefile b/kernel/Makefile
index b302b4731d16..4d86bcf90f90 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = fork.o exec_domain.o panic.o \
        extable.o params.o \
        kthread.o sys_ni.o nsproxy.o \
        notifier.o ksysfs.o cred.o reboot.o \
-       async.o range.o smpboot.o ucount.o
+       async.o range.o smpboot.o ucount.o supersyscall.o

 obj-$(CONFIG_MULTIUSER) += groups.o

diff --git a/kernel/supersyscall.c b/kernel/supersyscall.c
new file mode 100644
index 000000000000..d7fac5d3f970
--- /dev/null
+++ b/kernel/supersyscall.c
@@ -0,0 +1,83 @@
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <linux/compiler.h>
+#include <linux/sched/signal.h>
+
+/*TODO: do this properly*/
+/*#include <uapi/asm-generic/unistd.h>*/
+#ifndef __NR_syscalls
+# define __NR_syscalls (__NR_supersyscall+1)
+#endif
+
+#define uif(Cond)  if(unlikely(Cond))
+#define lif(Cond)  if(likely(Cond))
+ 
+
+typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long,
+                     unsigned long, unsigned long,
+                     unsigned long, unsigned long);
+extern const sys_call_ptr_t sys_call_table[];
+
+static bool 
+syscall__failed(unsigned long Ret)
+{
+   return (Ret > -4096UL);
+}
+
+
+static long
+syscall(unsigned Nr, long A[6])
+{
+    uif (Nr >= __NR_syscalls )
+        return -ENOSYS;
+    return sys_call_table[Nr](A[0], A[1], A[2], A[3], A[4], A[5]);
+}
+
+
+static int 
+segfault(void const *Addr)
+{
+    struct siginfo info[1];
+    info->si_signo = SIGSEGV;
+    info->si_errno = 0;
+    info->si_code = 0;
+    info->si_addr = (void*)Addr;
+    return send_sig_info(SIGSEGV, info, current);
+    //return force_sigsegv(SIGSEGV, current);
+}
+
+asmlinkage long /*Ntried*/
+sys_supersyscall(long* Rets, 
+                 struct supersyscall_args *Args, 
+                 int Nargs, 
+                 int Flags)
+{
+    int i = 0, nfinished = 0;
+    struct supersyscall_args args; /*7 * sizeof(long) */
+    
+    for (i = 0; i<Nargs; i++){
+        long ret;
+
+        uif (0!=copy_from_user(&args, Args+i, sizeof(args))){
+            segfault(Args+i);
+            return nfinished;
+        }
+
+        ret = syscall(args.call_nr, args.args);
+        nfinished++;
+
+        if ((Flags & 1) == SUPERSYSCALL__abort_on_failure 
+                &&  syscall__failed(ret))
+            return nfinished;
+
+
+        uif (0!=put_user(ret, Rets+i)){
+            segfault(Rets+i);
+            return nfinished;
+        }
+    }
+    return nfinished;
+
+}
+
+
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 8acef8576ce9..c544883d7a13 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -258,3 +258,5 @@ cond_syscall(sys_membarrier);
 cond_syscall(sys_pkey_mprotect);
 cond_syscall(sys_pkey_alloc);
 cond_syscall(sys_pkey_free);
+
+cond_syscall(sys_supersyscall);

And it appears to work -- I can write hello to fd 1 and world to fd 2 with just one syscall:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>


struct supersyscall_args {
    unsigned  call_nr;
    long args[6];
};
#define SUPERSYSCALL__abort_on_failure    0
#define SUPERSYSCALL__continue_on_failure 1

long 
supersyscall(long* Rets, 
                 struct supersyscall_args *Args, 
                 int Nargs, 
                 int Flags);

int main(int c, char**v)
{
    puts("HELLO WORLD:");
    long r=0;
    struct supersyscall_args args[] = { 
        {SYS_write, {1, (long)"hello\n", 6 }},
        {SYS_write, {2, (long)"world\n", 6 }},
    };
    long rets[sizeof args / sizeof args[0]];

    r = supersyscall(rets, 
                     args,
                     sizeof(rets)/sizeof(rets[0]), 
                     0);
    printf("r=%ld\n", r);
    printf( 0>r ? "%m\n" : "\n");

    puts("");
#if 1

#if SEGFAULT 
    r = supersyscall(0, 
                     args,
                     sizeof(rets)/sizeof(rets[0]), 
                     0);
    printf("r=%ld\n", r);
    printf( 0>r ? "%m\n" : "\n");
#endif
#endif
    return 0;
}

long 
supersyscall(long* Rets, 
                 struct supersyscall_args *Args, 
                 int Nargs, 
                 int Flags)
{
    return syscall(333, Rets, Args, Nargs, Flags);
}

Basically I'm using:

long a_syscall(long, long, long, long, long, long);

as a universal syscall prototype, which appears to be how things work on x86_64, so my "super" syscall is:

struct supersyscall_args {
    unsigned call_nr;
    long     args[6];
};
#define SUPERSYSCALL__abort_on_failure    0
#define SUPERSYSCALL__continue_on_failure 1
/*#define SUPERSYSCALL__lock_something    2?*/

asmlinkage 
long 
sys_supersyscall(long* Rets, 
                 struct supersyscall_args *Args, 
                 int Nargs, 
                 int Flags);

It returns the number of syscalls tried (==Nargs if the SUPERSYSCALL__continue_on_failure flag is passed, otherwise >0 && <=Nargs), and failures to copy between kernel space and user space are signalled by segfaults instead of the usual -EFAULT.

What I don't know is how this would port to other architectures, but it would sure be nice to have something like this in the kernel.

If this were possible for all archs, I imagine there could be a userspace wrapper that would provide type safety through some unions and macros (it could select a union member based on the syscall name, and all the unions would then get converted to the 6 longs or whatever the architecture du jour's equivalent of the 6 longs would be).

Related Solutions

Why isn’t OCaml more popular

The first answer is that nobody really knows why languages become popular, and anybody who says otherwise is deluded or has an agenda. (It's often easy to identify why a language fails to become popular, but that's another question.)

With that disclaimer, here are some points that are suggestive, most important first:

The first mature C compiler appeared in 1974; the first mature OCaml compiler appeared in the late 1990s. C has a 25-year head start.
C shipped with Unix, which was the biggest "killer app" of all time. For a long time, every CS department in the world had to have Unix, which meant that every instructor and everyone taking a CS course had an opportunity to be exposed to C. OCaml and ML are still waiting for their first killer app. (MLdonkey is cool, but it's not Unix.)
C fills its niche so well that I doubt there will ever be another low-level language devoted only to systems programming. (To see the evidence in favor, read Dennis Ritchie's paper on the history of C from HOPL II.) It's not even clear what OCaml's niche is, and Standard ML's niche is only a little clearer. So Caml and ML have quite a few competitors, whereas C killed off its only competitor (BLISS).
One of C's great strengths is that its cost model is very predictable: it is easy to look at any small fragment of C code and instantly get an accurate idea of what machine operations will have to be performed to execute that code. OCaml's cost model is much less clear, especially because memory allocation is much less explicit, and the overall cost of memory allocation (equals cost of allocation plus costs incurred during garbage collection) depends on emergent properties like how long objects live and which objects refer to other objects. The net result is that performance is hard to predict, and even hard to analyze after the fact. (OCaml's memory-profiling tools are not what they should be.) As a result, OCaml is not good for applications where performance must be very predictable, like embedded systems.
C is a language with a standard and many compilers. OCaml is a software artifact: the only compiler is from a single source, and the compiler is the standard. And that standard changes with every release. For people who value stability and backward compatibility, a single-source language may represent an unacceptable risk.
Anybody with a halfway-decent undergraduate compiler course and a lot of persistence can write a C compiler that more or less works, and with adequate performance. To get an implementation of OCaml or ML off the ground requires a lot more education, and to get comparable performance to a naive C compiler requires a lot more work. This means there are a lot fewer hobbyists to mess around with languages like OCaml, so it's harder for the community to develop a deep understanding about how to exploit it.

Java – How bad is it to call println() often rather than concatenating strings together and calling it once

There are two 'forces' here, in tension: Performance vs. Readability.

Let's tackle the third problem first though, long lines:

System.out.println("Good morning everyone. I am here today to present you with a very, very lengthy sentence in order to prove a point about how it looks strange amongst other code.");

The best way to implement this and keep readability is to use string concatenation:

System.out.println("Good morning everyone. I am here today to present you "
                 + "with a very, very lengthy sentence in order to prove a "
                 + "point about how it looks strange amongst other code.");

The String-constant concatenation will happen at compile time, and will have no effect on performance at all. The lines are readable, and you can just move on.

Now, about the:

System.out.println("Good morning.");
System.out.println("Please enter your name");

vs.

System.out.println("Good morning.\nPlease enter your name");

The second option is significantly faster; I would estimate about twice as fast. Why?

Because 90% (with a wide margin of error) of the work is not related to dumping the characters to the output, but is overhead needed to secure the output to write to it.

Synchronization

System.out is a PrintStream. All Java implementations that I know of internally synchronize the PrintStream; see the code on GrepCode.

What does this mean for your code?

It means that each time you call System.out.println(...) you are synchronizing your memory model, you are checking and waiting for a lock. Any other threads calling System.out will also be locked.

In single-threaded applications the impact of System.out.println() is often limited by the IO performance of your system, that is, how fast you can write out to a file. In multithreaded applications, the locking can be more of an issue than the IO.

Flushing

Each println is flushed. This causes the buffers to be cleared and triggers a Console-level write to the buffers. The amount of effort done here is implementation dependent, but it is generally understood that the performance of the flush is only in small part related to the size of the buffer being flushed. There is a significant overhead related to the flush, where memory buffers are marked as dirty, the virtual machine is performing IO, and so on. Incurring that overhead once, instead of twice, is an obvious optimization.

Some numbers

I put together the following little test:

public class ConsolePerf {

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            benchmark("Warm " + i);
        }
        benchmark("real");
    }

    private static void benchmark(String string) {
        benchString(string + "short", "This is a short String");
        benchString(string + "long", "This is a long String with a number of newlines\n"
                  + "in it, that should simulate\n"
                  + "printing some long sentences and log\n"
                  + "messages.");
        
    }
    
    private static final int REPS = 1000;

    private static void benchString(String name, String value) {
        long time = System.nanoTime();
        for (int i = 0; i < REPS; i++) {
            System.out.println(value);
        }
        double ms = (System.nanoTime() - time) / 1000000.0;
        System.err.printf("%s run in%n    %12.3fms%n    %12.3f lines per ms%n    %12.3f chars per ms%n",
                name, ms, REPS/ms, REPS * (value.length() + 1) / ms);
        
    }

    
}

The code is relatively simple, it repeatedly prints either a short, or a long string to output. The long String has multiple newlines in it. It measures how long it takes to print 1000 iterations of each.

If I run it at the unix (Linux) command-prompt, and redirect the STDOUT to /dev/null, and print the actual results to STDERR, I can do the following:

java -cp . ConsolePerf > /dev/null 2> ../errlog

The output (in errlog) looks like:

Warm 0short run in
           7.264ms
         137.667 lines per ms
        3166.345 chars per ms
Warm 0long run in
           1.661ms
         602.051 lines per ms
       74654.317 chars per ms
Warm 1short run in
           1.615ms
         619.327 lines per ms
       14244.511 chars per ms
Warm 1long run in
           2.524ms
         396.238 lines per ms
       49133.487 chars per ms
.......
Warm 99short run in
           1.159ms
         862.569 lines per ms
       19839.079 chars per ms
Warm 99long run in
           1.213ms
         824.393 lines per ms
      102224.706 chars per ms
realshort run in
           1.204ms
         830.520 lines per ms
       19101.959 chars per ms
reallong run in
           1.215ms
         823.160 lines per ms
      102071.811 chars per ms

What does this mean? Let me repeat the last 'stanza':

realshort run in
           1.204ms
         830.520 lines per ms
       19101.959 chars per ms
reallong run in
           1.215ms
         823.160 lines per ms
      102071.811 chars per ms

It means that, for all intents and purposes, even though the 'long' line is about 5 times longer and contains multiple newlines, it takes just about as long to output as the short line.

The number of characters per millisecond for the long run is about 5 times as high, and the elapsed time is about the same.

In other words, your performance scales relative to the number of printlns you have, not what they print.

Update: What happens if you redirect to a file, instead of to /dev/null?

realshort run in
           2.592ms
         385.815 lines per ms
        8873.755 chars per ms
reallong run in
           2.686ms
         372.306 lines per ms
       46165.955 chars per ms

It is a whole lot slower, but the proportions are about the same....