Background:
System call overhead is much larger than function call overhead (estimates range from 20-100x), mostly because of the switch from user space to kernel space and back. Functions are commonly inlined to save function call overhead, and function calls are already much cheaper than syscalls. It stands to reason that developers would want to avoid some of the system call overhead by doing as much in-kernel work as possible in one syscall.
Problem:
This has created a lot of (superfluous?) system calls like sendmmsg(), recvmmsg() as well as the chdir, open, lseek and/or symlink combinations like: openat, mkdirat, mknodat, fchownat, futimesat, newfstatat, unlinkat, fchdir, ftruncate, fchmod, renameat, linkat, symlinkat, readlinkat, fchmodat, faccessat, lsetxattr, fsetxattr, execveat, lgetxattr, llistxattr, lremovexattr, fremovexattr, flistxattr, fgetxattr, pread, pwrite, etc…
Now Linux has added copy_file_range(), which apparently combines the read, lseek and write syscalls. It's only a matter of time before this becomes fcopy_file_range(), lcopy_file_range(), copy_file_rangeat(), fcopy_file_rangeat() and lcopy_file_rangeat()… but since there are 2 files involved, instead of X more calls it could become X^2 more. OK, Linus and the various BSD developers wouldn't let it go that far, but my point is that if there were a batching syscall, all (most?) of these could be implemented in user space, reducing kernel complexity without adding much, if any, overhead on the libc side.
Many complex solutions have been proposed that include some form of special syscall thread for non-blocking syscalls to batch process syscalls; however, these methods add significant complexity to both the kernel and user space, in much the same way as libxcb vs. libX11 (the asynchronous calls require a lot more setup).
Solution?:
A generic batching syscall. This would alleviate the largest cost (multiple mode switches) without the complexities associated with having a specialized kernel thread (though that functionality could be added later).
There is basically already a good basis for a prototype in the socketcall() syscall. Just extend it from taking an array of arguments to instead taking an array of returns, a pointer to arrays of arguments (which include the syscall number), the number of syscalls and a flags argument… something like:
batch(void *returns, void *args, long ncalls, long flags);
One major difference would be that the arguments would probably all need to be pointers for simplicity, so that the results of prior syscalls could be used by subsequent syscalls (for instance, the file descriptor from open() for use in read()/write()).
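To make the pointer-argument idea concrete, here is a minimal user-space sketch. It only emulates the proposed interface by issuing each call through syscall(), so it saves no mode switches. The names struct batch_call and batch_emulated(), the six-argument limit, and the "stop at the first failure and return the number of calls attempted" behaviour are all assumptions made for illustration, not part of any existing kernel interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Pointers are smuggled through longs below; true on Linux ABIs. */
static_assert(sizeof(void *) == sizeof(long), "pointer must fit in a long");

/* Hypothetical layout: one entry per batched call. Every argument is a
 * pointer to a long, so it may point at an earlier slot of returns[]. */
struct batch_call {
    long nr;              /* syscall number, e.g. SYS_write */
    const long *args[6];  /* NULL means "unused" and is passed as 0 */
};

/* User-space stand-in for the proposed batch(): still one syscall per entry.
 * Stores each result (or -errno) in returns[i]; stops at the first failure.
 * Returns the number of calls attempted. flags is ignored in this sketch. */
static long batch_emulated(long *returns, const struct batch_call *calls,
                           long ncalls, long flags)
{
    long i;
    (void)flags;
    for (i = 0; i < ncalls; i++) {
        long a[6];
        for (int j = 0; j < 6; j++)  /* dereference at dispatch time */
            a[j] = calls[i].args[j] ? *calls[i].args[j] : 0;
        long r = syscall(calls[i].nr, a[0], a[1], a[2], a[3], a[4], a[5]);
        returns[i] = (r == -1) ? -errno : r;
        if (returns[i] < 0)
            return i + 1;
    }
    return i;
}

/* open -> write -> close, where write and close read the fd from ret[0]. */
static long demo_open_write_close(void)
{
    static const char msg[] = "hello";
    long ret[3] = { 0, 0, 0 };
    const long dirfd = AT_FDCWD;
    const long path = (long)"/dev/null";
    const long oflags = O_WRONLY;
    const long buf = (long)msg;
    const long len = 5;
    const struct batch_call calls[3] = {
        { SYS_openat, { &dirfd, &path, &oflags } },
        { SYS_write,  { &ret[0], &buf, &len } },
        { SYS_close,  { &ret[0] } },
    };
    long done = batch_emulated(ret, calls, 3, 0);
    /* Return the byte count reported by the write, or -1 if anything failed. */
    return (done == 3 && ret[0] >= 0 && ret[2] == 0) ? ret[1] : -1;
}
```

Because the arguments are only dereferenced when each call is dispatched, the write and close entries can be set up before the file descriptor exists. An in-kernel version would have to do the same dereferencing with user-space copies, which is where the extra cost and the fault handling would come in.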
Some possible advantages:
- less user space -> kernel space -> user space switching
- possible compiler switch -fcombine-syscalls to try to batch automagically
- optional flag for asynchronous operation (return fd to watch immediately)
- ability to implement future combined syscall functions in userspace
Question:
Is it feasible to implement a batching syscall?
- Am I missing some obvious gotchas?
- Am I overestimating the benefits?
Is it worthwhile for me to bother implementing a batching syscall (I don't work at Intel, Google or Red Hat)?
- I have patched my own kernel before, but dread dealing with the LKML.
- History has shown that even if something is widely useful to "normal" users (non-corporate end users without git write access), it may never get accepted upstream (unionfs, aufs, cryptodev, tuxonice, etc…)
Best Answer
I tried this on x86_64
Patch against 94836ecf1e7378b64d37624fbb81fe48fbd4c772: (also here https://github.com/pskocik/linux/tree/supersyscall )
And it appears to work -- I can write hello to fd 1 and world to fd 2 with just one syscall:
Basically I'm using:
as a universal syscall prototype, which appears to be how things work on x86_64, so my "super" syscall is:
It returns the number of syscalls tried (==Nargs if the SUPERSYSCALL__continue_on_failure flag is passed, otherwise >0 && <=Nargs), and failures to copy between kernel space and user space are signalled by segfaults instead of the usual -EFAULT.
What I don't know is how this would port to other architectures, but it would sure be nice to have something like this in the kernel.
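The return convention described above (count of syscalls tried, with an optional continue-on-failure flag) can be illustrated in user space without the patch. This is only an emulation that issues one real syscall per entry; struct emu_call, emu_supersyscall(), EMU_CONTINUE_ON_FAILURE and the per-entry ret field are hypothetical names chosen for this sketch and are not the patch's actual data layout.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Pointers are passed as longs below; true on Linux ABIs. */
static_assert(sizeof(void *) == sizeof(long), "pointer must fit in a long");

/* Hypothetical stand-in for the flag named in the text. */
#define EMU_CONTINUE_ON_FAILURE 1L

/* Hypothetical record: syscall number, six long arguments, result slot. */
struct emu_call {
    long nr;
    long args[6];
    long ret;  /* result, or -errno on failure */
};

/* Runs the entries in order and returns how many were tried.
 * Without EMU_CONTINUE_ON_FAILURE it stops after the first failing call. */
static long emu_supersyscall(struct emu_call *calls, long ncalls, long flags)
{
    long tried = 0;
    while (tried < ncalls) {
        struct emu_call *c = &calls[tried++];
        long r = syscall(c->nr, c->args[0], c->args[1], c->args[2],
                         c->args[3], c->args[4], c->args[5]);
        c->ret = (r == -1) ? -errno : r;
        if (c->ret < 0 && !(flags & EMU_CONTINUE_ON_FAILURE))
            break;
    }
    return tried;
}

/* Three writes; the middle one targets fd -1 and fails with EBADF. */
static long demo_tried(long flags)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    struct emu_call calls[3] = {
        { SYS_write, { p[1], (long)"hello", 5 }, 0 },
        { SYS_write, { -1,   (long)"oops",  4 }, 0 },
        { SYS_write, { p[1], (long)"world", 5 }, 0 },
    };
    long tried = emu_supersyscall(calls, 3, flags);
    close(p[0]);
    close(p[1]);
    return tried;
}
```

With no flag, the demo stops after the failing second write and reports 2 calls tried; with the continue flag, all 3 are tried and the caller has to inspect each ret to find the failure.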
If this were possible for all archs, I imagine there could be a userspace wrapper that would provide type safety through some unions and macros (it could select a union member based on the syscall name, and all the unions would then get converted to the 6 longs, or whatever the architecture du jour's equivalent of the 6 longs would be).
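One way such a wrapper might look, as a rough sketch only: a macro picks the union member named after the syscall, the compiler checks the arguments against that member's fields, and a small per-syscall function lowers it to six longs. Everything here (struct raw_call, union typed_args, lower_write(), lower_close(), RAW_CALL) is a hypothetical name for illustration, and only write and close are covered.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Pointers are lowered to longs below; true on Linux ABIs. */
static_assert(sizeof(void *) == sizeof(long), "pointer must fit in a long");

/* Untyped form: what a batching syscall would presumably consume. */
struct raw_call {
    long nr;
    long args[6];
};

/* Typed views of the arguments, one struct per syscall. */
struct write_args { int fd; const void *buf; size_t count; };
struct close_args { int fd; };

/* The macro below selects a member of this union by syscall name. */
union typed_args {
    struct write_args write;
    struct close_args close;
};

static struct raw_call lower_write(struct write_args a)
{
    struct raw_call c = { SYS_write, { a.fd, (long)a.buf, (long)a.count } };
    return c;
}

static struct raw_call lower_close(struct close_args a)
{
    struct raw_call c = { SYS_close, { a.fd } };
    return c;
}

/* RAW_CALL(write, fd, buf, n) initializes the .write member, so the
 * arguments are checked against struct write_args, then lowers it. */
#define RAW_CALL(name, ...) \
    lower_##name((union typed_args){ .name = { __VA_ARGS__ } }.name)

/* Returns 1 if the typed calls lower to the expected raw records. */
static int demo_lowering(void)
{
    static const char msg[] = "world";
    struct raw_call w = RAW_CALL(write, 2, msg, 5);
    struct raw_call c = RAW_CALL(close, 7);
    return w.nr == SYS_write && w.args[0] == 2 &&
           w.args[1] == (long)msg && w.args[2] == 5 &&
           c.nr == SYS_close && c.args[0] == 7 && c.args[1] == 0;
}
```

Passing a string where the fd belongs, for example, would draw a compiler diagnostic from the struct initializer, which is about as much type safety as a macro layer like this can offer. A port to another architecture would only need to change struct raw_call and the lower_* functions.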