Docker – Why does Linux sysfs modification work in plain Docker but not under Kubernetes

Tags: docker, kubernetes, redis

The command being run inside the containers is:

echo never | tee /sys/kernel/mm/transparent_hugepage/enabled

Both containers run as privileged, but in the container started by Kubernetes the command fails with:
tee: /sys/kernel/mm/transparent_hugepage/enabled: Read-only file system

and under just plain docker run -it --privileged alpine /bin/sh the command works fine.

I have run docker inspect on both the k8s and non-k8s containers to verify privileged status, and I don't see anything else listed that should cause this problem. I diffed the two outputs and then used docker run with matching modifications to try to reproduce the problem in plain docker, but failed (it keeps working). Any idea why the Kubernetes-managed container fails while the plain docker container succeeds?
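One quick way to see the difference from inside each container is to check the mount options on /sys. The "Read-only file system" error from tee (and the fact that the workaround below remounts /sys read-write) suggests sysfs is mounted ro in the Kubernetes container. A minimal sketch to run in both containers:

```shell
# Show how sysfs is mounted in this container's mount namespace.
# "ro" in the options column means writes anywhere under /sys will fail
# with EROFS, regardless of file permissions or privileged status.
grep ' /sys ' /proc/mounts
```

Under plain docker run --privileged this typically prints something like sysfs /sys sysfs rw,..., while in the failing container the options should start with ro.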

This is reproducible with the pod definition here:

apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
      - -c
      - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true

Workaround

While I still don't know the cause for the discrepancy, the problem can be mitigated by remounting /sys as read-write.

apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
      - -c
      - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /sys
      name: sys
      readOnly: false
  volumes:
  - hostPath:
      path: /sys
    name: sys

Best Answer

On Kubernetes this works a bit differently: setting privileged: true in a container's securityContext is not enough to let you modify arbitrary sysctls from that container.

Take a look at the section of the official Kubernetes docs that describes Using sysctls in a Kubernetes Cluster. As you can read there:

Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod

  • must not have any influence on any other pod on the node
  • must not allow to harm the node's health
  • must not allow to gain CPU or memory resources outside of the resource limits of a pod.

By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls are supported in the safe set:

  • kernel.shm_rmid_forced,
  • net.ipv4.ip_local_port_range,
  • net.ipv4.tcp_syncookies,
  • net.ipv4.ping_group_range (since Kubernetes 1.18).

So in short, there are safe and unsafe sysctls. Most of them, even many of the namespaced ones, are considered unsafe. Unsafe sysctls must additionally be enabled by the cluster admin on a node-by-node basis:

All safe sysctls are enabled by default.

All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.

With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet; for example:

kubelet --allowed-unsafe-sysctls 'kernel.msg*,net.core.somaxconn' ...
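Once an unsafe sysctl has been allowed on the node, a pod can request it declaratively through spec.securityContext.sysctls instead of shelling out to sysctl -w. A sketch (the pod and container names are placeholders); note this mechanism only covers namespaced sysctls such as net.core.somaxconn:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: somaxconn-pod
spec:
  securityContext:
    sysctls:
    - name: net.core.somaxconn
      value: "8192"
  containers:
  - name: app
    image: alpine
    command: ["sleep", "infinity"]
```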

So you cannot simply set arbitrary sysctls, even from a privileged container running on your Kubernetes cluster.
