Docker – Why does Linux sysfs modification work in plain Docker but not under Kubernetes

Tags: docker, kubernetes, redis

The command being run inside the containers is:

echo never | tee /sys/kernel/mm/transparent_hugepage/enabled

Both containers run as privileged, but in the container started by Kubernetes the command fails with:
tee: /sys/kernel/mm/transparent_hugepage/enabled: Read-only file system

and under just plain docker run -it --privileged alpine /bin/sh the command works fine.

I have run docker inspect on both the k8s and non-k8s containers to verify privileged status, and I don't see anything else listed that should cause this problem. I diffed the two outputs and then used docker run with matching modifications to try to reproduce the problem in plain docker, but failed (it keeps working). Any idea why the Kubernetes-managed container fails while the plain docker container succeeds?
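One quick way to see the difference from inside each container is to check the mount options on /sys. The "Read-only file system" error from tee (and the fact that the workaround below remounts /sys read-write) suggests sysfs is mounted ro in the Kubernetes container. A minimal sketch to run in both containers:

```shell
# Show how sysfs is mounted in this container's mount namespace.
# "ro" in the options column means writes anywhere under /sys will fail
# with EROFS, regardless of file permissions or privileged status.
grep ' /sys ' /proc/mounts
```

Under plain docker run --privileged this typically prints something like sysfs /sys sysfs rw,..., while in the failing container the options should start with ro.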

This is reproducible with the pod definition here:

apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
      - -c
      - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true

Workaround

While I still don't know the cause for the discrepancy, the problem can be mitigated by remounting /sys as read-write.

apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
      - -c
      - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /sys
      name: sys
      readOnly: false
  volumes:
  - hostPath:
      path: /sys
    name: sys

Best Answer

On Kubernetes this works a bit differently: setting privileged: true in a container's securityContext is not enough to let you modify arbitrary sysctls from that container.

Take a look at the section of the official Kubernetes docs that describes Using sysctls in a Kubernetes Cluster. As you can read there:

Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod

  • must not have any influence on any other pod on the node
  • must not allow to harm the node's health
  • must not allow to gain CPU or memory resources outside of the resource limits of a pod.

By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls are supported in the safe set:

  • kernel.shm_rmid_forced,
  • net.ipv4.ip_local_port_range,
  • net.ipv4.tcp_syncookies,
  • net.ipv4.ping_group_range (since Kubernetes 1.18).

So in short, there are safe and unsafe sysctls. Most of them, even many of the namespaced ones, are considered unsafe. Unsafe sysctls must additionally be enabled by the cluster admin on a node-by-node basis:

All safe sysctls are enabled by default.

All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.

With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet; for example:

kubelet --allowed-unsafe-sysctls 'kernel.msg*,net.core.somaxconn' ...
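Once an unsafe sysctl has been allowed on the node, a pod can request it declaratively through spec.securityContext.sysctls instead of shelling out to sysctl -w. A sketch (the pod and container names are placeholders); note this mechanism only covers namespaced sysctls such as net.core.somaxconn:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: somaxconn-pod
spec:
  securityContext:
    sysctls:
    - name: net.core.somaxconn
      value: "8192"
  containers:
  - name: app
    image: alpine
    command: ["sleep", "infinity"]
```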

So you cannot simply set arbitrary sysctls, even from a privileged container running on your Kubernetes cluster.
