Google Cloud Kubernetes – Fix Systemd 100% CPU Usage

google-cloud-platformgoogle-kubernetes-enginekubernetes

Last week, after upgrading our GKE cluster to Kubernetes 1.13.6-gke.13, all of our nodes in the cluster started to fail due to high CPU usage. It's the Kubernetes software on the nodes themselves which uses up all the CPU, not the pods.

This is what top shows when we SSH into a node:

top - 10:11:27 up 5 days, 22 min,  1 user,  load average: 23.71, 21.90, 20.32
Tasks: 858 total,   3 running, 854 sleeping,   0 stopped,   1 zombie
%Cpu(s): 33.5 us, 30.9 sy,  0.0 ni, 35.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  30157.0 total,  14125.6 free,   4771.2 used,  11260.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  24762.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 root      20   0  114636  22564   4928 R  65.2   0.1   5254:53 systemd
    356 message+  20   0  110104  98784   2916 S  33.8   0.3   2705:49 dbus-daemon
   1236 root      20   0 3174876 147328  57464 S  22.0   0.5   1541:29 kubelet
    366 root      20   0   12020   3864   3396 S  21.6   0.0   1705:21 systemd-logind
   1161 root      20   0 5371428 586832  26756 S  18.7   1.9   1848:18 dockerd
    432 root      20   0 5585144  49040  13420 S  11.1   0.2 848:54.06 containerd
  23797 root      20   0  994620   8784   6088 S   3.0   0.0  96:58.79 containerd-shim
  45359 root      20   0  994620   8968   5600 S   3.0   0.0 100:28.46 containerd-shim
  35913 root      20   0 1068352   8192   5600 S   2.3   0.0 104:54.12 containerd-shim
  10806 root      20   0  994620   8908   5596 S   2.0   0.0 102:57.45 containerd-shim
  15378 root      20   0  994620   9084   5600 S   2.0   0.0 102:24.08 containerd-shim
  33141 root      20   0  994620   8856   5848 S   2.0   0.0  95:26.89 containerd-shim
  34299 root      20   0  994620   8824   5848 S   2.0   0.0  90:55.28 containerd-shim
  48411 root      20   0  994620   9488   6216 S   2.0   0.0  95:38.56 containerd-shim
1824641 root      20   0 1068352   6836   5716 S   2.0   0.0  65:45.81 containerd-shim
  10257 root      20   0  994620   9404   5596 S   1.6   0.0 101:10.45 containerd-shim
  15400 root      20   0 1068352   8916   6216 S   1.6   0.0  97:47.99 containerd-shim
  22895 root      20   0 1068352   8408   5472 S   1.6   0.0 102:55.97 containerd-shim
  29969 root      20   0  994620   9124   5600 S   1.6   0.0  98:32.32 containerd-shim
  34720 root      20   0  994620   8444   5472 S   1.6   0.0  97:23.98 containerd-shim
  10073 root      20   0 1068352   9796   6152 S   1.3   0.0 100:54.30 containerd-shim

After this began, pods on these nodes were CPU starved, worked poorly and got killed due to bad liveness checks.

To attempt to resolve the issue we recreated all the nodes. We created a new pool with equivalent resources and migrated all pods over by scaling down the old pool to 0 nodes. (This was difficult because at least 2 of our previous nodes failed to shut down, even after a long time. In the end we had to shut down the underlying VMs to kill those nodes.) It didn't help.

We also tried switching from Container OS to regular Ubuntu for the nodes. This also did not help.

Next we created yet another pool, this time with twice as much CPU. It helped but didn't fix the underlying problem. Some nodes still had extremely high CPU usage for systemd, dbus-daemon, kubelet etc.

Now we are running with tons of extra CPU per node and it masks the problem. There's enough CPU to also run actual pods in addition to the problematic system services.

But the system as a whole is still wasting tons of resources (and money) so we really need to figure out what's actually going on here. What's systemd doing that's so CPU intensive? What's dbus-daemon up to? We checked various logs (see below) but there are no obvious errors.

How do we find out what's actually wrong here?


Snippet from journalctl -u kubelet:

Jul 04 05:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:49:34.849808    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 05:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:54:34.850598    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 05:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:59:34.851797    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:04:34.858344    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:09:34.859626    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:14:34.861142    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:19:34.862185    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:24:34.863160    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:29:34.864156    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:34:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:34:34.865041    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:39:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:39:34.866044    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:44:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:44:34.866969    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:49:34.867979    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:54:34.869429    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:59:34.870359    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:04:34.871190    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:09:34.872063    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:14:34.873240    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:19:34.874123    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:24:34.875010    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:29:34.876612    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:34:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:34:34.877420    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:39:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:39:34.878368    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:44:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:44:34.879221    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:49:34.880239    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:54:34.881172    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:59:34.881868    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:04:34.882653    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:09:34.883432    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:14:34.884175    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:19:34.885043    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:24:34.885845    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:29:34.886690    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service

Output from strace -c -p `pidof systemd`:

strace: Process 1 attached
strace: Process 1 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.20    0.109101           5     22827           lstat
 17.45    0.078654           9      8610           sendmsg
 15.51    0.069914           7     10549           read
 10.96    0.049406           3     17310           getpid
  6.45    0.029075           8      3806           openat
  6.08    0.027385           7      3783           readlinkat
  5.76    0.025945           3      7579           fstat
  4.47    0.020167           4      5700           getrandom
  3.62    0.016301           4      3806           close
  3.14    0.014133           7      1892           access
  2.28    0.010278           5      1924        11 stat
  0.03    0.000145           4        33           epoll_wait
  0.02    0.000089           4        22           readlink
  0.01    0.000029           3        11           prctl
  0.01    0.000029           3        11           clock_gettime
  0.01    0.000027           2        11           getgid
  0.01    0.000026           2        11           geteuid
  0.00    0.000020           2        11           getuid
  0.00    0.000020           2        11           getegid
------ ----------- ----------- --------- --------- ----------------
100.00    0.450744                 87907        11 total

What's strace lstat'ing? Here are some relevant looking snippets I took at random from the full strace output:

...
stat("/dev/sds", {st_mode=S_IFBLK|0660, st_rdev=makedev(65, 32), ...}) = 0
readlinkat(AT_FDCWD, "/sys/dev/block/65:32", "../../devices/pci0000:00/0000:00"..., 99) = 83
lstat("/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
lstat("/sys/dev", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/dev/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block/sds", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
access("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block/sds/uevent", F_OK) = 0
openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:19/0:0:19:0/block/sds/uevent", O_RDONLY|O_CLOEXEC) = 16

...

lstat("/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
lstat("/sys/dev", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/dev/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
lstat("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0/block/sdr", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
access("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:18/0:0:18:0/block/sdr/uevent", F_OK) = 0

dbus-monitor --system --profile | head -n 20:

dbus-monitor: unable to enable new-style monitoring: org.freedesktop.DBus.Error.AccessDenied: "Rejected send message, 1 matched rules; type="method_call", sender=":1.165" (uid=5004 pid=769854 comm="dbus-monitor --system --profile ") interface="org.freedesktop.DBus.Monitoring" member="BecomeMonitor" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (bus)". Falling back to eavesdropping.
#type    timestamp    serial    sender    destination    path    interface    member
#                    in_reply_to
sig    1562263380.765023    2    org.freedesktop.DBus    :1.165    /org/freedesktop/DBus    org.freedesktop.DBus    NameAcquired
sig    1562263380.953812    132870362    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d5d1be6e08bfe7552d6f9ee50a943eca88dd0dd749ec248594aa0be91879a2cdb_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957890    132870363    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d5d1be6e08bfe7552d6f9ee50a943eca88dd0dd749ec248594aa0be91879a2cdb_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957912    132870364    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d539cd8b8367913182e438aa1b3b05714c8f3f131e20bcadabdeb850c375a8125_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957918    132870365    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d539cd8b8367913182e438aa1b3b05714c8f3f131e20bcadabdeb850c375a8125_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957923    132870366    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2dcontainers_2d4c04ee8d1bf693ff2c9300b198b2b47bbf2c240265af5b9edc1f07b6cbd0c1ce_2dmounts_2dshm_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958014    132870367    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2dcontainers_2d4c04ee8d1bf693ff2c9300b198b2b47bbf2c240265af5b9edc1f07b6cbd0c1ce_2dmounts_2dshm_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958020    132870368    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d872dd2c0a63a9f3b5c9c5e4972e06fcf89d28b4c7f59aea7ea4d38f5a6bf0db6_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958025    132870369    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d872dd2c0a63a9f3b5c9c5e4972e06fcf89d28b4c7f59aea7ea4d38f5a6bf0db6_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958029    132870370    :1.157    <none>    /org/freedesktop/systemd1/unit/home_2dkubernetes_2dcontainerized_5fmounter_2drootfs_2dvar_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958111    132870371    :1.157    <none>    /org/freedesktop/systemd1/unit/home_2dkubernetes_2dcontainerized_5fmounter_2drootfs_2dvar_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958117    132870372    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958121    132870373    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958126    132870374    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d667d713fe45ea0af609c85fbfd3aafbca9494574fe10509bda8cd3d13d1e6b66_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958223    132870375    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d667d713fe45ea0af609c85fbfd3aafbca9494574fe10509bda8cd3d13d1e6b66_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958229    132870376    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d0a9f899f23c0965b43cae81dd04f46a78e4b1b063fa5679323146b06932474a9_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958233    132870377    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d0a9f899f23c0965b43cae81dd04f46a78e4b1b063fa5679323146b06932474a9_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958238    132870378    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d875734c124694cd54a7c26516f740cedcb853a18ff5805a89669747c1188b65d_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged

Best Answer

After working with Google we finally resolved this issue. The short version is that we had to create a new node pool running GKE 1.12.8-gke.10 and transfer all our pods using persistent volumes to this pool.

Long Version

Here's something I wish I would have thought about right away when debugging this, but I have to give credit to Francisco and Lynn over at Google's support team for this.

A key moment in the process of troubleshooting was when we segmented our load. At one point we designated a specific node pool to each kind of pod we had. This allowed us to answer the question: is the problem specific to a certain kind of pod? We had a suspicion that mounting activity was related to the issue, so it was of particular interest to see if pods with persistent volumes were correlated with node performance degradation.

This turned out to be true. Only nodes running pods with volumes ran hot.

The root cause

The hypothesis from Google's GKE team is that a change in Docker between versions 17.03 and 18.09 results in massive amounts of systemd activity whenever something is exec'ed in the pods. In particular, one of their engineers identified a change in runc which causes all mount units to be reloaded as part of an exec liveness probe.

So three ingredients are the recipe for this issue:

  1. Docker 18 (as used by GKE 13)
  2. Many volume mounts per node
  3. The use of exec for liveness probes (or otherwise I suppose)

We run about 40 Redis pods per node. As is customary, our Redis pods use an exec based liveness probe. Also each pod has a mounted volume for permanent storage. The liveness probe ran every minute. So we had 40 volumes being reloaded 40 times per minute.

Resolving the issue

To resolve the issue we isolated all the Redis load using node taints. We created a new node pool dedicated to Redis, running GKE 1.12.8-gke.10. While GKE does not allow you to downgrade a cluster, it does allow you to create new node pools running older versions. Naturally we disabled auto-upgrade for this new pool.

Then we forced all the Redis pods to migrate over to this new pool.

The results were immediate: we can now run as much load with 6 CPU cores as we could with 24 before and the situation appears to be stable.

Related Topic