GKE – Context Deadline Exceeded: CreateContainerError and Failed to Reserve Container Name

docker, google-cloud-platform, google-kubernetes-engine, kubernetes

I am running a GKE cluster, and occasionally one of the nodes has problems with specific containers built from php7-alpine.

We run two types of containers: the first is built from php7-alpine, and the second is built from the first (php7-alpine -> Base App -> App with extra). Only our Base App pods have these issues.

So far, I've seen the following errors:

  • failed to reserve container name
  • FailedSync: error determining status: rpc error: code = Unknown desc = Error: No such container: XYZ
  • Error: context deadline exceeded context deadline exceeded: CreateContainerError

There is plenty of disk space left on the nodes, and kubectl describe pod doesn't show any relevant or helpful information.

A few more details:

  • Out of 50 Base App pods, 6 are in an error state; none of the App with extra pods are failing.
  • All failing pods are always on the same node.
  • We've recreated/replaced the nodes, but the problem still appears. If we replace the node hosting the faulty pods, there is roughly a 50% chance that all pods are OK on the next node. The problem appears to be somewhat random.
  • Running GKE v1.17.9-gke.1504
  • We are running on preemptible nodes.
  • The container image is quite big (~3 GB; we are working on reducing that).
  • The issue probably started around a month ago.
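
To check the "all failing pods are on the same node" observation quickly, here is a minimal sketch that groups failing pods by node. It assumes the output of kubectl get pods -o json has been saved to a file; the file name pods.json, the script name, and the function name are hypothetical, and the only waiting reason it looks for is the CreateContainerError status quoted above.

```python
import json
import sys
from collections import Counter

# Waiting reasons counted as failures; taken from the errors listed in the question.
FAILING_REASONS = {"CreateContainerError"}


def failing_pods_by_node(pod_list: dict) -> Counter:
    """Count, per node, the pods with at least one container stuck in a failing waiting state."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in FAILING_REASONS:
                counts[node] += 1
                break  # count each pod only once
    return counts


# Hypothetical usage:
#   kubectl get pods -o json > pods.json
#   python failing_by_node.py pods.json
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        for node, count in failing_pods_by_node(json.load(f)).most_common():
            print(f"{node}\t{count}")
```

If every failing pod lands on a single node in the output, that points at node-level state (for example the container runtime) rather than at the image or the Deployment itself.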

I really have no clue what to look for, and I've searched extensively for a similar issue. Any help is greatly appreciated!

Update:

Here is the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: my-app
    appType: web
    env: prod
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-app
        version: v1.0
    spec:
      containers:
        - image: richarvey/nginx-php-fpm:latest  # We build upon that image to add content and services
          lifecycle:
            preStop:
              exec:
                command:
                  - /entry-point/stop.sh
          name: web
          ports:
            - containerPort: 80
              protocol: TCP
          resources:
            requests:
              cpu: 50m
              memory: 1500Mi
        - image: redis:4.0-alpine
          name: redis
          resources:
            requests:
              cpu: 25m
              memory: 25Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File

Best Answer

The issue was investigated and fixed. See the containerd issue:

https://github.com/containerd/containerd/issues/4604