I am running a GKE cluster, and sometimes one of the nodes has issues with specific containers built from php7-alpine.
We run two types of containers: the first type is built from php7-alpine, and the second type is built from the first type (php7-alpine -> Base App -> App with extra). Only our Base App pods have these issues.
So far, I've seen the following errors:
- failed to reserve container name
- FailedSync: error determining status: rpc error: code = Unknown desc = Error: No such container: XYZ
- Error: context deadline exceeded context deadline exceeded: CreateContainerError
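These surface as pod events; for reference, they can be listed for one failing pod with something like the following (my-app-xyz is a placeholder pod name):

kubectl get events -n default --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=my-app-xyz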
There is plenty of disk space left on the nodes, and kubectl describe pod doesn't contain any relevant/helpful information.
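Since kubectl describe is quiet, the next level down is the node itself. Assuming a containerd-based GKE node image, the runtime and kubelet logs can be read over SSH (node name and zone are placeholders):

gcloud compute ssh gke-my-cluster-default-pool-abc123 --zone us-central1-a
# then, on the node:
sudo journalctl -u containerd --since "1 hour ago"
sudo journalctl -u kubelet --since "1 hour ago"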
A few more details:
- Out of 50 Base App pods, 6 are in error; out of the App with extra pods, none are failing.
- All failing pods are always on the same node.
- We've recreated/replaced the nodes. The problem still appears; if we replace the node hosting the faulty pods, there is roughly a 50/50 chance that all pods will be OK on the next node. The problem appears somewhat random.
- Running GKE v1.17.9-gke.1504
- We are running on preemptible nodes.
- The container image is quite big (~3 GB; we're working on reducing that).
- The issue started around a month ago.
I really have no clue what to look for, and I've searched extensively for a similar issue. Any help is greatly appreciated!
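For anyone trying to reproduce the picture above: the wide pod listing includes a NODE column, which is how the "all failing pods share a node" pattern shows up (namespace assumed to be default):

# Non-Running pods all list the same value in the NODE column.
kubectl get pods -n default -o wide | grep -v Running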
Update:
Here is the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: my-app
    appType: web
    env: prod
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-app
        version: v1.0
    spec:
      containers:
      - image: richarvey/nginx-php-fpm:latest # We build upon that image to add content and services
        lifecycle:
          preStop:
            exec:
              command:
              - /entry-point/stop.sh
        name: web
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          requests:
            cpu: 50m
            memory: 1500Mi
      - image: redis:4.0-alpine
        name: redis
        resources:
          requests:
            cpu: 25m
            memory: 25Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
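This is not from the original post, but given the pull-heavy symptoms (context deadline exceeded on container creation), the ~3 GB image, and preemptible nodes that get recreated often, one common mitigation is a pre-pull DaemonSet that caches the large image on every node before app pods schedule there. A minimal sketch, assuming the same image as the deployment above; all names are illustrative:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-my-app        # illustrative name
  namespace: default
spec:
  selector:
    matchLabels:
      app: prepull-my-app
  template:
    metadata:
      labels:
        app: prepull-my-app
    spec:
      initContainers:
      - name: prepull
        # Pulling this large image is the whole point; the command does nothing.
        image: richarvey/nginx-php-fpm:latest
        command: ["sh", "-c", "exit 0"]
      containers:
      - name: pause
        # Tiny no-op container that keeps the DaemonSet pod running.
        image: k8s.gcr.io/pause:3.2
        resources:
          requests:
            cpu: 5m
            memory: 8Mi

Once every node has the image cached, subsequent app pod starts should skip the long pull entirely.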
Best Answer
The issue was investigated and fixed upstream in containerd: https://github.com/containerd/containerd/issues/4604
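To verify whether a node pool already runs a containerd build that contains the fix, the runtime version is visible per node:

# The CONTAINER-RUNTIME column shows e.g. containerd://1.x.y for each node.
kubectl get nodes -o wide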