Docker – Kubernetes Pod OOMKilled Issue

docker, kubernetes, memory, oom

The scenario: we run some websites based on an nginx image in a Kubernetes cluster. Originally the cluster had nodes with 2 cores and 4 GB RAM each, and the pods were configured with cpu: 40m and memory: 100MiB. Later we upgraded the cluster to nodes with 4 cores and 8 GB RAM each, but then every pod kept getting OOMKilled. After we increased the memory on every pod to around 300MiB, everything seems to be working fine.

My question is: why does this happen, and how do I solve it? P.S. If we revert to nodes with 2 cores and 4 GB RAM each, the pods work just fine with the lower 100MiB setting.

Best Answer

First of all, a pod should not need more memory/CPU just because the node has more resources. Without your specs it is hard to point out what might be wrong config-wise, but I would like to explain the concept to make it clearer.

You mentioned your pod configuration but did not specify whether those values are limits or requests.

  • Requests are what the container is guaranteed to get. If a container requests a resource, Kubernetes will only schedule it on a node that can give it that resource. Requests do not cause OOM kills; if they cannot be satisfied, the pod simply does not get scheduled.

  • Limits, on the other hand, make sure a container never goes above a certain value. Exceeding a memory limit is what triggers an OOM kill (see the sketch right after this list).
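
Here is a minimal sketch of how the two settings behave for memory; the pod and container names are made up, and the numbers simply mirror the ones from the question:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-site              # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        memory: "100Mi"         # scheduling guarantee: if no node has 100Mi free, the pod stays Pending
        cpu: "40m"
      limits:
        memory: "100Mi"         # hard ceiling: if the container tries to use more, it gets OOMKilled
        cpu: "500m"             # a CPU limit only throttles the container; it never triggers an OOM kill

You can check which case you are hitting with kubectl describe pod, which shows both the configured requests/limits and the last termination reason (OOMKilled).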

Requests and limits are on a per-container basis. While Pods usually contain a single container, it’s common to see Pods with multiple containers as well. Each container in the Pod gets its own individual limit and request, but because Pods are always scheduled as a group, you need to add the limits and requests for each container together to get an aggregate value for the Pod.

Below is an example of a pod with two containers, each specifying the same requests and limits:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "password"
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: wp
    image: wordpress
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

If you want to calculate the requests and limits for the whole pod, you need to sum those values, giving a request of 0.5 CPU and 128 MiB of memory, and a limit of 1 CPU and 256 MiB of memory.
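
Spelled out, the tally for the example above is just addition (shown here as comments, not an API object):

# requests: cpu     250m + 250m   = 500m  (0.5 CPU)
#           memory   64Mi + 64Mi  = 128Mi
# limits:   cpu     500m + 500m   = 1000m (1 CPU)
#           memory  128Mi + 128Mi = 256Mi

The scheduler uses the summed requests when deciding whether the pod fits on a node; the per-container limits are what gets enforced at runtime on the node itself.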

If you want to find out more about that topic, check out the official Kubernetes documentation on managing resources for containers.

Please let me know if that helped.
