I am using eksctl to set up a cluster on EKS/AWS.
Following the guide in the EKS documentation, I use default values for pretty much everything.
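I created the cluster with essentially the guide's default command, something like this (the cluster name and region here are illustrative):

eksctl create cluster --name demo-cluster --region eu-west-1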
The cluster is created successfully, I update the Kubernetes configuration from the cluster, and I can run the various kubectl commands successfully – e.g. "kubectl get nodes" shows me the nodes are in the "Ready" state.
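For example (names redacted, output representative):

$ kubectl get nodes
NAME                                STATUS   ROLES    AGE   VERSION
ip-[x].eu-west-1.compute.internal   Ready    <none>   2m    v1.14.x
ip-[x].eu-west-1.compute.internal   Ready    <none>   2m    v1.14.x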
I do not touch anything else; I have a clean, out-of-the-box cluster with no other changes made, and so far everything appears to be working as expected. I don't deploy any applications to it; I just leave it alone.
The problem is that after a relatively short period of time (roughly 30 minutes after the cluster is created), the nodes change from "Ready" to "NotReady", and they never recover.
The event log shows this (I redacted the IPs):
LAST SEEN TYPE REASON OBJECT MESSAGE
22m Normal Starting node/ip-[x] Starting kubelet.
22m Normal NodeHasSufficientMemory node/ip-[x] Node ip-[x] status is now: NodeHasSufficientMemory
22m Normal NodeHasNoDiskPressure node/ip-[x] Node ip-[x] status is now: NodeHasNoDiskPressure
22m Normal NodeHasSufficientPID node/ip-[x] Node ip-[x] status is now: NodeHasSufficientPID
22m Normal NodeAllocatableEnforced node/ip-[x] Updated Node Allocatable limit across pods
22m Normal RegisteredNode node/ip-[x] Node ip-[x] event: Registered Node ip-[x] in Controller
22m Normal Starting node/ip-[x] Starting kube-proxy.
21m Normal NodeReady node/ip-[x] Node ip-[x] status is now: NodeReady
7m34s Normal NodeNotReady node/ip-[x] Node ip-[x] status is now: NodeNotReady
Same events for the other node in the cluster.
Connecting to the instance and inspecting /var/log/messages shows this at the same time the node goes to NotReady:
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.259207 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.385044 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621271 3896 reflector.go:270] object-"kube-system"/"aws-node-token-bdxwv": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621320 3896 reflector.go:270] object-"kube-system"/"coredns": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.638850 3896 reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1beta1.RuntimeClass: the server has asked for the client to provide credentials (get runtimeclasses.node.k8s.io)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.707074 3896 reflector.go:270] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.711386 3896 reflector.go:270] object-"kube-system"/"coredns-token-67fzd": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.714899 3896 reflector.go:270] object-"kube-system"/"kube-proxy-config": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.720884 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868003 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868067 3896 controller.go:125] failed to ensure node lease exists, will retry in 200ms, error: Get https://[X]/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/ip-[x]?timeout=10s: write tcp 192.168.91.167:50866->34.249.27.158:443: use of closed network connection
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017157 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017182 3896 kubelet_node_status.go:372] Unable to update node status: update node status exceeds retry count
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.200053 3896 controller.go:125] failed to ensure node lease exists, will retry in 400ms, error: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.517193 3896 reflector.go:270] object-"kube-system"/"kube-proxy": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.729756 3896 controller.go:125] failed to ensure node lease exists, will retry in 800ms, error: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.752267 3896 reflector.go:126] object-"kube-system"/"aws-node-token-bdxwv": Failed to list *v1.Secret: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.824988 3896 reflector.go:126] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.899566 3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963756 3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963822 3896 reflector.go:126] object-"kube-system"/"kube-proxy-config": Failed to list *v1.ConfigMap: Unauthorized
CloudWatch logs for the authenticator component show many of these messages:
time="2020-03-07T10:40:37Z" level=warning msg="access denied" arn="arn:aws:iam::[ACCOUNT_ID]]:role/AmazonSSMRoleForInstancesQuickSetup" client="127.0.0.1:50132" error="ARN is not mapped: arn:aws:iam::[ACCOUNT_ID]:role/amazonssmroleforinstancesquicksetup" method=POST path=/authenticate
I confirmed via the IAM console that the role does exist.
Clearly this node is reporting NotReady because of these authentication failures.
Is this some authentication token that timed out after approximately 30 minutes, and if so, shouldn't a new token be requested automatically? Or am I supposed to set up something else?
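For context, the kubelet on these nodes authenticates through an exec plugin in its kubeconfig, which I would expect to mint a fresh token on every refresh. On the EKS-optimized AMI, /var/lib/kubelet/kubeconfig looks roughly like this (cluster name illustrative):

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: https://[X]
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubelet
  name: kubelet
current-context: kubelet
users:
- name: kubelet
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: /usr/bin/aws-iam-authenticator
      args:
        - "token"
        - "-i"
        - "demo-cluster"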
I was surprised that a fresh cluster created by eksctl would show this problem.
What did I miss?
Best Answer
These are the steps I followed to resolve this issue...
Connect to the failing instance via SSH.
Execute "aws sts get-caller-identity"
Note the ARN of the user; it will likely be something like this: arn:aws:sts::999999999999:assumed-role/AmazonSSMRoleForInstancesQuickSetup/i-00000000000ffffff
Note that the role here is AmazonSSMRoleForInstancesQuickSetup. This seems wrong to me - but AFAIK I followed the guides to the letter when creating the cluster.
Issues so far:
a) Why is this role being used for the AWS identity?
b) If this is the right role, why is it successful at first and only fails 30 minutes after cluster creation?
c) If this is the right role, what access rights are missing?
Personally, this feels like the wrong role to me, but I solved my problem by addressing point (c).
Continuing the steps...
Attach a policy granting the missing access rights to that role in the usual way. I admit this may be granting more privileges than needed, but that's for another day.
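For illustration, attaching a managed policy to the role from the CLI looks like this; exactly which policy is required is the open question from point (c), and AmazonEKSWorkerNodePolicy is shown here only as an assumption:

aws iam attach-role-policy \
    --role-name AmazonSSMRoleForInstancesQuickSetup \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy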
At this point, the AWS security configuration should now be correct, but this is not the end of the story.
This mapping of IAM roles into the cluster is maintained by editing a Kubernetes ConfigMap.
Edit the configmap with "kubectl edit -n kube-system configmap/aws-auth".
This is the configuration immediately after creating the cluster, before making any changes:
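(Reproduced below as the representative default that eksctl generates; ARNs and names redacted.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/eksctl-demo-cluster-nodegroup-NodeInstanceRole-[X]
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes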
The only role mapped here is the node instance role - this role was created automatically during the provisioning of the cluster via eksctl.
I have mapped the AmazonSSMRoleForInstancesQuickSetup role as a Kubernetes masters role. I have also mapped the MyDemoEKSRole cluster security role, previously created for cluster provisioning, to the various Kubernetes roles, for the case where Kubernetes is being invoked by a CodeBuild pipeline.
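A sketch of what the edited mapRoles section looks like after those changes (ARNs redacted; the exact username and group values on the two added mappings are illustrative rather than copied from my cluster):

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/eksctl-demo-cluster-nodegroup-NodeInstanceRole-[X]
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/AmazonSSMRoleForInstancesQuickSetup
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
        - system:masters
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/MyDemoEKSRole
      username: codebuild
      groups:
        - system:masters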
Conclusion:
After executing all of these cluster post-creation steps, my authentication failures ceased, and the cluster started reporting a successful status again, clearing the health check and returning the nodes to a Ready status. I freely admit this might not be the "right" way to solve my issue, and it feels like I opened up the security way more than I should have, but it definitely worked and solved my problem.
As mentioned, shortly after this we transitioned to Azure instead of AWS, so I never took this any further - but I did end up with a fully working cluster, with no more expiring credentials.
Naively, I suppose I expected the tools to create a working cluster for me. There was no mention of this issue or these steps in any guide that I found.