I am using eksctl to set up a cluster on EKS/AWS.
Following the guide in the EKS documentation, I use default values for pretty much everything.
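I created the cluster with essentially the guide's default command, something like this (the cluster name and region here are illustrative):

eksctl create cluster --name demo-cluster --region eu-west-1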
The cluster is created successfully, I update the Kubernetes configuration from the cluster, and I can run the various kubectl commands successfully – e.g. "kubectl get nodes" shows me the nodes are in the "Ready" state.
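For example (names redacted, output representative):

$ kubectl get nodes
NAME                                STATUS   ROLES    AGE   VERSION
ip-[x].eu-west-1.compute.internal   Ready    <none>   2m    v1.14.x
ip-[x].eu-west-1.compute.internal   Ready    <none>   2m    v1.14.x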
I do not touch anything else; I have a clean, out-of-the-box cluster with no other changes made, and so far everything appears to be working as expected. I don't deploy any applications to it; I just leave it alone.
The problem is that after a relatively short period of time (roughly 30 minutes after the cluster is created), the nodes change from "Ready" to "NotReady", and they never recover.
The event log shows this (I redacted the IPs):
LAST SEEN TYPE REASON OBJECT MESSAGE
22m Normal Starting node/ip-[x] Starting kubelet.
22m Normal NodeHasSufficientMemory node/ip-[x] Node ip-[x] status is now: NodeHasSufficientMemory
22m Normal NodeHasNoDiskPressure node/ip-[x] Node ip-[x] status is now: NodeHasNoDiskPressure
22m Normal NodeHasSufficientPID node/ip-[x] Node ip-[x] status is now: NodeHasSufficientPID
22m Normal NodeAllocatableEnforced node/ip-[x] Updated Node Allocatable limit across pods
22m Normal RegisteredNode node/ip-[x] Node ip-[x] event: Registered Node ip-[x] in Controller
22m Normal Starting node/ip-[x] Starting kube-proxy.
21m Normal NodeReady node/ip-[x] Node ip-[x] status is now: NodeReady
7m34s Normal NodeNotReady node/ip-[x] Node ip-[x] status is now: NodeNotReady
Same events for the other node in the cluster.
Connecting to the instance and inspecting /var/log/messages shows this at the same time the node goes to NotReady:
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.259207 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.385044 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621271 3896 reflector.go:270] object-"kube-system"/"aws-node-token-bdxwv": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621320 3896 reflector.go:270] object-"kube-system"/"coredns": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.638850 3896 reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1beta1.RuntimeClass: the server has asked for the client to provide credentials (get runtimeclasses.node.k8s.io)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.707074 3896 reflector.go:270] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.711386 3896 reflector.go:270] object-"kube-system"/"coredns-token-67fzd": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.714899 3896 reflector.go:270] object-"kube-system"/"kube-proxy-config": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.720884 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868003 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868067 3896 controller.go:125] failed to ensure node lease exists, will retry in 200ms, error: Get https://[X]/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/ip-[x]?timeout=10s: write tcp 192.168.91.167:50866->34.249.27.158:443: use of closed network connection
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017157 3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017182 3896 kubelet_node_status.go:372] Unable to update node status: update node status exceeds retry count
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.200053 3896 controller.go:125] failed to ensure node lease exists, will retry in 400ms, error: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.517193 3896 reflector.go:270] object-"kube-system"/"kube-proxy": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.729756 3896 controller.go:125] failed to ensure node lease exists, will retry in 800ms, error: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.752267 3896 reflector.go:126] object-"kube-system"/"aws-node-token-bdxwv": Failed to list *v1.Secret: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.824988 3896 reflector.go:126] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.899566 3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963756 3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Unauthorized
Mar 7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963822 3896 reflector.go:126] object-"kube-system"/"kube-proxy-config": Failed to list *v1.ConfigMap: Unauthorized
CloudWatch logs for the authenticator component show many of these messages:
time="2020-03-07T10:40:37Z" level=warning msg="access denied" arn="arn:aws:iam::[ACCOUNT_ID]]:role/AmazonSSMRoleForInstancesQuickSetup" client="127.0.0.1:50132" error="ARN is not mapped: arn:aws:iam::[ACCOUNT_ID]:role/amazonssmroleforinstancesquicksetup" method=POST path=/authenticate
I confirmed via the IAM console that the role does exist.
Clearly this node is reporting NotReady because of these authentication failures.
Is this some authentication token that timed out after approximately 30 minutes, and if so, shouldn't a new token be requested automatically? Or am I supposed to set up something else?
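For context, the kubelet on these nodes authenticates through an exec plugin in its kubeconfig, which I would expect to mint a fresh token on every refresh. On the EKS-optimized AMI, /var/lib/kubelet/kubeconfig looks roughly like this (cluster name illustrative):

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: https://[X]
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubelet
  name: kubelet
current-context: kubelet
users:
- name: kubelet
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: /usr/bin/aws-iam-authenticator
      args:
        - "token"
        - "-i"
        - "demo-cluster"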
I was surprised that a fresh cluster created by eksctl would show this problem.
What did I miss?
Best Answer
These are the steps I followed to resolve this issue...
Connect to the failing instance via SSH.
Execute "aws sts get-caller-identity"
Note the ARN of the user; it will likely be something like this: arn:aws:sts::999999999999:assumed-role/AmazonSSMRoleForInstancesQuickSetup/i-00000000000ffffff
Note that the role here is AmazonSSMRoleForInstancesQuickSetup. This seems wrong to me - but AFAIK I followed the guides to the letter when creating the cluster.
Issues so far:
a) Why is this role being used for the AWS identity?
b) If this is the right role, why is it successful at first and only fails 30 minutes after cluster creation?
c) If this is the right role, what access rights are missing?
Personally, this feels like the wrong role to me, but I solved my problem by addressing point (c).
Continuing the steps...
Attach a policy granting the missing access rights to that role in the usual way. I admit this may be granting more privileges than needed, but that's for another day.
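For illustration, attaching a managed policy to the role from the CLI looks like this; exactly which policy is required is the open question from point (c), and AmazonEKSWorkerNodePolicy is shown here only as an assumption:

aws iam attach-role-policy \
    --role-name AmazonSSMRoleForInstancesQuickSetup \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy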
At this point, the AWS security configuration should now be correct, but this is not the end of the story.
This mapping of IAM roles into the cluster is maintained by editing a Kubernetes ConfigMap.
Edit the configmap with "kubectl edit -n kube-system configmap/aws-auth".
This is the configuration immediately after creating the cluster, before making any changes:
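(Reproduced below as the representative default that eksctl generates; ARNs and names redacted.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/eksctl-demo-cluster-nodegroup-NodeInstanceRole-[X]
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes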
The only role mapped here is the node instance role - this role was created automatically during the provisioning of the cluster via eksctl.
I have mapped the AmazonSSMRoleForInstancesQuickSetup role as a Kubernetes masters role. I have also mapped the MyDemoEKSRole cluster security role, previously created for cluster provisioning, to the various Kubernetes roles, for the case where Kubernetes is being invoked by a CodeBuild pipeline.
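A sketch of what the edited mapRoles section looks like after those changes (ARNs redacted; the exact username and group values on the two added mappings are illustrative rather than copied from my cluster):

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/eksctl-demo-cluster-nodegroup-NodeInstanceRole-[X]
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/AmazonSSMRoleForInstancesQuickSetup
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
        - system:masters
    - rolearn: arn:aws:iam::[ACCOUNT_ID]:role/MyDemoEKSRole
      username: codebuild
      groups:
        - system:masters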
Conclusion:
After executing all of these cluster post-creation steps, my authentication failures ceased, and the cluster started reporting a successful status again, clearing the health check and returning the nodes to a Ready status. I freely admit this might not be the "right" way to solve my issue, and it feels like I opened up the security way more than I should have, but it definitely worked and solved my problem.
As mentioned, shortly after this we transitioned to Azure instead of AWS, so I never took this any further - but I did end up with a fully working cluster, with no more expiring credentials.
Naively, I suppose I expected the tools to create a working cluster for me. There was no mention of this issue or these steps in any guide that I found.