I am using multiple GKE managed clusters on version 1.14.8-gke.12
in a shared VPC setting.
Suddenly, one of my clusters has stopped reporting proper metrics for HPA. The metrics server is up and running, but this is the output for the HPA:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
nginx-public-nginx-ingress-controller Deployment/nginx-public-nginx-ingress-controller <unknown>/50%, <unknown>/50% 2 11 2 93m
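For anyone reproducing this, a quick way to confirm whether the metrics pipeline is serving data at all is with standard kubectl commands (replace the HPA name and namespace with your own):

```shell
# Does the resource-metrics API answer at all?
kubectl top nodes
kubectl top pods

# The Events section of the HPA usually states why a target shows <unknown>,
# e.g. "unable to get metrics for resource cpu".
kubectl describe hpa nginx-public-nginx-ingress-controller

# Query the metrics.k8s.io API directly, bypassing kubectl top formatting.
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
```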
Checking the default metrics-server installation on gke, I saw the following in logs:
E1221 18:53:13.491188 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:NODE_NAME: unable to fetch metrics from Kubelet NODE_NAME (NODE_IP): Get http://NODE_IP:10255/stats/summary/: context deadline exceeded
E1221 18:53:43.421617 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:NODE_NAME: unable to fetch metrics from Kubelet NODE_NAME (NODE_IP): Get http://NODE_IP:10255/stats/summary/: dial tcp NODE_IP:10255: i/o timeout
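For reference, these metrics-server logs can be pulled with plain kubectl (the `k8s-app=metrics-server` label is what GKE's bundled deployment typically uses, but verify it on your cluster):

```shell
# Locate the metrics-server pod in kube-system.
kubectl -n kube-system get pods -l k8s-app=metrics-server

# Tail its logs; errors about kubelet scraping show up here.
kubectl -n kube-system logs -l k8s-app=metrics-server --tail=50
```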
Running a curl on the said address manually gives me all data within 10 milliseconds. I've checked the network configurations and both the pod network range as well as the node network range have access to this port.
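For completeness, this is roughly how that manual check can be done; curl's `-w` timing output makes the ~10 ms observation measurable (NODE_IP is a placeholder):

```shell
# Fetch the kubelet summary endpoint and print only the total request time.
# Run this from a pod or node inside the cluster network, since port 10255
# is normally not reachable from outside.
curl -s -o /dev/null -w '%{time_total}s\n' http://NODE_IP:10255/stats/summary
```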
Questions:
- What is the default timeout on metrics-server? Can we change it on a Google-managed cluster?
- This is a production cluster and I am unable to replicate this issue on any other cluster, but could disabling Google's Horizontal Pod Autoscaling support and installing metrics-server manually help here?
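If that route is taken, the addon can be toggled with gcloud and a community metrics-server applied on top; a rough sketch, where the cluster name, zone, and manifest URL are assumptions to adapt:

```shell
# Disable GKE's managed Horizontal Pod Autoscaling addon.
# CLUSTER_NAME and ZONE are placeholders; note this may also affect the
# bundled metrics pipeline, so test outside production first.
gcloud container clusters update CLUSTER_NAME \
  --zone ZONE \
  --update-addons=HorizontalPodAutoscaling=DISABLED

# Install the upstream metrics-server manually (pin a release you trust
# rather than "latest" for a production cluster).
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```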
Additionally, and quite expectedly, upgrading to 1.15 didn't help here.
Best Answer
At first, I'd recommend checking whether you still have the default firewall rules under VPC network -> Firewall rules, to be sure that all metric requests can pass through your firewall. Then try to reach each node of your cluster using curl and fetch the metrics. After that, look for logs in Stackdriver -> Logging with a filter like this:

and with an additional line:

and share them here.
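Both checks can also be done from the CLI; a sketch, where the logging filter is only an assumption of the typical shape for GKE container logs and should be adapted to your project:

```shell
# List firewall rules and check that the GKE defaults for the shared VPC
# (e.g. rules allowing TCP 10255 from the pod/node ranges) are still present.
gcloud compute firewall-rules list

# Read recent metrics-server container logs via Cloud Logging.
# The filter below is a guess at the usual resource labels; adjust as needed.
gcloud logging read 'resource.type="k8s_container" AND resource.labels.container_name="metrics-server"' --limit 20
```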