I am using multiple GKE managed clusters on version 1.14.8-gke.12
in a shared VPC setting.
Suddenly, one of my clusters has stopped reporting proper metrics for HPA. The metrics server is up and running, but this is the output for the HPA:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
nginx-public-nginx-ingress-controller Deployment/nginx-public-nginx-ingress-controller <unknown>/50%, <unknown>/50% 2 11 2 93m
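For anyone reproducing this, a quick way to confirm whether the metrics pipeline is serving data at all is with standard kubectl commands (replace the HPA name and namespace with your own):

```shell
# Does the resource-metrics API answer at all?
kubectl top nodes
kubectl top pods

# The Events section of the HPA usually states why a target shows <unknown>,
# e.g. "unable to get metrics for resource cpu".
kubectl describe hpa nginx-public-nginx-ingress-controller

# Query the metrics.k8s.io API directly, bypassing kubectl top formatting.
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
```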
Checking the default metrics-server installation on gke, I saw the following in logs:
E1221 18:53:13.491188 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:NODE_NAME: unable to fetch metrics from Kubelet NODE_NAME (NODE_IP): Get http://NODE_IP:10255/stats/summary/: context deadline exceeded
E1221 18:53:43.421617 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:NODE_NAME: unable to fetch metrics from Kubelet NODE_NAME (NODE_IP): Get http://NODE_IP:10255/stats/summary/: dial tcp NODE_IP:10255: i/o timeout
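For reference, these metrics-server logs can be pulled with plain kubectl (the `k8s-app=metrics-server` label is what GKE's bundled deployment typically uses, but verify it on your cluster):

```shell
# Locate the metrics-server pod in kube-system.
kubectl -n kube-system get pods -l k8s-app=metrics-server

# Tail its logs; errors about kubelet scraping show up here.
kubectl -n kube-system logs -l k8s-app=metrics-server --tail=50
```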
Running a curl on the said address manually gives me all data within 10 milliseconds. I've checked the network configurations and both the pod network range as well as the node network range have access to this port.
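For completeness, this is roughly how that manual check can be done; curl's `-w` timing output makes the ~10 ms observation measurable (NODE_IP is a placeholder):

```shell
# Fetch the kubelet summary endpoint and print only the total request time.
# Run this from a pod or node inside the cluster network, since port 10255
# is normally not reachable from outside.
curl -s -o /dev/null -w '%{time_total}s\n' http://NODE_IP:10255/stats/summary
```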
Questions:
- What is the default timeout on metrics-server? Can we change it on a Google-managed cluster?
- This is a production cluster and I am unable to replicate this issue on any other cluster, but could disabling Google's Horizontal Pod Autoscaling support and installing metrics-server manually help here?
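If that route is taken, the addon can be toggled with gcloud and a community metrics-server applied on top; a rough sketch, where the cluster name, zone, and manifest URL are assumptions to adapt:

```shell
# Disable GKE's managed Horizontal Pod Autoscaling addon.
# CLUSTER_NAME and ZONE are placeholders; note this may also affect the
# bundled metrics pipeline, so test outside production first.
gcloud container clusters update CLUSTER_NAME \
  --zone ZONE \
  --update-addons=HorizontalPodAutoscaling=DISABLED

# Install the upstream metrics-server manually (pin a release you trust
# rather than "latest" for a production cluster).
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```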
Additionally, and quite expectedly, upgrading to 1.15 didn't help here.
Best Answer
At first, I'd recommend checking whether you still have the default firewall rules under VPC network -> Firewall rules, to be sure that all metric requests can pass through your firewall. Then try to reach each node of your cluster using curl and fetch the metrics. After that, look for logs in Stackdriver -> Logging with a filter like this:

and with an additional line:

and share them here.
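Both checks can also be done from the CLI; a sketch, where the logging filter is only an assumption of the typical shape for GKE container logs and should be adapted to your project:

```shell
# List firewall rules and check that the GKE defaults for the shared VPC
# (e.g. rules allowing TCP 10255 from the pod/node ranges) are still present.
gcloud compute firewall-rules list

# Read recent metrics-server container logs via Cloud Logging.
# The filter below is a guess at the usual resource labels; adjust as needed.
gcloud logging read 'resource.type="k8s_container" AND resource.labels.container_name="metrics-server"' --limit 20
```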