GKE Metrics Agent Logging Errors – Troubleshooting Guide


We created a GKE cluster and are getting errors from gke-metrics-agent. The errors show up roughly every 30 minutes, and it is always the same 62 errors.

All the errors have label k8s-pod/k8s-app: "gke-metrics-agent".

The first error is:

error   exporterhelper/queued_retry.go:245  Exporting failed. Try enabling retry_on_failure config option.  {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."  

This error is followed by these entries, in order:

  • go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
  • /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:245
  • go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
  • /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/metrics.go:120

There are about 40 errors like this. Two errors that stand out are:

- error exporterhelper/queued_retry.go:175  Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures.  {"kind": "exporter", "name": "googlecloud", "dropped_items": 19}

- warn  batchprocessor/batch_processor.go:184   Sender failed   {"kind": "processor", "name": "batch", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}

I searched for these errors on Google but could not find anything. I can't even find any documentation for gke-metrics-agent.

Things I tried:

  • check quotas
  • update GKE to a newer version (current version is 1.21.3-gke.2001)
  • update nodes
  • disable all firewall rules
  • give all permissions to k8s nodes

I can provide more information about our Kubernetes cluster, but I don't know what information would help solve this issue.

Best Answer

“Deadline exceeded” is a known issue. Metrics are sent to Cloud Monitoring by the GKE metrics agent (gke-metrics-agent), which is built on top of OpenTelemetry. There are currently two workarounds:

1. Update the timeout.

A newer release includes a change that increases the default timeout from 5 to 12 seconds, so you may need to rebuild and redeploy the workload with the new version, which could fix this RPC error.

2. Use a higher GKE version. The issue is fixed in these gke-metrics-agent versions: 1.18.6-gke.6400+, 1.19.3-gke.600+, 1.20.0-gke.600+.
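
If you want to compare a cluster version against the builds listed above, a small script can do it. This is only a rough sketch: the helper names (`parse_gke_version`, `has_timeout_fix`) are made up for illustration, and it assumes that minor lines newer than 1.20 inherit the fix while lines older than 1.18 do not, which the list above does not state explicitly.

```python
import re

# Minimum fixed builds quoted above, keyed by (major, minor).
# Values are (patch, gke_build).
FIXED_BUILDS = {
    (1, 18): (6, 6400),  # 1.18.6-gke.6400
    (1, 19): (3, 600),   # 1.19.3-gke.600
    (1, 20): (0, 600),   # 1.20.0-gke.600
}

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Split a string like '1.21.3-gke.2001' into (1, 21, 3, 2001)."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_timeout_fix(version):
    """Return True if the version is at or above the listed build for its minor line.

    Assumption (not stated in the list above): minor lines newer than the
    newest listed one carry the fix, and unlisted older lines do not.
    """
    major, minor, patch, build = parse_gke_version(version)
    if (major, minor) > max(FIXED_BUILDS):
        return True
    threshold = FIXED_BUILDS.get((major, minor))
    if threshold is None:
        return False
    return (patch, build) >= threshold


if __name__ == "__main__":
    for candidate in ("1.18.6-gke.6399", "1.19.3-gke.600", "1.21.3-gke.2001"):
        print(candidate, has_timeout_fix(candidate))
```

Under those assumptions, `has_timeout_fix("1.18.6-gke.6399")` is `False`, while `"1.19.3-gke.600"` and `"1.21.3-gke.2001"` both return `True`.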
