We created a GKE cluster and are getting errors from gke-metrics-agent. The errors show up roughly every 30 minutes, and it is always the same 62 errors.
All of the errors carry the label k8s-pod/k8s-app: "gke-metrics-agent".
The first error is:
error exporterhelper/queued_retry.go:245 Exporting failed. Try enabling retry_on_failure config option. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."
This error is followed by these errors, in order:
- go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
- /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/queued_retry.go:245
- go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
- /go/src/gke-logmon/gke-metrics-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/metrics.go:120
There are about 40 errors like this. Two that stand out are:
- error exporterhelper/queued_retry.go:175 Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures. {"kind": "exporter", "name": "googlecloud", "dropped_items": 19}
- warn batchprocessor/batch_processor.go:184 Sender failed {"kind": "processor", "name": "batch", "error": "rpc error: code = DeadlineExceeded desc = Deadline expired before operation could complete."}
I searched for these errors on Google but could not find anything. I can't even find any documentation for gke-metrics-agent.
Things I tried:
- check quotas
- update GKE to a newer version (current version is 1.21.3-gke.2001)
- update nodes
- disable all firewall rules
- give all permissions to k8s nodes
I can provide more information about our Kubernetes cluster, but I don't know what information would help solve this issue.
Best Answer
“Deadline exceeded” is a known issue. Metrics are sent to Cloud Monitoring by the GKE metrics agent, which is built on top of OpenTelemetry. There are currently two workarounds:
1. Update the timeout.
The new release includes a change that increases the default timeout from 5 to 12 seconds, so you might need to rebuild and redeploy the workload with the new version to fix this RPC error.
2. Use a higher GKE version. The fix is included in these gke-metrics-agent versions: 1.18.6-gke.6400+, 1.19.3-gke.600+, 1.20.0-gke.600+.
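To check a cluster against the version list in the second workaround, a small comparison can help. The following is only a sketch: the function names are hypothetical, the thresholds are copied from the list above, and it assumes that minor releases newer than 1.20 also carry the fix while minors older than 1.18 do not. Neither assumption is confirmed here.

```python
import re

# Minimum fixed builds per minor release, copied from the answer above.
# Key: (major, minor). Value: (patch, gke build number).
FIXED_BUILDS = {
    (1, 18): (6, 6400),
    (1, 19): (3, 600),
    (1, 20): (0, 600),
}

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Split a string such as '1.21.3-gke.2001' into four integers."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    major, minor, patch, build = (int(part) for part in match.groups())
    return major, minor, patch, build


def has_timeout_fix(version):
    """Return True if the version is at or above the listed fixed build.

    Assumption (not confirmed by the answer): minors newer than the
    highest listed one include the fix, and unlisted older minors do not.
    """
    major, minor, patch, build = parse_gke_version(version)
    if (major, minor) > max(FIXED_BUILDS):
        return True
    threshold = FIXED_BUILDS.get((major, minor))
    if threshold is None:
        return False
    # Tuples compare element by element: patch first, then build number.
    return (patch, build) >= threshold
```

Under these assumptions, `has_timeout_fix("1.21.3-gke.2001")` returns `True` for the version mentioned in the question, while `has_timeout_fix("1.19.3-gke.599")` returns `False`.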