GCP f1-micro instances only last a few hours before being replaced

google-cloud-platform  google-compute-engine  google-kubernetes-engine

I have a 3-6 node GKE (Kubernetes) cluster of f1-micro instances that is plagued by the instances constantly being re-created.

Looking at the cluster now it is currently scaled to 3 instances with the following uptimes: 10hrs, 3hrs, 1hr.

Why are my instances constantly churning? How can I debug the 'why' of instances constantly being added and removed from the instance group?

The instances are NOT preemptible. I notice in GCP they do have automatic restart set in the "availability" section.

Any help much appreciated.

Additional info:

My suspicion is that the reason I'm seeing this is trying to run GKE on f1-micro instances. I've switched to a g1-small instance instead and it already seems more stable.

I notice that in the Stackdriver Monitoring Overview (http://app.google.stackdriver.com/) I have a lot of "gke-my-instance-xzy in cluster X is not ready" messages in the Events box. This is the first place in the logs I've managed to find such a message, so I'd conclude that the instances are reporting unhealthy at some layer and eventually getting killed. I see recreateInstance (or something similar) in the logs frequently.
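A minimal way to cross-check the "not ready" reports from the Kubernetes side (a sketch, assuming kubectl access to the cluster; the node name is just the placeholder from above):

# Show the Ready/NotReady status of every node in the cluster.
kubectl get nodes

# Inspect the node's conditions (MemoryPressure, DiskPressure, Ready, ...)
# and its recent events.
kubectl describe node gke-my-instance-xzy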

Which logs to look in to find the right health check I couldn't determine. I did notice --eviction-hard=memory.available<100Mi in one set of logs; if that means instances are hard shut down when they have less than 100MB of memory available, then I imagine I was hitting that. I still can't see any 'health check failed' type messages in any logs.
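A rough way to check whether memory pressure is the culprit (again a sketch; the grep patterns are only illustrative):

# How much memory the node actually leaves for pods; an f1-micro has only
# 0.6 GB in total, so once system daemons are accounted for a 100Mi
# eviction threshold is easy to hit.
kubectl describe node gke-my-instance-xzy | grep -A 6 Allocatable

# Look for eviction / memory-pressure events recorded by the kubelet.
kubectl get events --all-namespaces | grep -iE 'evict|memorypressure|oom'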

Additional info:

I have confirmed that after moving up to the small instance size all the instability goes away. It would seem running GKE on f1-micro instances isn't currently a good idea.

I'm leaving the question open because it is about how I could debug why my f1-micro instances were being recreated so frequently and Sunny's answer didn't lead me to a 'why' message anywhere in the logs.

Solution

In a practical sense, moving to a larger node size solved the problem, as noted above. In the OP comments @Daniel provided a link to a page with the command needed to view the relevant logs:


gcloud container operations list

I can see in the output of this command all the auto-repair events that were killing my nodes.
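For reference, a filtered variant of the same command (a sketch only: --filter and --sort-by are standard gcloud list flags, but the exact operation type string for auto-repair may differ, so check the TYPE column of the unfiltered output first):

# Show the most recent cluster operations, newest first.
gcloud container operations list --sort-by="~startTime" --limit=20

# Narrow the output to auto-repair operations (the operation type is
# assumed to contain "AUTO_REPAIR"; verify against your own output).
gcloud container operations list --filter="operationType:AUTO_REPAIR"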

Best Answer

Automatic restart is an optional availability policy you can set on the instance template used by the instance group. When set, it allows Compute Engine to automatically restart VM instances if they are terminated for non-user-initiated reasons (a maintenance event, hardware failure, software failure, etc.), so it is unlikely that this policy is the reason instances are getting added to and removed from your instance group.
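To confirm how those availability policies are actually set on a node, one option (a sketch; INSTANCE_NAME and ZONE are placeholders for your own values) is to read the instance's scheduling block:

# Print whether automatic restart is enabled and whether the instance is
# preemptible.
gcloud compute instances describe INSTANCE_NAME --zone ZONE \
    --format="yaml(scheduling.automaticRestart, scheduling.preemptible)"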

As described in the following documentation, when your applications require additional compute resources, managed instance groups can automatically scale the number of instances in the group.

In addition, managed instance groups can automatically identify and recreate unhealthy instances in a group to ensure that all of the instances are running optimally.
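Whether auto-healing (and which health check) is configured on the node pool's instance group can be checked directly; a sketch with a placeholder group name and zone:

# Show the instance template, target size and any auto-healing policy
# (including the health check it uses) for the managed instance group.
gcloud compute instance-groups managed describe GROUP_NAME --zone ZONE \
    --format="yaml(instanceTemplate, targetSize, autoHealingPolicies)"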

Finally:

If an instance in the group stops, crashes, or is deleted by an action other than the instance groups commands, the managed instance group automatically recreates the instance so it can resume its processing tasks. The recreated instance uses the same name and the same instance template as the previous instance, even if the group references a different instance template.
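The action the group is currently taking on each instance (for example RECREATING) can also be watched directly; again a sketch with placeholder names:

# List each instance in the group together with the action currently being
# performed on it (NONE, CREATING, RECREATING, DELETING, ...).
gcloud compute instance-groups managed list-instances GROUP_NAME --zone ZONE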

You may also view the autoscaler logs, as described in this public documentation, to confirm whether the autoscaler is behind this behavior.

OR

Convert the "Filter by label field in Stackdriver to advanced filter and define the following filters

resource.type="gce_autoscaler"
protoPayload.methodName="v1.compute.autoscalers.insert"

For a list of all instances created by the autoscaler,

AND

resource.type="gce_autoscaler"
protoPayload.methodName="v1.compute.autoscalers.delete"

For deleted instances.
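If it is auto-repair or auto-healing rather than the autoscaler that is recreating the nodes, a similar advanced filter on the instance group manager's audit logs may surface the recreate calls the OP mentioned. The resource type and method name below are assumptions, so treat this as a starting point rather than an exact filter:

resource.type="gce_instance_group_manager"
protoPayload.methodName:"recreateInstances"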
