I have a GKE (Kubernetes) cluster of 3-6 f1-micro instances that is plagued by the nodes constantly being recreated.
Looking at the cluster now, it is currently scaled to 3 instances with the following uptimes: 10 hrs, 3 hrs, 1 hr.
Why are my instances constantly churning? How can I debug the 'why' of instances constantly being added and removed from the instance group?
The instances are NOT preemptible. I notice in GCP that they do have "Automatic restart" set in the Availability section.
Any help much appreciated.
Additional info:
My suspicion is that the reason I'm seeing this is trying to run GKE on f1-micro instances. I've switched to a g1-small instance instead and it already seems more stable.
I notice that in the Stackdriver Monitoring Overview (http://app.google.stackdriver.com/) I have a lot of "gke-my-instance-xzy in cluster X is not ready" messages in the Events box. This is the first place in the logs I've managed to find such a message, so I'd conclude that the instances are reporting unhealthy at some layer and eventually getting killed. I also see recreateInstance (or something similar) in the logs frequently.
Which logs to look in to find the right health check I couldn't determine. I did notice this in one set of logs: --eviction-hard=memory.available<100Mi
If that means the kubelet starts hard-evicting pods when the node has less than 100 MiB of available memory, then I imagine I was hitting that. I still can't find any 'health check failed' type messages in any logs.
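To check whether memory pressure is what is marking a node NotReady, the node's conditions and events can be inspected directly. A diagnostic sketch; the node name below is a placeholder to be replaced with one from your cluster:

```shell
# List nodes and their readiness status
kubectl get nodes

# Inspect one node's conditions (MemoryPressure, Ready, ...) and recent events
kubectl describe node gke-my-instance-xzy

# The node's allocatable memory already subtracts the kubelet's
# --eviction-hard reservation, so it shows how little headroom remains
kubectl get node gke-my-instance-xzy -o jsonpath='{.status.allocatable.memory}'
```

On an f1-micro (roughly 600 MB of RAM), system daemons plus the kubelet's 100 MiB eviction threshold leave very little for pods, which would be consistent with the nodes repeatedly going NotReady.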
Additional info:
I have confirmed that moving up to g1-small instances makes all the instability go away. It would seem that running GKE on f1-micro instances isn't currently a good idea.
I'm leaving the question open because it is about how I could debug why my f1-micro instances were being recreated so frequently and Sunny's answer didn't lead me to a 'why' message anywhere in the logs.
Solution
In a practical sense, moving to a larger node size solved the problem as noted above. In the comments, @Daniel provided a link to a page with the command needed to view the relevant logs:
gcloud container operations list
I can see in the output of this command all the auto-repair events that were killing my nodes.
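For reference, the listing can also be narrowed to the repair events themselves. The filter expression below is an assumption on my part (the exact operation-type value may differ by gcloud version), so verify it against the TYPE column of your own unfiltered output:

```shell
# List all cluster operations (upgrades, repairs, resizes, ...)
gcloud container operations list

# Narrow to automatic node repairs; AUTO_REPAIR_NODES is an assumed
# value -- check the TYPE column of the unfiltered output first
gcloud container operations list --filter="operationType=AUTO_REPAIR_NODES"
```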
Best Answer
Automatic restart is an optional availability policy you can set on the instance template used by the instance group. When set, it allows Compute Engine to automatically restart VM instances that are terminated for non-user-initiated reasons (maintenance events, hardware failure, software failure, etc.), so it is unlikely that this policy is why instances are being added to and removed from your instance group.
As described in the following documentation, when your applications require additional compute resources, managed instance groups can automatically scale the number of instances in the group.
In addition, Managed instance groups can automatically identify and recreate unhealthy instances in a group to ensure that all of the instances are running optimally.
Finally:
You may also view the autoscaler logs as described in the public documentation to confirm whether the autoscaler is behind this behavior.
OR
Convert the "Filter by label" field in Stackdriver to an advanced filter and define the following filters.
For a list of all instances created by the autoscaler:
AND
For deleted instances.
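As a sketch of what such advanced filters can look like (the exact methodName values are assumptions on my part; check the admin activity audit log entries in your project for the precise names):

```
resource.type="gce_instance"
protoPayload.methodName="v1.compute.instances.insert"
```

and, for deletions:

```
resource.type="gce_instance"
protoPayload.methodName="v1.compute.instances.delete"
```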