Google Kubernetes Engine – Node Pool Autoscaling Issues


I am trying to run a machine learning job on GKE, and need to use a GPU.

I created a node pool with Tesla K80, as described in this walkthrough.

I set the minimum node count to 0, hoping that the autoscaler would automatically determine how many nodes I needed based on my jobs:

gcloud container node-pools create [POOL_NAME] \
--accelerator type=nvidia-tesla-k80,count=1 --zone [COMPUTE_ZONE] \
--cluster [CLUSTER_NAME] --num-nodes 3 --min-nodes 0 --max-nodes 5 \
--enable-autoscaling

Initially, there are no jobs that require GPUs, so the cluster autoscaler correctly downsizes the node pool to 0.

However, when I create a job with the following resource specification

  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"

Here is the full job configuration. (Please note that this configuration is partially auto-generated. I have also removed some environment variables that are not pertinent to the issue).

the pod is stuck in Pending with Insufficient nvidia.com/gpu until I manually increase the node pool to at least 1 node.

Is this a current limitation of GPU node pools, or did I overlook something?

The cluster autoscaler supports scaling GPU nodepools (including to and from 0).

One possible reason for this problem is that you have enabled Node Auto-Provisioning (NAP) and set resource limits (via the UI or gcloud flags such as --max-cpu, --max-memory, etc.). Those limits apply to ALL autoscaling in the cluster, including nodepools you created manually with autoscaling enabled (see the note in the documentation).

In particular, if you have enabled NAP and you want to autoscale nodepools with GPUs, you need to set resource limits for GPUs as described in the documentation.

Finally, autoprovisioning also supports GPUs, so (assuming you set the resource limits as described above) you don't actually need to create a nodepool for your GPU workload - NAP will create one for you automatically.
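As a rough sketch, the GPU resource limits mentioned above could be set with gcloud along these lines. The function name set_nap_gpu_limits and the CPU/memory numbers (64 cores, 256 GB) are placeholders made up for illustration, not recommendations; check gcloud container clusters update --help for the flags available in your gcloud version.

```shell
# Hypothetical helper: enable NAP and set cluster-wide resource limits,
# including a cap on K80 GPUs so GPU nodepools are allowed to scale up.
set_nap_gpu_limits() {
  cluster="$1"    # cluster name
  zone="$2"       # compute zone of the cluster
  max_gpus="$3"   # cluster-wide maximum number of K80 GPUs
  # 64 CPUs / 256 GB are placeholder limits (assumptions), not recommendations.
  gcloud container clusters update "$cluster" \
    --zone "$zone" \
    --enable-autoprovisioning \
    --max-cpu 64 \
    --max-memory 256 \
    --max-accelerator "type=nvidia-tesla-k80,count=$max_gpus"
}
```

For example, set_nap_gpu_limits [CLUSTER_NAME] [COMPUTE_ZONE] 4 would allow up to 4 K80 GPUs across the whole cluster.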


Also, for future reference - if autoscaler fails to create nodes for some of your pods, you can try to debug it using autoscaler events:

  • On your pod (kubectl describe pod <your-pod>) there should be one of the 2 events (it may take a minute until they show up):
    • TriggeredScaleUp - this means the autoscaler decided to add a node for this pod.
    • NotTriggerScaleUp - the autoscaler spotted your pod, but it doesn't think any nodepool can be scaled up to help it. In 1.12 and later the event contains a list of reasons why adding nodes to different nodepools wouldn't help the pod. This is usually the most useful event for debugging.
  • kubectl get events -n kube-system | grep cluster-autoscaler will give you events describing all autoscaler actions (scale-up, scale-down). If a scale-up was attempted but failed for whatever reason, it will also have events describing that.
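The two checks above can be wrapped into small helpers, sketched here. The function names are hypothetical; the first one assumes the pod lives in the namespace you pass (or "default") and simply filters the pod's events for the two autoscaler event names.

```shell
# Hypothetical helper: show only the autoscaler's scale-up decisions for one pod.
pod_autoscaler_events() {
  pod="$1"             # pod name
  ns="${2:-default}"   # namespace, defaults to "default"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    | grep -E 'TriggeredScaleUp|NotTriggerScaleUp'
}

# Hypothetical helper: list cluster-wide autoscaler events
# (scale-up, scale-down, and failed scale-up attempts).
cluster_autoscaler_events() {
  kubectl get events -n kube-system | grep cluster-autoscaler
}
```

If pod_autoscaler_events <your-pod> prints nothing, the autoscaler may not have processed the pod yet, so give it a minute and retry.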

Note that events are only available in Kubernetes for 1 hour after they were created. You can see historical events in Stackdriver by going to the UI, navigating to Stackdriver->Logging->Logs and choosing "GKE Cluster Operations" in the drop-down.

Finally, you can check the current status of the autoscaler by running kubectl get configmap cluster-autoscaler-status -o yaml -n kube-system.
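If the full YAML is too noisy, a variation like the following prints only the status text. This is a sketch under one assumption: that the ConfigMap stores its report under a data key named status. The function name autoscaler_status is hypothetical.

```shell
# Hypothetical helper: print only the status report from the autoscaler's ConfigMap.
# Assumes the report is stored under the "status" data key.
autoscaler_status() {
  kubectl get configmap cluster-autoscaler-status -n kube-system \
    -o "jsonpath={.data.status}"
}
```

If the output is empty, fall back to the -o yaml form and look at which keys are actually present under data.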

