Google Compute Engine – Can’t Connect via SSH? VM Loses Network Access?

google-compute-engine

EDIT: The cause turned out to be an out-of-control application process, not GCE. The original question follows, and the answer is below:

I just had some kind of outage with my CE VM on a trial account, but I don't see any outage reported on the Google Compute Outage list.

I don't know how long it lasted because I don't know when it started. The behavior matches something that seemed to happen a few weeks ago: I lost the ability to log in with SSH from the Compute Engine dashboard until the VM was rebooted.

My test VM dropped my SSH connection sometime in the last day or so, and when I noticed today I was unable to reconnect. I then tried the "SSH" connect option on the Compute Engine VM list, and that failed too. The only thing I could do was get a prompt on the serial console, but I didn't have a password-enabled account at all because I was relying on SSH (now fixed).

I had to stop the VM and restart it. After that I could connect using the "SSH" connect option on the VM list, although I could NOT connect from outside. I connected to the serial console and saw network error messages from various snaps trying to connect. I tried to SSH to a remote server from my SSH window into the VM, and initially could not. After a minute or so that worked, and suddenly remote connections worked again.

EDIT: I got a response to my support request from Google. They say I experienced a Live Migration event. That doesn't sound right: networking was disrupted for at least 10 minutes. I could connect to the serial console, and it seemed responsive. Only after rebooting, and after the Google management snaps failed to initialize, did it suddenly start working. Maybe a communication failure during boot triggered the migration event? I don't know.

EDIT: I removed my concerns about GCE's stability, since the infrastructure had nothing to do with the problem.

Best Answer

There are several possible reasons for this. I recommend checking the SSH troubleshooting document for more information about how to diagnose the issue.
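As a first check before working through the troubleshooting document, it can help to confirm from outside whether the VM's SSH port accepts TCP connections at all. The following is only a minimal sketch, not part of any Google tooling: the function name `probe_ssh` is made up for illustration, and it assumes the SSH daemon listens on the default port 22.

```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Return (reachable, banner) for a TCP connection attempt to host:port.

    reachable is False if the connection is refused or times out.
    banner is the first line the server sends (an SSH server normally
    sends something starting with "SSH-"), or "" if nothing arrives
    before the timeout.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out: the network path or the
        # listener is the problem, not authentication.
        return (False, "")
    with sock:
        try:
            banner = sock.recv(256)
        except OSError:
            # Connected, but the server did not send a banner in time.
            banner = b""
    return (True, banner.decode("ascii", errors="replace").strip())
```

Calling `probe_ssh("203.0.113.10")` (a placeholder address) and getting `(False, "")` points at networking or firewall rules. A successful connection with a banner suggests the daemon is up and the problem is more likely keys or accounts.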

This issue could also occur if the Linux guest environment did not initialize properly after the live migration. The guest environment includes a set of scripts and processes that read content from the metadata server and configure the virtual machine to run properly. It is possible that the SSH keys were not set up properly during the guest environment setup.

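You may also set the 'automaticRestart' field to 'true' as mentioned in this document. This automatically restarts your instance if it crashes or is stopped by a non-user-initiated event such as a hardware failure. A restart reruns the guest environment setup, which gives the SSH keys another chance to be set up correctly. Note that a live migration by itself does not restart the VM; how the instance behaves during host maintenance is controlled separately by the 'onHostMaintenance' field. Feel free to read the live migration documentation if you need further information about live migration in Google Cloud Platform.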
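One way to apply that setting is through the `gcloud` CLI. The sketch below only builds the command line so you can inspect it before running anything. The helper name, the instance name `test-vm`, and the zone `us-central1-a` are placeholders, and the flags are the ones I believe `gcloud compute instances set-scheduling` accepts, so confirm them with `--help` on your installed version.

```python
import shlex


def set_scheduling_command(instance, zone, automatic_restart=True,
                           on_host_maintenance="MIGRATE"):
    """Build a gcloud argv list that sets an instance's scheduling options.

    automatic_restart maps to the automaticRestart field;
    on_host_maintenance maps to onHostMaintenance (MIGRATE or TERMINATE).
    """
    if on_host_maintenance not in ("MIGRATE", "TERMINATE"):
        raise ValueError("on_host_maintenance must be MIGRATE or TERMINATE")
    restart_flag = ("--restart-on-failure" if automatic_restart
                    else "--no-restart-on-failure")
    return [
        "gcloud", "compute", "instances", "set-scheduling", instance,
        "--zone", zone,
        restart_flag,
        "--maintenance-policy", on_host_maintenance,
    ]


# Placeholder instance and zone; print the command rather than run it.
print(shlex.join(set_scheduling_command("test-vm", "us-central1-a")))
```

Once the printed command looks right, you can paste it into a shell or pass the list to `subprocess.run`.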