Google Cloud – Failed Instance in Google Compute Engine

google-cloud-platform, google-compute-engine, linux

I have a GCE instance that has been running for several years. During the night, the instance was restarted with the following logs:

2022-02-13 04:46:36.370 CET compute.instances.hostError Instance terminated by Compute Engine.
2022-02-13 04:47:08.279 CET compute.instances.automaticRestart Instance automatically restarted by Compute Engine.

However, the instance did not come back up.

I can connect to the serial console where I see this:

serialport: Connected to ***.europe-west1-b.*** port 1 (
[ TIME ] Timed out waiting for device ***
[DEPEND] Dependency failed for File… ***.
[DEPEND] Dependency failed for /data.
[DEPEND] Dependency failed for Local File Systems.
[  OK  ] Stopped Dispatch Password …ts to Console Directory Watch.
[  OK  ] Stopped Forward Password R…uests to Wall Directory Watch.
[  OK  ] Reached target Timers.
         Starting Raise network interfaces...
[  OK  ] Closed Syslog Socket.
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Sockets.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
         Starting Create Volatile Files and Directories...
[  OK  ] Finished Create Volatile Files and Directories.
         Starting Network Time Synchronization...
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Finished Update UTMP about System Boot/Shutdown.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Finished Update UTMP about System Runlevel Changes.
[  OK  ] Started Network Time Synchronization.
[  OK  ] Reached target System Time Set.
[  OK  ] Reached target System Time Synchronized.
         Stopping Network Time Synchronization...
[  OK  ] Stopped Network Time Synchronization.
         Starting Network Time Synchronization...
[  OK  ] Started Network Time Synchronization.
[  OK  ] Finished Raise network interfaces.
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to r
Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.
Press Enter to continue.

It seems that one of the disks cannot be attached or mounted, but what can I do about it now? The disk appears to be available as usual in Compute Engine.
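
For reference, this is roughly how the disk availability can be verified from outside the VM with gcloud; "my-instance" and "my-data-disk" below are placeholders for the real resource names:

# List the disks attached to the instance:
gcloud compute instances describe my-instance --zone=europe-west1-b --format="yaml(disks)"

# Show the disk status (e.g. READY) and which instances it is attached to:
gcloud compute disks describe my-data-disk --zone=europe-west1-b --format="value(status,users)"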

Best Answer

I am afraid you cannot do much about the affected VM itself.

In the Host events documentation (FAQ) you can find this information:

A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine hosting your VM that caused your VM to crash. A host error which involves total hardware failure or other hardware issues might prevent live migration of your VM.

Even though the VM instance is "in the cloud", it still runs on a physical machine that hosts your workload. Unfortunately, that host had a hardware or software failure, and there is nothing you can do about that.

GCP offers a feature called live migration, which is meant to prevent this kind of situation.

Compute Engine offers live migration to keep your virtual machine instances running even when a host system event, such as a software or hardware update, occurs. However, I guess it is too late to configure this for the affected VM.

...

Live migration keeps your instances running during:

  • Regular infrastructure maintenance and upgrades.
  • Network and power grid maintenance in the data centers.
  • Failed hardware such as memory, CPU, network interface cards, disks, power, and so on. This is done on a best-effort basis; if the hardware fails completely or otherwise prevents live migration, the VM crashes and restarts automatically and a hostError is logged.

...

Live migration does not change any attributes or properties of the VM itself. The live migration process just transfers a running VM from one host machine to another host machine within the same zone.
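
For completeness, you can check whether live migration and automatic restart are enabled on a VM through its scheduling settings. A rough gcloud sketch, with the instance name and zone as placeholders:

# Show the current maintenance policy and automatic restart setting:
gcloud compute instances describe my-instance --zone=europe-west1-b \
    --format="value(scheduling.onHostMaintenance,scheduling.automaticRestart)"

# Ensure the VM migrates on host maintenance and restarts after a crash:
gcloud compute instances set-scheduling my-instance --zone=europe-west1-b \
    --maintenance-policy=MIGRATE --restart-on-failure

Note that, as quoted above, even MIGRATE cannot help when the hardware fails completely; in that case the VM crashes and is restarted on another host, which is what the hostError log entry indicates.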

Possible Workaround

Since you mention that the disks are persistent and still visible in GCP, you could try to attach them to another VM. A how-to guide can be found in the Creating and attaching a disk documentation.
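
A minimal sketch of that workaround with gcloud and on the rescue VM; all instance, disk, and device names below are placeholders, and the actual device path on the rescue VM depends on how many disks it already has:

# Detach the data disk from the broken VM (it must not be in use):
gcloud compute instances detach-disk broken-vm --disk=data-disk --zone=europe-west1-b

# Attach it to a healthy VM in the same zone:
gcloud compute instances attach-disk rescue-vm --disk=data-disk --zone=europe-west1-b

# On the rescue VM: find the new block device and mount it read-only first:
lsblk
sudo mkdir -p /mnt/rescue
sudo mount -o ro /dev/sdb1 /mnt/rescue

Mounting read-only first lets you inspect the filesystem (and run fsck if needed) before you write to it or attach it back to a rebuilt VM.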
