Linux – Entire Proxmox node has grey question marks + database container won't start

Tags: cluster, linux, MySQL, percona, proxmox

Firstly, I've recently taken on the management of a Proxmox cluster, which I have no previous experience managing (I'm completely new to cluster management, but not too bad at Linux).

pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.13-6-pve)

I have two "xen" nodes which run a number of containers and VMs. Yesterday, a container on xen2 which runs a MySQL database stopped responding. I was able to log in to the container via SSH and attempted to restart MySQL, only to receive an error along the lines of being unable to connect through mysql.sock. So I decided to simply shut the container down and start it back up. I chose 'shutdown' in the Proxmox UI for the container, which it did. Then I clicked 'start', at which point the Proxmox task log recorded:

CT 110 - Start          ERROR: command 'systemctl start pve-container@110' failed: exit code 1
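
(With hindsight, before shutting down a container in this state it's worth ruling out a full filesystem. This is a guess rather than something the logs below confirm, but it's a very common cause of mysql.sock connection failures, and it can also leave a container unable to start cleanly afterwards:)

```shell
# Quick, safe sanity checks (runs anywhere, nothing Proxmox-specific):
# a filesystem at 100% space or 100% inodes is a classic cause of
# "can't connect through mysql.sock" errors.
df -h    # block usage: look for 100% on the container's volume
df -i    # inode usage: exhaustion produces the same symptoms
```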

So I tried running the 'systemctl start …' via SSH. It takes a while, and then I get the following:

Job for pve-container@110.service failed because a timeout was exceeded.
See "systemctl status pve-container@110.service" and "journalctl -xe" for details.

Here is the output of 'systemctl status …':

● pve-container@110.service - PVE LXC Container: 110
   Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2018-06-07 08:35:22 BST; 43s ago
     Docs: man:lxc-start
           man:lxc
           man:pct
  Process: 1603366 ExecStart=/usr/bin/lxc-start -n 110 (code=killed, signal=TERM)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/system-pve\x2dcontainer.slice/pve-container@110.service
           └─1532500 [lxc monitor] /var/lib/lxc 110

Jun 07 08:33:52 xen2 systemd[1]: Starting PVE LXC Container: 110...
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.

and 'journalctl -xe':

Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
-- Subject: Unit pve-container@110.service has failed
-- Defined-By: systemd
--
-- Unit pve-container@110.service has failed.
--
-- The result is failed.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.
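
(Note for future readers: systemd only reports the timeout, not the cause. The usual next step is to start the container in the foreground with LXC debug logging. A sketch, assuming container ID 110; the log path is just an example:)

```shell
# Sketch: to be run on the Proxmox node as root. Wrapped in a function so
# the snippet is safe to paste; call debug_ct_start to actually run it.
debug_ct_start() {
    # -n 110            : container ID
    # -F                : stay in the foreground so errors hit the terminal
    # -l DEBUG -o <log> : write a verbose log (path is an example)
    lxc-start -n 110 -F -l DEBUG -o /tmp/lxc-110.log
}
```

The tail of /tmp/lxc-110.log usually names the exact mount, cgroup, or hook step that is hanging, which systemd's timeout message hides.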

Shortly after my first attempt to restart the container, the entire xen2 node started displaying grey question marks alongside all its VMs/containers, and they lost their labels (see screenshot):

[screenshot: the xen2 node's VMs/containers shown with grey question marks and no names]

Despite this, all the other VMs/containers within xen2 are still functioning fine. So I then ran the following commands to see what would happen:

service pvedaemon restart  (nothing changed)
service pveproxy restart   (nothing changed)
service pvestatd restart   (the VMs, but not the containers, started showing names in the Proxmox UI again, although this only lasted 10-15 minutes)

I'm hesitant to upgrade or reboot the entire node, both because I don't know the configuration well enough to predict the pitfalls and because it's business-critical to have at least something running. Furthermore, I've been through /var/log/syslog and didn't see anything that indicated why the container crashed.

Ideally, I want to:

1. Determine why the database container (110) crashed
2. Successfully start the database container again
3. Determine why the xen2 node isn't reporting data about its VMs/containers to the UI
4. Fix the reporting data in the UI for the node

Again, please appreciate that I'm new to Proxmox, but I do know my way around Linux.

Thank you for any tips/knowledge on troubleshooting this problem. If there is any other info you'd like me to share, please let me know.

Cheers,
David

Best Answer

I've also suffered from a problem with similar symptoms (all nodes, VMs, and CTs go into an "unknown" status). Using the command line everything seemed fine, so it was more of a nuisance than anything, although it meant I had to migrate everything and reboot each node individually before I could use the web UI again. I eventually figured out that restarting the following services on each node fixes the problem:

systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd

I recommend dropping these into a script and running it with ./script.sh & to fork it into the background, since restarting these services will disconnect a console session opened through the web UI and the backgrounded script will still finish all three restarts.
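
For reference, a minimal version of that script (the filename and function name here are just examples):

```shell
#!/bin/sh
# restart-pve-ui.sh (example name) -- restart the services that feed the
# Proxmox web UI. Run as root on the affected node; launch it backgrounded
# (./restart-pve-ui.sh &) so the pveproxy restart can't cut it off mid-run.
restart_pve_ui() {
    systemctl restart pvedaemon
    systemctl restart pveproxy
    systemctl restart pvestatd
}

# /etc/pve only exists on an actual Proxmox node (it's the pmxcfs mount),
# so this guard makes the script a no-op anywhere else.
if [ -d /etc/pve ]; then
    restart_pve_ui
fi
```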
