Linux – Entire Proxmox node has grey question marks + database container won't start

Tags: cluster, linux, MySQL, percona, proxmox

Firstly, I've recently taken on the management of a Proxmox cluster, which I have no previous experience managing (I'm completely new to cluster management, but not too bad at Linux).

pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.13-6-pve)

I have two "xen" nodes which run a number of containers and VMs. Yesterday, a container on xen2 which runs a MySQL database stopped responding. I was able to log in to the container via SSH and attempted to restart MySQL, only to receive an error along the lines of being unable to connect through mysql.sock. So I decided to simply shut the container down and start it back up. I chose 'shutdown' in the Proxmox UI for the container, which it did. Then I clicked 'start', at which point the Proxmox task log recorded:

CT 110 - Start          ERROR: command 'systemctl start pve-container@110' failed: exit code 1
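
(With hindsight, before shutting down a container in this state it's worth ruling out a full filesystem. This is a guess rather than something the logs below confirm, but it's a very common cause of mysql.sock connection failures, and it can also leave a container unable to start cleanly afterwards:)

```shell
# Quick, safe sanity checks (runs anywhere, nothing Proxmox-specific):
# a filesystem at 100% space or 100% inodes is a classic cause of
# "can't connect through mysql.sock" errors.
df -h    # block usage: look for 100% on the container's volume
df -i    # inode usage: exhaustion produces the same symptoms
```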

So I tried running the 'systemctl start …' via SSH. It takes a while, and then I get the following:

Job for pve-container@110.service failed because a timeout was exceeded.
See "systemctl status pve-container@110.service" and "journalctl -xe" for details.

Here is the output of 'systemctl status …':

● pve-container@110.service - PVE LXC Container: 110
   Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2018-06-07 08:35:22 BST; 43s ago
     Docs: man:lxc-start
           man:lxc
           man:pct
  Process: 1603366 ExecStart=/usr/bin/lxc-start -n 110 (code=killed, signal=TERM)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/system-pve\x2dcontainer.slice/pve-container@110.service
           └─1532500 [lxc monitor] /var/lib/lxc 110

Jun 07 08:33:52 xen2 systemd[1]: Starting PVE LXC Container: 110...
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.

and 'journalctl -xe':

Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
-- Subject: Unit pve-container@110.service has failed
-- Defined-By: systemd
--
-- Unit pve-container@110.service has failed.
--
-- The result is failed.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.
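
(Note for future readers: systemd only reports the timeout, not the cause. The usual next step is to start the container in the foreground with LXC debug logging. A sketch, assuming container ID 110; the log path is just an example:)

```shell
# Sketch: to be run on the Proxmox node as root. Wrapped in a function so
# the snippet is safe to paste; call debug_ct_start to actually run it.
debug_ct_start() {
    # -n 110            : container ID
    # -F                : stay in the foreground so errors hit the terminal
    # -l DEBUG -o <log> : write a verbose log (path is an example)
    lxc-start -n 110 -F -l DEBUG -o /tmp/lxc-110.log
}
```

The tail of /tmp/lxc-110.log usually names the exact mount, cgroup, or hook step that is hanging, which systemd's timeout message hides.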

Shortly after my first attempt to restart the container, the entire xen2 node started displaying grey question marks alongside all its VMs/containers, and they lost their labels (see screenshot):

[screenshot: the xen2 node's VMs/containers shown with grey question marks and no names]

Despite this, all the other VMs/containers within xen2 are still functioning fine. So I then ran the following commands to see what would happen:

service pvedaemon restart  (nothing changed)
service pveproxy restart   (nothing changed)
service pvestatd restart   (the VMs, but not the containers, started showing names in the Proxmox UI again, although this only lasted 10-15 minutes)

I'm hesitant to upgrade or reboot the entire node, both because I don't know the configuration well enough to predict the pitfalls and because it's business-critical to have at least something running. Furthermore, I've been through /var/log/syslog and didn't see anything that indicated why the container crashed.

Ideally, I want to:

1. Determine why the database container (110) crashed
2. Successfully start the database container again
3. Determine why the xen2 node isn't reporting data about its VMs/containers to the UI
4. Fix the reporting data in the UI for the node

Again, please appreciate that I'm new to Proxmox, but I do know my way around Linux.

Thank you for any tips/knowledge on troubleshooting this problem. If there is any other info you'd like me to share, please let me know.

Cheers,
David

Best Answer

I've also suffered from a problem with similar symptoms (all nodes, VMs, and CTs go into an "unknown" status). Using the command line everything seemed fine, so it was more of a nuisance than anything, although it meant I had to migrate everything and reboot each node individually before I could use the web UI again. I eventually figured out that restarting the following services on each node fixes the problem:

systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd

I recommend dropping these into a script and running it with ./script.sh & to fork it into the background, since restarting these services will disconnect a console session opened through the web UI and the backgrounded script will still finish all three restarts.
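
For reference, a minimal version of that script (the filename and function name here are just examples):

```shell
#!/bin/sh
# restart-pve-ui.sh (example name) -- restart the services that feed the
# Proxmox web UI. Run as root on the affected node; launch it backgrounded
# (./restart-pve-ui.sh &) so the pveproxy restart can't cut it off mid-run.
restart_pve_ui() {
    systemctl restart pvedaemon
    systemctl restart pveproxy
    systemctl restart pvestatd
}

# /etc/pve only exists on an actual Proxmox node (it's the pmxcfs mount),
# so this guard makes the script a no-op anywhere else.
if [ -d /etc/pve ]; then
    restart_pve_ui
fi
```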
