Start etcd in docker from systemd

amazon-linux-2, docker, etcd, linux, systemd

I want to start etcd (single node) in docker from systemd, but something seems to go wrong – it gets terminated about 30 seconds after start.

It looks like the service starts in status "activating" but gets terminated after about 30 seconds, without ever reaching the status "active". Perhaps there is some missing signalling between the docker container and systemd?

Update (see bottom of post): the systemd service status reaches failed (Result: timeout) when I remove the Restart=on-failure instruction.

When I check the status of the etcd service after boot, I get this result:

$ sudo systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2021-08-18 20:13:30 UTC; 4s ago
  Process: 2971 ExecStart=/usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 /usr/local/bin/etcd --data-dir=/etcd-data --name etcd0 --advertise-client-urls http://10.0.0.11:2379 --listen-client-urls http://0.0.0.0:2379 --initial-advertise-peer-urls http://10.0.0.11:2380 --listen-peer-urls http://0.0.0.0:2380 --initial-cluster etcd0=http://10.0.0.11:2380 (code=exited, status=125)
 Main PID: 2971 (code=exited, status=125)

I run this on an Amazon Linux 2 machine, with a user data script that runs at launch. I have confirmed that docker.service and docker_ecr_login.service run successfully.

And shortly after launch of the machine, I can see that etcd is running:

$ sudo systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2021-08-18 20:30:07 UTC; 1min 20s ago
 Main PID: 1573 (docker)
    Tasks: 9
   Memory: 24.3M
   CGroup: /system.slice/etcd.service
           └─1573 /usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com...

Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.690Z","logger":"raft","caller":"...rm 2"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.691Z","caller":"etcdserver/serve..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"membership/clust..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"etcdserver/server.go:2...
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"api/capability.g..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"etcdserver/serve..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"embed/serve.go:9...ests"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.695Z","caller":"etcdmain/main.go...emon"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.695Z","caller":"etcdmain/main.go...emon"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.702Z","caller":"embed/serve.go:1...2379"}
Hint: Some lines were ellipsized, use -l to show in full.

I get the same behavior whether etcd listens on the node IP (10.0.0.11) or on 127.0.0.1.

I can run etcd locally, started from the command line (and it does not terminate after 30 seconds), with:

sudo docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd-local \
my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 \
/usr/local/bin/etcd --data-dir=/etcd-data \
--name etcd0 \
--advertise-client-urls http://127.0.0.1:2379 \
--listen-client-urls http://0.0.0.0:2379 \
--initial-advertise-peer-urls http://127.0.0.1:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-cluster etcd0=http://127.0.0.1:2380
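
For reference, the etcd image bundles etcdctl, so a quick health check against this local container might look like the following sketch (the container name etcd-local matches the command above):

sudo docker exec etcd-local /usr/local/bin/etcdctl endpoint health
# expected output, roughly:
# 127.0.0.1:2379 is healthy: successfully committed proposal: took = ...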

The parameters to etcd are similar to Running a single node etcd – etcd 3.5 documentation.

This is the relevant part of the startup script that is intended to launch etcd:

sudo docker volume create --name etcd-data

cat <<EOF | sudo tee /etc/systemd/system/etcd.service
[Unit]
Description=etcd
After=docker_ecr_login.service

[Service]
Type=notify
ExecStart=/usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data \
 --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 \
 /usr/local/bin/etcd --data-dir=/etcd-data \
 --name etcd0 \
 --advertise-client-urls http://10.0.0.11:2379 \
 --listen-client-urls http://0.0.0.0:2379 \
 --initial-advertise-peer-urls http://10.0.0.11:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --initial-cluster etcd0=http://10.0.0.11:2380
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable etcd
sudo systemctl start etcd

When listing all containers on the machine, I can see that it has been running:

sudo docker ps -a
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED          STATUS                      PORTS                          NAMES
a744aed0beb1   my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0   "/usr/local/bin/etcd…"   25 minutes ago   Exited (0) 24 minutes ago                          etcd

but I suspect that it cannot be restarted, since the container name already exists.
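
A quick way to test that suspicion: docker run exits with status 125 whenever the docker daemon itself refuses the request, which matches the status=125 in the first systemctl output above, and a name clash is one such refusal. Roughly:

sudo docker run --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0
# docker: Error response from daemon: Conflict. The container name "/etcd" is already in use...
echo $?
# 125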

Why does the etcd container get terminated after ~30 seconds when started from systemd? It appears to start successfully, but systemd only ever shows it in status "activating", never "active", and it seems to be terminated after about 30 seconds. Is there some missing signalling from the etcd docker container to systemd? If so, how do I get that signalling right?


UPDATE:

After removing the Restart=on-failure instruction in the service unit file, I now get status: failed (Result: timeout):

$ sudo systemctl status etcd
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Wed 2021-08-18 21:35:54 UTC; 5min ago
  Process: 1567 ExecStart=/usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 /usr/local/bin/etcd --data-dir=/etcd-data --name etcd0 --advertise-client-urls http://127.0.0.1:2379 --listen-client-urls http://0.0.0.0:2379 --initial-advertise-peer-urls http://127.0.0.1:2380 --listen-peer-urls http://0.0.0.0:2380 --initial-cluster etcd0=http://127.0.0.1:2380 (code=exited, status=0/SUCCESS)
 Main PID: 1567 (code=exited, status=0/SUCCESS)

Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.332Z","caller":"osutil/interrupt...ated"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.333Z","caller":"embed/etcd.go:36...379"]}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: WARNING: 2021/08/18 21:35:54 [core] grpc: addrConn.createTransport failed ...ing...
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.335Z","caller":"etcdserver/serve...6a6c"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.337Z","caller":"embed/etcd.go:56...2380"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.338Z","caller":"embed/etcd.go:56...2380"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.339Z","caller":"embed/etcd.go:36...379"]}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal systemd[1]: Failed to start etcd.
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal systemd[1]: Unit etcd.service entered failed state.
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

Best Answer

Update: Posting test data and integrating updates based on comments received. docker -d is not required for systemd integration, as I originally thought. In my experience the Type= setting, as Michael indicated, matters more than offloading the daemonizing of the service to docker. The OP's problem looked at first blush like a side effect of not backgrounding the container, as I originally explained, but further testing showed that backgrounding is irrelevant.

Note that the Amazon AWS image used in the OP is not something I can test or directly troubleshoot. A contrasting example for etcd and systemd is shown here to help with configuring the endpoint system similarly to mine. System details:

  • Ubuntu 20.04 LTS
  • docker 20.10.7
  • etcd 3.5.0

systemd configuration

I ended up with the following systemd service file. Note Type=simple, owing to Michael's suggestion to clarify this point in the response (and, apparently, in my own understanding of this piece of the puzzle). You can learn more about systemd service types here:

https://www.freedesktop.org/software/systemd/man/systemd.service.html

Type matters. More to the point, my original understanding of the simple type was myopically focused on the lack of communication back to systemd, which caused me to ignore how the type setting governs systemd's reaction to the behavior of the called application (in this case docker).
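
This is the crux of the symptom in the question: with Type=notify, systemd keeps the unit in "activating" until the service sends READY=1 over the notification socket, and a plain docker run never sends it, so the unit times out and is killed. A minimal sketch of the mechanism, assuming systemd-run and systemd-notify are available on the host (the unit name notify-demo is made up for illustration):

# Type=notify: systemd waits for READY=1 before marking the unit active.
# NotifyAccess=all lets the short-lived systemd-notify helper deliver it.
sudo systemd-run --unit=notify-demo --service-type=notify -p NotifyAccess=all \
  /bin/sh -c 'sleep 2; systemd-notify --ready; sleep 300'
systemctl status notify-demo   # "active (running)" once READY=1 arrives

# docker run sends no such notification, so under Type=notify the unit sits
# in "activating" until TimeoutStartSec expires and systemd kills it.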

Removing Type= entirely, or setting it explicitly to Type=simple, results in the same behavior (simple is the default). The following configuration worked reliably in my tests, with or without -d in the docker run command:

[Unit]
Description=Docker container-etcd.service
Documentation=man:docker
Requires=docker.service
Wants=network.target
After=network-online.target

[Service]
ExecStartPre=-/usr/bin/docker stop etcd
ExecStartPre=-/usr/bin/docker rm etcd
ExecStart=/usr/bin/docker run --rm -d -p 2379:2379 -p 2380:2380 --volume=/home/user/etcd-data:/etcd-data --name etcd quay.io/coreos/etcd:v3.5.0 /usr/local/bin/etcd --data-dir=/etcd-data --name etcd --initial-advertise-peer-urls http://10.4.4.132:2380 --listen-peer-urls http://0.0.0.0:2380 --advertise-client-urls http://10.4.4.132:2379 --listen-client-urls http://0.0.0.0:2379 --initial-cluster etcd=http://10.4.4.132:2380
ExecStop=/usr/bin/docker stop etcd -t 10
ExecReload=/usr/bin/docker restart etcd
KillMode=none
RemainAfterExit=1
Restart=on-failure
Type=simple

[Install]
WantedBy=multi-user.target default.target
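
To put this into effect, a sketch of the activation steps (assuming the unit file is saved as /etc/systemd/system/container-etcd.service, matching the Description above):

sudo systemctl daemon-reload
sudo systemctl enable --now container-etcd
systemctl status container-etcd     # should report active
sudo docker ps --filter name=etcd   # the container should be running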

Notes

  • RemainAfterExit added, as systemd will otherwise consider the service exited after start; the lack of this boolean creates a seemingly erroneous situation where docker ps shows the container running, but systemctl status container-etcd shows exited and inactive.
  • The unit file is somewhat unidiomatic: %n is typically used in the Exec lines to refer to the unit name (as in ...docker restart %n), but I did not want to introduce further confusion while attempting to solve the OP's problem. Not to mention I went with etcd as the docker container name, versus container-etcd, the unit's service name.
  • ExecStart was collapsed to a one-line command. The standard \ continuation syntax did not work for me, nor did quoting the etcd command passed to the container. Yesterday's tests seemed to work fine, but today's configuration did not behave the same way, so I redid the tests and configurations to find what seemed most stable for me.
  • Obviously, if you're going to be using docker rm at any point, you must, or at least very strongly should, use bind mounts, as stated in the OP and here with --volume. Personally I use full path locations, all stored under /srv, and then bind mount into the container. That way I have one folder to back up, and the state of the containers, present or not, is irrelevant (see the sketch after this list).
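
A minimal sketch of that layout (the /srv/etcd-data path is illustrative; substitute your own):

# Create the host directory and bind mount it into the container:
sudo mkdir -p /srv/etcd-data
# In ExecStart, use: --volume=/srv/etcd-data:/etcd-data
# Backups then reduce to archiving one folder, whether or not the container exists:
sudo tar -czf /tmp/etcd-data-backup.tar.gz -C /srv etcd-data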

Confirmation

After updating the systemd service file, doing a daemon-reload, etc., I exec'd into the container and ran a test command against etcd:

  • docker exec -it etcd sh
  • etcdctl --endpoints=http://10.4.4.132:2379 member list

Result

9a552f9b95628384, started, etcd, http://10.4.4.132:2380, http://10.4.4.132:2379, false
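
Equivalently, with etcdctl installed on the host, the same check works without entering the container (the endpoint address is my test machine's; substitute your own):

etcdctl --endpoints=http://10.4.4.132:2379 endpoint health
# expected output, roughly:
# http://10.4.4.132:2379 is healthy: successfully committed proposal: took = ...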