Why does Slurm fail to start with systemd but work when started manually?

slurm ubuntu-18.04

I've just set up Slurm on a single physical machine, which will be the only system in the cluster (so far). This is on Ubuntu 18.04.

I have slurmdbd running, but when I attempt to start slurmd and slurmctld, the start operation times out. Why?

I'm issuing the following commands:

systemctl start slurmctld
systemctl start slurmd

I've also tried:

systemctl start slurmctld slurmd

and:

systemctl start slurmd slurmctld

This fails with the following, for slurmctld:

systemd[1]: slurmd.service: Can't open PID file /var/run/slurm-llnl/slurm-llnl/slurmd.pid (yet?) after start: No such file or directory
systemd[1]: slurmctld.service: Start operation timed out. Terminating.
systemd[1]: slurmctld.service: Failed with result 'timeout'.
systemd[1]: Failed to start Slurm controller daemon.

And for slurmd:

systemd[1]: slurmd.service: Start operation timed out. Terminating.
systemd[1]: slurmd.service: Failed with result 'timeout'.
systemd[1]: Failed to start Slurm node daemon.

However, when I start these manually (using two terminals) by issuing:

slurmctld -Dvvv
slurmd -Dvvv

everything appears to work.

Why is this? How am I supposed to start slurm?

These are the service files, which should be standard (my only change was adding verbose arguments, which I later removed again):

# cat /lib/systemd/system/slurmd.service 
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmd(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target

# cat /lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmctld(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmctld.pid

[Install]
WantedBy=multi-user.target

Best Answer

Look carefully at your log:

Can't open PID file /var/run/slurm-llnl/slurm-llnl/slurmd.pid

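This path does not match the PIDFile declared in your /lib/systemd/system/slurmd.service, which is /var/run/slurm-llnl/slurmd.pid; note the doubled slurm-llnl directory. With Type=forking and a PIDFile, systemd waits for the daemon's PID file to appear at the expected path, and if the two sides disagree about that path, the start operation times out even though the daemon itself is able to run. Starting the daemons in the foreground with -D does not involve systemd at all, which is why the manual start works. To fix it, correct the SlurmdPidFile setting in /etc/slurm-llnl/slurm.conf so that it matches the unit file. The same goes for SlurmctldPidFile and slurmctld.service.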
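
As a sketch, assuming the unit files are the unmodified ones shown in the question, the two settings in /etc/slurm-llnl/slurm.conf would have to read as follows so that they agree with the PIDFile= lines:

```conf
# /etc/slurm-llnl/slurm.conf (only the two relevant lines are shown)
# Values copied from the PIDFile= lines of the unit files in the question.
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
```

After changing them, start the services again with the same systemctl start commands as before.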

Note also that the easy configurator /usr/share/doc/slurm-wlm-doc/html/configurator.easy.html offers /var/run/slurmd.pid by default, which fails as well.
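
To check for this kind of mismatch without reading both files by eye, something like the following could be used. This is only a sketch: get_setting and check_pidfile are hypothetical helper names, not part of Slurm or systemd, and the parsing assumes plain Key=Value lines with no trailing comments.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a "Key=Value" line from a file.
# If the key appears more than once, the last occurrence is printed.
# $1 = key, $2 = file
get_setting() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Hypothetical helper: compare the PID file named in a systemd unit with the
# one configured in slurm.conf. Returns non-zero on a mismatch.
# $1 = unit file, $2 = slurm.conf, $3 = slurm.conf key (e.g. SlurmdPidFile)
check_pidfile() {
    unit_pid=$(get_setting PIDFile "$1")
    conf_pid=$(get_setting "$3" "$2")
    if [ "$unit_pid" = "$conf_pid" ]; then
        echo "match: $unit_pid"
    else
        echo "MISMATCH: unit expects $unit_pid, slurm.conf sets $conf_pid"
        return 1
    fi
}

# Usage with the paths from the question:
# check_pidfile /lib/systemd/system/slurmd.service /etc/slurm-llnl/slurm.conf SlurmdPidFile
# check_pidfile /lib/systemd/system/slurmctld.service /etc/slurm-llnl/slurm.conf SlurmctldPidFile
```

A mismatch reported for either daemon points at the slurm.conf line that needs correcting.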
