I have an existing Slurm cluster up and running, but as of today, without any configuration change, I get an error when I run certain sacctmgr commands, and slurmdbd crashes:
$ sacctmgr list associations
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to slurm.domain.com:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type 1410
sacctmgr: error: slurmdbd: DBD_GET_ASSOCS failure: Connection refused
Error with request: Connection refused
The systemctl status shows:
Jul 03 10:01:46 slurm systemd[1]: slurmdbd.service: Main process exited, code=killed, status=11/SEGV
Jul 03 10:01:46 slurm systemd[1]: slurmdbd.service: Failed with result 'signal'.
and the slurmdbd.log says:
[2020-07-03T10:01:45.816] debug2: Opened connection 9 from 127.0.0.1
[2020-07-03T10:01:45.817] debug: REQUEST_PERSIST_INIT: CLUSTER:slurmcluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:9
[2020-07-03T10:01:45.817] debug2: acct_storage_p_get_connection: request new connection 1
[2020-07-03T10:01:45.861] debug2: DBD_FINI: CLOSE:0 COMMIT:0
[2020-07-03T10:01:45.862] debug4: got 0 commits
[2020-07-03T10:01:45.949] debug2: DBD_GET_ASSOCS: called
[2020-07-03T10:01:45.950] debug4: 9(as_mysql_assoc.c:2032) query
call get_parent_limits('assoc_table', 'root', 'slurmcluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;
However, other commands work (slurmdbd has to be restarted after each crash):
$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
slurmclus+       127.0.0.1         6817  8192         1                                                                                           normal
I can connect to the database and execute commands. Also, I can connect via telnet slurm.domain.com 6819.
I'm using slurm 17.11.2 with MySQL 5.7 from the standard Ubuntu 18.04 repositories.
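Since the crash appeared without any configuration change, one thing worth ruling out is a package upgrade. The following is a minimal sketch, assuming the usual dpkg log layout on Debian/Ubuntu (`DATE TIME upgrade PACKAGE:ARCH OLD_VERSION NEW_VERSION`) and the default log path `/var/log/dpkg.log`. The function name `recent_upgrades` is made up for illustration and is not part of any tool.

```shell
# List package upgrades recorded by dpkg whose package name matches a regex.
# Assumption: log lines look like
#   DATE TIME upgrade PACKAGE:ARCH OLD_VERSION NEW_VERSION
recent_upgrades() {
    pattern="$1"
    log="${2:-/var/log/dpkg.log}"   # assumed default location on Ubuntu
    awk -v pat="$pattern" '
        # Keep only "upgrade" actions for matching packages and
        # print: date, time, package, old version -> new version
        $3 == "upgrade" && $4 ~ pat { print $1, $2, $4, $5, "->", $6 }
    ' "$log"
}

# Example (hypothetical invocation):
#   recent_upgrades '^(mysql|slurm)'
```

Older entries may have been rotated into `/var/log/dpkg.log.1` or compressed archives, so an empty result from the current log alone does not prove that nothing was upgraded.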
Best Answer
It turns out that the problem was an unattended upgrade, in which MySQL was updated from 5.7.29 to 5.7.30. Everything works with MySQL 5.7.29. The changelog doesn't mention anything obviously related, but according to the slurm-users mailing list this is the problem: