Linux – Slurm: “Connection refused” for certain sacctmgr commands

linux, slurm, ubuntu-18.04

I have an existing Slurm cluster up and running. As of today, without any configuration change, certain sacctmgr commands fail with an error and slurmdbd crashes:

$ sacctmgr list associations
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to slurm.domain.com:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type 1410
sacctmgr: error: slurmdbd: DBD_GET_ASSOCS failure: Connection refused
 Error with request: Connection refused

The output of systemctl status for slurmdbd shows:

Jul 03 10:01:46 slurm systemd[1]: slurmdbd.service: Main process exited, code=killed, status=11/SEGV
Jul 03 10:01:46 slurm systemd[1]: slurmdbd.service: Failed with result 'signal'.

and the slurmdbd.log says:

[2020-07-03T10:01:45.816] debug2: Opened connection 9 from 127.0.0.1
[2020-07-03T10:01:45.817] debug:  REQUEST_PERSIST_INIT: CLUSTER:slurmcluster VERSION:8192 UID:0 IP:127.0.0.1 CONN:9
[2020-07-03T10:01:45.817] debug2: acct_storage_p_get_connection: request new connection 1
[2020-07-03T10:01:45.861] debug2: DBD_FINI: CLOSE:0 COMMIT:0
[2020-07-03T10:01:45.862] debug4: got 0 commits
[2020-07-03T10:01:45.949] debug2: DBD_GET_ASSOCS: called
[2020-07-03T10:01:45.950] debug4: 9(as_mysql_assoc.c:2032) query
call get_parent_limits('assoc_table', 'root', 'slurmcluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;

However, other commands work (slurmdbd has to be restarted after each crash):

$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
slurmclus+       127.0.0.1         6817  8192         1                                                                                           normal

I can connect to the database and execute commands. Also, I can connect via telnet slurm.domain.com 6819.

I'm using Slurm 17.11.2 with MySQL 5.7, both from the standard Ubuntu 18.04 repositories.

Best Answer

It turns out that the problem was caused by an unattended upgrade, which updated MySQL from 5.7.29 to 5.7.30. Everything works with MySQL 5.7.29. The changelog doesn't mention anything obviously related, but according to the slurm-users mailing list this is the cause:

Seems that (at least for the mysql procedure get_parent_limits) mySQL 5.7.30 returns NULL where mySQL 5.7.29 returned an empty string.
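To check whether an unattended upgrade touched MySQL on your own machine, you can search the apt history log. The following is only a minimal sketch: the function name mysql_upgrades is made up for illustration, and it assumes the usual apt history format, in which "Upgrade:" lines list packages as "name:arch (old-version, new-version)". Rotated logs (history.log.1.gz and so on) are not searched.

```shell
#!/bin/sh
# Hypothetical helper: print every mysql-* package upgrade found in an
# apt history log, one "package (old, new)" entry per line.
# Usage: mysql_upgrades [logfile]
mysql_upgrades() {
    # Default to the standard apt history location on Debian/Ubuntu.
    log="${1:-/var/log/apt/history.log}"
    # Keep only the "Upgrade:" lines, then extract each
    # "mysql-<something> (<old>, <new>)" entry with grep -o.
    grep '^Upgrade:' "$log" | grep -o 'mysql-[^ ]* ([^)]*)'
}

# Example call (uncomment to run against the default log):
# mysql_upgrades
```

If the output shows a step from 5.7.29 to 5.7.30, the crash described above is the likely explanation. Once a working version is installed again, something like sudo apt-mark hold mysql-server-5.7 should keep unattended upgrades from replacing it; the package name is an assumption based on the stock Ubuntu 18.04 packaging, so verify it with dpkg -l first.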
