Slurm Nodes – Fix Random Dropping Issue

cluster slurm

I've set up a Slurm cluster consisting of a head node, 16 compute nodes, and a NAS providing NFSv4 shared storage. I recently installed Slurm on Ubuntu 22 via apt (sinfo -V reports slurm-wlm 21.08.5). I've tested some single-node and multi-node jobs, and they run to completion as expected. However, for some simulations, some nodes change state to down midway through the run. It is always the same two nodes, though the timing seems random. It happens more often than not, but I believe a few simulations have finished on these nodes. On the nodes whose state changes to down, the slurmd daemon is still active afterwards; that is, whatever failure is happening does not appear to be due to the daemon going down.

Overall: why are these nodes terminating jobs and changing state to down?

More info: I've checked the slurmd log on one of the nodes that goes down, and this is what it shows (from the approximate time of job submission to the failure / node down). Note that this is for a job (with ID=64) submitted to 4 nodes using all 64 processors per node:

[2023-12-07T16:48:29.487] [64.extern] debug2: setup for a launch_task
[2023-12-07T16:48:29.487] [64.extern] debug2: hwloc_topology_init
[2023-12-07T16:48:29.491] [64.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.493] [64.extern] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.494] [64.extern] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:29.499] [64.extern] debug:  Message thread started pid = 4176
[2023-12-07T16:48:29.503] [64.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.510] [64.extern] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: Before call to spank_init()
[2023-12-07T16:48:29.513] [64.extern] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.513] [64.extern] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: After call to spank_init()
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.579] [64.extern] debug2: adding task 3 pid 4185 on node 3 to jobacct
[2023-12-07T16:48:29.582] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-12-07T16:48:29.830] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] launch task StepId=64.0 request from UID:1000 GID:1000 HOST:10.115.79.9 PORT:48642
[2023-12-07T16:48:29.830] debug:  Checking credential with 868 bytes of sig data
[2023-12-07T16:48:29.830] task/affinity: lllp_distribution: JobId=64 manual binding: none,one_thread
[2023-12-07T16:48:29.830] debug:  Waiting for job 64's prolog to complete
[2023-12-07T16:48:29.830] debug:  Finished wait for job 64's prolog to complete
[2023-12-07T16:48:29.839] debug2: debug level read from slurmd is 'debug2'.
[2023-12-07T16:48:29.839] debug2: read_slurmd_conf_lite: slurmd sent 8 TRES.
[2023-12-07T16:48:29.839] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:48:29.839] debug2: Received CPU frequency information for 64 CPUs
[2023-12-07T16:48:29.840] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:48:29.840] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:48:29.840] [64.0] debug2: setup for a launch_task
[2023-12-07T16:48:29.840] [64.0] debug2: hwloc_topology_init
[2023-12-07T16:48:29.845] [64.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.846] [64.0] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.847] [64.0] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.851] [64.0] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:48:29.852] [64.0] debug:  Message thread started pid = 4188
[2023-12-07T16:48:29.852] debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.857] [64.0] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.863] [64.0] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  mpi type = none
[2023-12-07T16:48:29.863] [64.0] debug2: Before call to spank_init()
[2023-12-07T16:48:29.863] [64.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.864] [64.0] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.864] [64.0] debug2: After call to spank_init()
[2023-12-07T16:48:29.864] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.864] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_prefork: mpi/none: slurmstepd prefork
[2023-12-07T16:48:29.864] [64.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.886] [64.0] debug2: hwloc_topology_load
[2023-12-07T16:48:29.918] [64.0] debug2: hwloc_topology_export_xml
[2023-12-07T16:48:29.922] [64.0] debug2: Entering _setup_normal_io
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: entering
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: msg->nodeid = 2
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: leaving
[2023-12-07T16:48:29.923] [64.0] debug2: Leaving  _setup_normal_io
[2023-12-07T16:48:29.923] [64.0] debug levels are stderr='error', logfile='debug2', syslog='quiet'
[2023-12-07T16:48:29.923] [64.0] debug:  IO handler started pid=4188
[2023-12-07T16:48:29.925] [64.0] starting 1 tasks
[2023-12-07T16:48:29.925] [64.0] task 2 (4194) started 2023-12-07T16:48:29
[2023-12-07T16:48:29.926] [64.0] debug:  Setting slurmstepd oom_adj to -1000
[2023-12-07T16:48:29.926] [64.0] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.959] [64.0] debug2: adding task 2 pid 4194 on node 2 to jobacct
[2023-12-07T16:48:29.960] [64.0] debug:  Sending launch resp rc=0
[2023-12-07T16:48:29.961] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.961] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2023-12-07T16:48:29.961] [64.0] debug:  task/affinity: task_p_pre_launch: affinity StepId=64.0, task:2 bind:none,one_thread
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_STACK  : max:inf cur:inf req:8388608
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_CORE   : max:inf cur:inf req:0
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NPROC  : max:1030021 cur:1030021 req:1030020
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:4096 req:1024
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:33761472512
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2023-12-07T16:48:59.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:59.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:51:03.457] debug:  Log file re-opened
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.457] debug2: hwloc_topology_init
[2023-12-07T16:51:03.462] debug2: hwloc_topology_load
[2023-12-07T16:51:03.480] debug2: hwloc_topology_export_xml
[2023-12-07T16:51:03.482] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.483] debug2: hwloc_topology_init
[2023-12-07T16:51:03.484] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:51:03.485] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.485] topology/none: init: topology NONE plugin loaded
[2023-12-07T16:51:03.485] route/default: init: route default plugin loaded
[2023-12-07T16:51:03.485] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-07T16:51:03.487] debug:  Resource spec: No specialized cores configured by default on this node
[2023-12-07T16:51:03.487] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-12-07T16:51:03.490] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:51:03.490] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:51:03.490] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-12-07T16:51:03.490] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:51:03.490] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:51:03.491] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:51:03.491] slurmd version 21.08.5 started
[2023-12-07T16:51:03.491] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 58 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 56 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/56/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.492] error: _restore_ns: failed to connect to stepd for 64.
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] error: Encountered an error while restoring job containers.
[2023-12-07T16:51:03.492] error: Unable to restore job_container state.
[2023-12-07T16:51:03.493] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:51:03.493] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:51:03.493] slurmd started on Thu, 07 Dec 2023 16:51:03 -0600
[2023-12-07T16:51:03.493] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257579 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:51:03.494] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:51:03.495] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-07T16:51:03.499] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-07T16:51:03.500] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug:  _rpc_terminate_job: uid = 64030 JobId=64
[2023-12-07T16:51:03.500] debug:  credential for job 64 revoked
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 998
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 18
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 15
[2023-12-07T16:51:03.500] debug2: set revoke expiration for jobid 64 to 1701989583 UTS
[2023-12-07T16:51:03.501] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.501] error: container_g_delete(64): Invalid argument
[2023-12-07T16:51:03.501] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
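
At debug2 verbosity the few error-level lines are easy to miss. Below is a minimal Python sketch, assuming the log format shown above, that keeps only the error: lines and pulls the Uptime value from the node registration line. The function names error_lines and reported_uptime are made up for illustration and are not part of Slurm.

```python
import re

# Matches "[timestamp] [optional step] error: message" in the slurmd log format shown above.
ERROR_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?:\[[^\]]+\]\s+)?error:\s+(?P<msg>.*)$")
UPTIME_RE = re.compile(r"\bUptime=(\d+)\b")


def error_lines(log_text):
    """Return (timestamp, message) pairs for every error-level line."""
    hits = []
    for line in log_text.splitlines():
        m = ERROR_RE.match(line.strip())
        if m:
            hits.append((m.group("ts"), m.group("msg")))
    return hits


def reported_uptime(log_text):
    """Return the first Uptime value found in the log, or None if absent."""
    m = UPTIME_RE.search(log_text)
    return int(m.group(1)) if m else None


# A few lines copied from the log above.
sample = """\
[2023-12-07T16:48:29.864] [64.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument
[2023-12-07T16:51:03.494] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257579 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-07T16:51:03.500] debug:  credential for job 64 revoked
"""

for ts, msg in error_lines(sample):
    print(ts, msg)
print("Uptime at registration:", reported_uptime(sample))
```

An Uptime of 14 on the registration line, together with the "slurmd version 21.08.5 started" message, suggests slurmd started on a machine that had only just booted, so the node itself may have restarted rather than only the job failing.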

Immediately, I see some errors regarding /var/nvme/storage (this is a local folder on each node, not a network-shared location on the NAS), but this is the same across all nodes and is only causing issues on a couple of nodes. Note that this is the base path as set in job_container.conf:

AutoBasePath=true
BasePath=/var/nvme/storage

Additionally, here is cgroup.conf:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup

…and slurm.conf:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cauchy
SlurmctldHost=cauchy
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=1000000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
PrologFlags=contain
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=120
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
MaxArraySize=100000
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,bf_job_part_count_reserve=5,bf_continue
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
####### Priority Begin ##################
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightAge=100
PriorityWeightPartition=10000
PriorityWeightJobSize=0
PriorityMaxAge=14-0
PriorityFavorSmall=YES
#PriorityWeightQOS=10000
#PriorityWeightTRES=cpu=2000,mem=1,gres/gpu=400
#AccountingStorageTRES=gres/gpu
#AccountingStorageEnforce=all
#FairShareDampeningFactor=5
####### Priority End ##################
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobContainerType=job_container/tmpfs
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log

#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cn1 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn2 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn3 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn4 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn5 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn6 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn7 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn8 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn9 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn10 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn11 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn12 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn13 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn14 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn15 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn16 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000

PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP PriorityJobFactor=2000
PartitionName=low Nodes=ALL MaxTime=INFINITE State=UP PriorityJobFactor=1000

EDIT: I cancelled a job (submitted to 4 nodes, cn1 – cn4) that showed this error before it could reallocate to new nodes / overwrite the Slurm error file. Here are the contents of the error file:

Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
slurmstepd-cn1: error: *** JOB 74 ON cn1 CANCELLED AT 2023-12-10T10:05:57 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

The Authorization required... error is pervasive on all nodes / all simulations, so I'm not sure it is of particular consequence for the consistent failure of the cn4 node. The segfault doesn't appear when the same job is run on other nodes, so this is new info / anomalous. The slurmctld log is not particularly illuminating:

[2023-12-09T19:54:46.401] _slurm_rpc_submit_batch_job: JobId=65 InitPrio=10000 usec=493
[2023-12-09T19:54:46.826] sched/backfill: _start_job: Started JobId=65 in main on cn[1-4]
[2023-12-09T20:01:33.529] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702173678 last response=1702173587
[2023-12-09T20:01:33.529] requeue job JobId=65 due to failure of node cn4
[2023-12-09T20:01:38.334] Requeuing JobId=65
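
The validate_node_specs line does carry one useful fact: slurmctld saw cn4 report a boot time later than its last response, meaning the machine itself restarted. A small Python sketch, assuming the message format shown above (the helper name find_reboots is hypothetical), extracts the two epoch values and the gap between them:

```python
import re

# Pattern for the slurmctld message format shown above; both values are Unix epoch seconds.
REBOOT_RE = re.compile(
    r"Node (?P<node>\S+) unexpectedly rebooted "
    r"boot_time=(?P<boot>\d+) last response=(?P<last>\d+)"
)


def find_reboots(log_text):
    """Return (node, boot_time, last_response, gap_seconds) for each reboot message."""
    events = []
    for m in REBOOT_RE.finditer(log_text):
        boot, last = int(m.group("boot")), int(m.group("last"))
        events.append((m.group("node"), boot, last, boot - last))
    return events


# The line copied from the slurmctld log above.
sample = (
    "[2023-12-09T20:01:33.529] validate_node_specs: Node cn4 unexpectedly rebooted "
    "boot_time=1702173678 last response=1702173587\n"
)

for node, boot, last, gap in find_reboots(sample):
    print(f"{node}: booted {gap} s after its last response to slurmctld")
```

For the line above this reports a gap of 91 seconds, which points at a crash or reset of the whole machine rather than a problem confined to slurmd.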

Best Answer

I think this is mostly solved. I still need to do more testing, but things are mostly stable at this point.

The first issue was hardware. The segfaults described above pointed to memory problems, so I ran a suite of tests with memtest86+. The tests revealed many failures with all sticks installed, but each stick passed when tested individually. After reseating the CPU, memtest passed with all sticks installed. So I think the initial problem was a poorly seated CPU.

The second issue was related to these types of errors:

[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] error: Encountered an error while restoring job containers.
[2023-12-07T16:51:03.492] error: Unable to restore job_container state.

Note that these errors appear during a job with ID=64. The directories for jobs 54 and 59 in the job container directory /var/nvme/storage/cn4/ were left over from failed jobs and had not been cleaned up automatically. I think Slurm was trying to restore the state of these long-dead jobs and failing, or at least that's the best I can surmise with my naive understanding.

Clearing the /var/nvme/storage/cn*/ directories on all nodes ended these errors, and this node in particular no longer drops (or at least not as frequently; more testing is needed). The open question is why these leftover directories hang around, which is a nuisance, but at least it is easily addressed.
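
To find the leftovers without wiping everything, something like the following Python sketch could list the job directories that no longer belong to a running job. It assumes the layout seen in the log (one numeric directory per job ID under BasePath/<nodename>/) and that the set of active job IDs is supplied by the caller, for example gathered from squeue. The function name stale_job_dirs is made up, and the sketch only reports; it deletes nothing.

```python
import os
import tempfile


def stale_job_dirs(node_dir, active_job_ids):
    """Return sorted job IDs that have a directory under node_dir but are not active."""
    stale = []
    for name in os.listdir(node_dir):
        path = os.path.join(node_dir, name)
        # Assumption: job container directories are named after the numeric job ID.
        if name.isdigit() and os.path.isdir(path) and int(name) not in active_job_ids:
            stale.append(int(name))
    return sorted(stale)


# Demo on a throwaway directory that mimics /var/nvme/storage/cn4/.
with tempfile.TemporaryDirectory() as node_dir:
    for job_id in (54, 59, 64):
        os.makedirs(os.path.join(node_dir, str(job_id)))
    demo_result = stale_job_dirs(node_dir, active_job_ids={64})
    print(demo_result)
```

On a real node the first argument would be the per-node directory, e.g. /var/nvme/storage/cn4, and it is worth checking the reported directories by hand before removing them.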
