We have some fairly fat nodes in our SLURM cluster (e.g. 14 cores). I'm trying to configure SLURM so that multiple batch jobs can run in parallel, each requesting, for example, 3 cores. However, I can't get that to work.
Example batch job:
#!/bin/bash
#
#SBATCH --job-name=job1
#SBATCH --output=job1.txt
#
#SBATCH -c 3
#SBATCH -N 1
srun sleep 300
srun echo $HOSTNAME
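To reproduce the problem, two such jobs can be submitted back to back and then inspected with squeue. (The script names here are assumptions; the second script is presumed to be a copy of the one above with the job name and output file changed.)

```shell
sbatch job1.sh
sbatch job2.sh
squeue        # with 3 cores each on a 12-CPU node, both should run at once
```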
Excerpt from the slurm.conf file:
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
NodeName=some-node NodeAddr=192.168.60.106 CPUs=12 State=UNKNOWN
But, if I run the two jobs, I get the following error:
sbatch: error: CPU count per node can not be satisfied
I found quite a few examples saying that the sbatch -n option is what controls the number of CPUs or cores per batch job. However, that does not make sense to me, since the documentation states:
Controls the number of tasks to be created for the job
If I try it, the jobs just run sequentially:
JOBID PARTITION NAME USER      ST TIME NODES NODELIST(REASON)
   16 mainpart  job2 some-user PD 0:00     1 (Resources)
   15 mainpart  job1 some-user R  4:04     1 some-node
Best Answer
I had the same problem for days with SLURM running only one job per node no matter what I put into the batch files. The following combination of settings finally allowed me to get multiple batches running on a single node.
Before starting, ensure there are no jobs running and take your nodes down by stopping the SLURM services. See this answer for more on service vs. systemctl for doing so on most Linux systems.
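On a systemd-based distribution, stopping the services might look like the following (service names can vary by packaging; slurmctld on the controller and slurmd on compute nodes are the usual names):

```shell
# On the controller node
sudo systemctl stop slurmctld

# On each compute node
sudo systemctl stop slurmd
```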
In /etc/slurm-llnl/slurm.conf (location may differ)
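The relevant lines are roughly the following. This is only a sketch: the node name, address, and hardware topology below are placeholder values, and you should substitute the figures reported by your own node (ideally taken straight from slurmd -C, described next).

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=some-node NodeAddr=192.168.60.106 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
```

The key point is that the node definition must match the real hardware: with select/cons_res and CR_Core, SLURM allocates individual cores to jobs, but only if the socket/core/thread topology it was told about actually exists on the node.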
This is obviously specific to one particular node, and yours will differ. But if the node is not defined correctly, SLURM can return errors about resources being unavailable. To get reliable information about your node's actual hardware, try the following on each node:
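slurmd itself can report the hardware it detects as a ready-made configuration line:

```shell
slurmd -C
```

This prints a NodeName=... line with CPUs, Sockets, CoresPerSocket, ThreadsPerCore, and RealMemory filled in from the machine's actual hardware, which you can paste into slurm.conf.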
Then use its output to define each node in the controller's slurm.conf file. When things are set up, start SLURM back up again and send it some test batches to see if they spread out across the nodes properly.
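Restarting and testing might look like the following (service names and batch script names are assumptions):

```shell
# Controller, then each compute node
sudo systemctl start slurmctld
sudo systemctl start slurmd

# Submit two 3-core jobs and check that both enter state R together
sbatch job1.sh
sbatch job2.sh
squeue
```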