Why does requesting GPUs as a generic resource fail on a cluster running SLURM with the built-in plugin?

Tags: cluster, hpc, job-scheduler

Disclaimer: This post is quite long as I tried to provide all relevant configuration information.

Status and Problem:

I administer a GPU cluster and want to use SLURM for job management.
Unfortunately, I cannot request GPUs via the corresponding generic resources (GRES) plugin of SLURM.

Note: test.sh is a small script printing the environment variable CUDA_VISIBLE_DEVICES.
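
The script itself is not included here; a minimal sketch of what such a test.sh could look like (the exact contents are an assumption):

#!/bin/bash
# Print the GPU devices SLURM exposed to this job step.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"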

Running a job with --gres=gpu:1 does not complete

Running srun -n1 --gres=gpu:1 test.sh results in the following error:

srun: error: Unable to allocate resources: Requested node configuration is not available

Log:

gres: gpu state for job 83
    gres_cnt:4 node_cnt:0 type:(null)
    _pick_best_nodes: job 83 never runnable
    _slurm_rpc_allocate_resources: Requested node configuration is not available
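
A quick sanity check (not part of the failing run itself) to see which GRES the controller actually advertises for the nodes the job would be matched against:

 > sinfo -o "%N %G"
 > scontrol show node smurf01 | grep Gres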

Running a job with --gres=gram:500 does complete

If I call srun -n1 --gres=gram:500 test.sh however, the job runs and prints

CUDA_VISIBLE_DEVICES=NoDevFiles

Log:

sched: _slurm_rpc_allocate_resources JobId=76 NodeList=smurf01 usec=193
debug:  Configuration for job 76 complete
debug:  laying out the 1 tasks on 1 hosts smurf01 dist 1
job_complete: JobID=76 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
job_complete: JobID=76 State=0x8003 NodeCnt=1 done

Thus SLURM seems to be configured correctly to run jobs with generic resources requested via --gres, but for some reason it does not recognize the GPUs.

My first idea was to use a different name for the GPU generic resource (sketched below), since the other generic resources seem to work, but I'd like to stick to the gpu plugin.
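
For reference, that rename workaround would look roughly as follows, using a hypothetical resource name cudadev; like gram it would be a plain counted resource, so the gpu plugin would no longer manage CUDA_VISIBLE_DEVICES:

# slurm.conf (sketch)
GresTypes=cudadev,ram,gram,scratch
NodeName=smurf01 ... Gres=cudadev:8,ram:48,gram:no_consume:6000,scratch:1300

# gres.conf (sketch)
Name=cudadev Count=8

 > srun -n1 --gres=cudadev:1 test.sh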

Configuration

The cluster has more than two slave hosts, but for the sake of clarity I'll stick to two slightly differently configured slave hosts plus the controller host: papa (controller), smurf01, and smurf02.

slurm.conf

The parts of the SLURM configuration relevant to generic resources:

...
TaskPlugin=task/cgroup
...
GresTypes=gpu,ram,gram,scratch
...
NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
...

Note: ram is counted in GB, gram in MB, and scratch in GB.
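
With these units, a request combining the custom resources would look like this (example values only):

 > srun -n1 --gres=ram:4,gram:500,scratch:10 test.sh   # 4 GB RAM, 500 MB GPU RAM, 10 GB scratch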

Output of scontrol show node

NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
   NodeAddr=192.168.1.101 NodeHostName=smurf01 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2015-04-23T13:58:15 SlurmdStartTime=2015-04-24T10:30:46
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=smurf02 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.01 Features=intel,fermi
   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
   NodeAddr=192.168.1.102 NodeHostName=smurf02 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2015-04-23T13:57:56 SlurmdStartTime=2015-04-24T10:24:12
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

smurf01 configuration

GPUs

 > ls /dev | grep nvidia
nvidia0
... 
nvidia7
 > nvidia-smi | grep Tesla
|   0  Tesla M2090         On   | 0000:08:00.0     Off |                    0 |
... 
|   7  Tesla M2090         On   | 0000:1B:00.0     Off |                    0 |
...

gres.conf

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=1
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=2
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=3
Name=gpu Type=tesla File=/dev/nvidia4 CPUs=4
Name=gpu Type=tesla File=/dev/nvidia5 CPUs=5
Name=gpu Type=tesla File=/dev/nvidia6 CPUs=6
Name=gpu Type=tesla File=/dev/nvidia7 CPUs=7
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

smurf02 configuration

GPUs

Same configuration/output as smurf01.

gres.conf on smurf02

Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

Note: The daemons have been restarted and the machines have been rebooted as well. The slurm user and the job-submitting user have the same IDs/groups on the slave and controller nodes, and munge authentication is working properly.
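
Standard ways to verify these points (commands shown for illustration only; exact service names depend on the distribution/init system):

 > systemctl restart slurmctld            # on the controller
 > systemctl restart slurmd               # on each slave node
 > munge -n | unmunge                     # munge works locally
 > munge -n | ssh smurf01 unmunge         # keys and clocks match across nodes
 > id slurm                               # same uid/gid on controller and slaves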

Log outputs

I added DebugFlags=Gres to the slurm.conf file, and the GPUs seem to be recognized by the plugin:
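
The change itself is just one line in slurm.conf, picked up with a reconfigure (sketch):

DebugFlags=Gres

 > scontrol reconfigure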

Controller log

gres/gpu: state for smurf01
   gres_cnt found:8 configured:8 avail:8 alloc:0
   gres_bit_alloc:
   gres_used:(null)
   topo_cpus_bitmap[0]:0
   topo_gres_bitmap[0]:0
   topo_gres_cnt_alloc[0]:0
   topo_gres_cnt_avail[0]:1
   type[0]:tesla
   topo_cpus_bitmap[1]:1
   topo_gres_bitmap[1]:1
   topo_gres_cnt_alloc[1]:0
   topo_gres_cnt_avail[1]:1
   type[1]:tesla
   topo_cpus_bitmap[2]:2
   topo_gres_bitmap[2]:2
   topo_gres_cnt_alloc[2]:0
   topo_gres_cnt_avail[2]:1
   type[2]:tesla
   topo_cpus_bitmap[3]:3
   topo_gres_bitmap[3]:3
   topo_gres_cnt_alloc[3]:0
   topo_gres_cnt_avail[3]:1
   type[3]:tesla
   topo_cpus_bitmap[4]:4
   topo_gres_bitmap[4]:4
   topo_gres_cnt_alloc[4]:0
   topo_gres_cnt_avail[4]:1
   type[4]:tesla
   topo_cpus_bitmap[5]:5
   topo_gres_bitmap[5]:5
   topo_gres_cnt_alloc[5]:0
   topo_gres_cnt_avail[5]:1
   type[5]:tesla
   topo_cpus_bitmap[6]:6
   topo_gres_bitmap[6]:6
   topo_gres_cnt_alloc[6]:0
   topo_gres_cnt_avail[6]:1
   type[6]:tesla
   topo_cpus_bitmap[7]:7
   topo_gres_bitmap[7]:7
   topo_gres_cnt_alloc[7]:0
   topo_gres_cnt_avail[7]:1
   type[7]:tesla
   type_cnt_alloc[0]:0
   type_cnt_avail[0]:8
   type[0]:tesla
...
gres/gpu: state for smurf02
   gres_cnt found:TBD configured:8 avail:8 alloc:0
   gres_bit_alloc:
   gres_used:(null)
   type_cnt_alloc[0]:0
   type_cnt_avail[0]:8
   type[0]:tesla

Slave log

Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
...
gpu 0 is device number 0
gpu 1 is device number 1
gpu 2 is device number 2
gpu 3 is device number 3
gpu 4 is device number 4
gpu 5 is device number 5
gpu 6 is device number 6
gpu 7 is device number 7

Best Answer

SLURM in the installed version (14.11.5) seems to have problems with types assigned to GPUs: removing Type=... from gres.conf and changing the node configuration lines accordingly (to Gres=gpu:N,ram:...) results in the successful execution of jobs that request GPUs via --gres=gpu:N.
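
A sketch of the resulting type-free configuration (assumed from the description above; the final files are not reproduced here):

# gres.conf on the slave nodes
Name=gpu File=/dev/nvidia[0-7]
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

# slurm.conf node line
NodeName=smurf01 ... Gres=gpu:8,ram:48,gram:no_consume:6000,scratch:1300

 > srun -n1 --gres=gpu:1 test.sh   # should now run and set CUDA_VISIBLE_DEVICES to a real device index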