How to set up SGE for CUDA devices

cuda, gridengine

I'm currently facing the problem of integrating GPU servers into an existing SGE environment. Searching Google, I found some examples of clusters where this has been set up, but no information on how it was done.

Is there a howto or tutorial on this anywhere? It doesn't have to be ultra-verbose, but it should contain enough information to get a "cuda queue" up and running…

Thanks in advance…

Edit: To set up a load sensor that reports how many GPUs in a node are free, I've done the following:

  • set the compute mode of the GPUs to exclusive
  • set the GPUs to persistence mode
  • add the following script to the cluster configuration as a load sensor (and set its interval to 1 sec.)
#!/bin/sh
# SGE load sensor: reports the number of available NVIDIA GPUs on this host.

hostname=`uname -n`

while true; do
  # SGE writes a line to stdin whenever it wants a new load report;
  # it sends "quit" when the sensor should shut down.
  read input
  result=$?
  if [ $result -ne 0 ]; then
    exit 1
  fi
  if [ "$input" = "quit" ]; then
    exit 0
  fi

  smitool=`which nvidia-smi`
  result=$?
  if [ $result -ne 0 ]; then
    # nvidia-smi not found: report no available GPUs
    gpusavail=0
  else
    # total GPUs minus GPUs that have a running process
    # (meaningful because the GPUs are in exclusive compute mode)
    gpustotal=`nvidia-smi -L | wc -l`
    gpusused=`nvidia-smi | grep "Process name" -A 6 | grep -v "+-" | grep -v "|=" | grep -v Usage | grep -v "No running" | wc -l`
    gpusavail=`expr $gpustotal - $gpusused`
  fi

  echo begin
  echo "$hostname:gpu:$gpusavail"
  echo end
done

Note: this obviously works only for NVIDIA GPUs.
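For reference, the first two steps above can be done with the nvidia-smi utility (run as root on each node; exact option spellings vary a bit between driver versions, and on older drivers the compute mode is given numerically):

```shell
# Put every GPU into exclusive compute mode (one process per GPU)
nvidia-smi -c EXCLUSIVE_PROCESS

# Enable persistence mode so the driver stays initialized between jobs
nvidia-smi -pm 1
```

Without persistence mode the compute-mode setting is lost when the driver unloads, so you'd want both.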

Best Answer

The strategy is actually fairly simple.

Using qconf -mc you can create a complex resource called gpu (or whatever you wish to name it). The resource definition should look something like:

#name               shortcut   type        relop   requestable consumable default  urgency     
#----------------------------------------------------------------------------------------------
gpu                 gpu        INT         <=      YES         YES        0        0
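If you'd rather not go through the interactive editor, qconf can also load the complex list from a file; a sketch (the temp-file path is arbitrary, and you should keep the dump as a backup):

```shell
# Dump the current complex list, append the gpu consumable, reload it
qconf -sc > /tmp/complexes.txt
echo "gpu gpu INT <= YES YES 0 0" >> /tmp/complexes.txt
qconf -Mc /tmp/complexes.txt
```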

Then you should edit your exec host definitions with qconf -me to set the number of GPUs on exec hosts that have them:

hostname              node001
load_scaling          NONE
complex_values        gpu=2
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

Now that you've set up your exec hosts, you can request GPU resources when submitting jobs, e.g. qsub -l gpu=1, and Grid Engine will keep track of how many GPUs are available.
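The request can also go into the job script itself; a minimal sketch (the binary name is made up):

```shell
#!/bin/sh
#$ -N gpu_job
#$ -cwd
#$ -l gpu=1        # consume one unit of the gpu complex defined above

./my_cuda_program  # hypothetical CUDA binary
```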

If more than one GPU-using job can run on a node, you may want to place your GPUs into exclusive compute mode so that each job gets a GPU to itself. You can do this with the nvidia-smi utility.