We're using SGE with a resource complex called 'gpu.q' that allows resource management of GPU devices (all NVIDIA devices). However, the systems have multiple GPU devices (in exclusive mode), and if two jobs are allocated to the same node, there is no way for the user to transparently create a context on the correct GPU.
Has anyone run into this problem? I was thinking of managing specific GPU resources somehow and mapping host and device IDs. Something like
hostA -> gpu0:in_use
hostA -> gpu1:free
hostB -> gpu0:free
hostB -> gpu1:in_use
etc. Then, upon resource request, reveal the allocated GPU resources on each host through the CUDA_VISIBLE_DEVICES environment variable.
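For illustration, if the scheduler exported CUDA_VISIBLE_DEVICES=1 for a job landing on hostA, the job would see only gpu1, renumbered as device 0. A minimal job-side sketch (CUDA_VISIBLE_DEVICES is the real CUDA mechanism; the program itself is just a demo):

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Run with e.g. CUDA_VISIBLE_DEVICES=1 set by the scheduler:
 * the process then sees a single GPU, renumbered as device 0. */
int main(void)
{
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("devices visible to this job: %d\n", n);
    return 0;
}
```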
This seems like a fairly common issue; with the prevalence of GPUs in compute clusters, it must have been solved by someone by now.
Best Answer
As I found out the hard way, you can't just enumerate devices and then call cudaSetDevice(). cudaSetDevice() always succeeds if the device is present and you haven't yet created a context. The solution I worked out here, with some tips from NVidians, is to use nvidia-smi to set the compute mode on all GPUs to process exclusive, then to filter out devices that can't be used for your task with cudaSetValidDevices(), and finally to call cudaFree() to force the CUDA driver to create a context on an available device.
If the call to cudaFree() fails, there are no devices available. A minimal sketch of the sequence (the compute-capability test below is a hypothetical stand-in for whatever per-task requirement you would actually check):
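```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    /* Build the list of devices this task is willing to use.
     * The compute-capability test is a hypothetical filter. */
    int *valid = malloc(device_count * sizeof *valid);
    int num_valid = 0;
    for (int d = 0; d < device_count; ++d) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (prop.major >= 2)
            valid[num_valid++] = d;
    }

    /* Restrict the runtime to the filtered list... */
    if (num_valid == 0 || cudaSetValidDevices(valid, num_valid) != cudaSuccess) {
        fprintf(stderr, "no suitable devices for this task\n");
        free(valid);
        return 1;
    }

    /* ...then force the driver to create a context. With all GPUs in
     * process-exclusive mode, this binds us to the first listed device
     * that is actually free, and fails if every one of them is busy. */
    if (cudaFree(0) != cudaSuccess) {
        fprintf(stderr, "no free GPU available on this node\n");
        free(valid);
        return 1;
    }

    int dev = -1;
    cudaGetDevice(&dev);
    printf("acquired exclusive context on device %d\n", dev);
    free(valid);
    return 0;
}
```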
Note: if the GPUs aren't in exclusive mode, you'll need to manage them explicitly from your queueing system somehow. The method described here would let a consumable resource track all the tasks on a node, ensuring they never request more GPUs than are available on it, and then exploits exclusive mode to prevent collisions.