We're using SGE with a resource complex called 'gpu.q' that allows resource management of GPU devices (all NVIDIA devices). However, the systems have multiple GPU devices (in exclusive mode), and if two jobs are allocated to the same node, there is no way for the user to transparently create a context on the correct GPU.
Has anyone run into this problem? I was thinking of managing specific GPU resources somehow and mapping host and device IDs. Something like
hostA -> gpu0:in_use
hostA -> gpu1:free
hostB -> gpu0:free
hostB -> gpu1:in_use
etc. Then, upon resource request, reveal the allocated GPU resources on each host through the CUDA_VISIBLE_DEVICES environment variable.
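For illustration, if the scheduler exported CUDA_VISIBLE_DEVICES=1 for a job landing on hostA, the job would see only gpu1, renumbered as device 0. A minimal job-side sketch (CUDA_VISIBLE_DEVICES is the real CUDA mechanism; the program itself is just a demo):

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Run with e.g. CUDA_VISIBLE_DEVICES=1 set by the scheduler:
 * the process then sees a single GPU, renumbered as device 0. */
int main(void)
{
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("devices visible to this job: %d\n", n);
    return 0;
}
```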
This seems like a fairly common issue; with the prevalence of GPUs in compute clusters, it must have been solved by someone by now.
Best Answer
As I found out the hard way, you can't just enumerate devices and then call cudaSetDevice(). cudaSetDevice() always succeeds if the device is present and you haven't yet created a context. The solution I worked out here, with some tips from NVidians, is to use nvidia-smi to set the compute mode on all GPUs to process exclusive, then to filter out devices that can't be used for your task with cudaSetValidDevices(), and finally to call cudaFree() to force the CUDA driver to create a context on an available device.
If the call to cudaFree() fails, there are no devices available. A minimal sketch of the sequence (the compute-capability test below is a hypothetical stand-in for whatever per-task requirement you would actually check):
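```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    /* Build the list of devices this task is willing to use.
     * The compute-capability test is a hypothetical filter. */
    int *valid = malloc(device_count * sizeof *valid);
    int num_valid = 0;
    for (int d = 0; d < device_count; ++d) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (prop.major >= 2)
            valid[num_valid++] = d;
    }

    /* Restrict the runtime to the filtered list... */
    if (num_valid == 0 || cudaSetValidDevices(valid, num_valid) != cudaSuccess) {
        fprintf(stderr, "no suitable devices for this task\n");
        free(valid);
        return 1;
    }

    /* ...then force the driver to create a context. With all GPUs in
     * process-exclusive mode, this binds us to the first listed device
     * that is actually free, and fails if every one of them is busy. */
    if (cudaFree(0) != cudaSuccess) {
        fprintf(stderr, "no free GPU available on this node\n");
        free(valid);
        return 1;
    }

    int dev = -1;
    cudaGetDevice(&dev);
    printf("acquired exclusive context on device %d\n", dev);
    free(valid);
    return 0;
}
```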
Note: if the GPUs aren't in exclusive mode, you'll need to manage them explicitly from your queueing system somehow. The method described here would let a consumable resource track all the tasks on a node, ensuring they never request more GPUs than are available on it, and then exploits exclusive mode to prevent collisions.