The CUDA runtime makes it possible to compile and link your CUDA kernels into executables. This means that you don't have to distribute cubin files with your application, or deal with loading them through the driver API. As you have noted, it is generally easier to use.
In contrast, the driver API is harder to program but provides more control over how CUDA is used. The programmer has to deal directly with initialization, module loading, and so on.
Apparently, more detailed device information can be queried through the driver API than through the runtime API. For instance, the amount of free memory available on the device can be queried only through the driver API.
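As a concrete sketch of both points, the following hedged example queries free and total device memory through the driver API; the explicit cuInit and context handling have no counterpart in runtime API code, where initialization is implicit. (Device 0 and the zero context flags are illustrative assumptions.)

```cuda
#include <cuda.h>
#include <cstdio>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    size_t free_mem, total_mem;

    cuInit(0);                  // explicit initialization: driver API only
    cuDeviceGet(&dev, 0);       // pick device 0 (an assumption for this sketch)
    cuCtxCreate(&ctx, 0, dev);  // contexts are created and destroyed manually

    cuMemGetInfo(&free_mem, &total_mem);
    printf("free: %zu bytes, total: %zu bytes\n", free_mem, total_mem);

    cuCtxDestroy(ctx);
    return 0;
}
```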
From the CUDA Programmer's Guide:
It is composed of two APIs:

- A low-level API called the CUDA driver API,
- A higher-level API called the CUDA runtime API that is implemented on top of the CUDA driver API.

These APIs are mutually exclusive: An application should use either one or the other.

The CUDA runtime eases device code management by providing implicit initialization, context management, and module management. The C host code generated by nvcc is based on the CUDA runtime (see Section 4.2.5), so applications that link to this code must use the CUDA runtime API.

In contrast, the CUDA driver API requires more code, is harder to program and debug, but offers a better level of control and is language-independent since it only deals with cubin objects (see Section 4.2.5). In particular, it is more difficult to configure and launch kernels using the CUDA driver API, since the execution configuration and kernel parameters must be specified with explicit function calls instead of the execution configuration syntax described in Section 4.2.3. Also, device emulation (see Section 4.5.2.9) does not work with the CUDA driver API.
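To make the quoted point about launch configuration concrete, here is a hedged sketch contrasting the two launch styles. The kernel name add_one, its cubin file, and the 256-thread block size are illustrative assumptions, and the modern cuLaunchKernel entry point is used rather than the older cuLaunchGrid family:

```cuda
#include <cuda.h>

// Driver API: module loading, parameter passing, and execution configuration
// are all explicit function calls operating on a cubin object.
void launch_with_driver_api(CUdeviceptr d_data, int n)
{
    CUmodule mod;
    CUfunction fn;
    cuModuleLoad(&mod, "add_one.cubin");
    cuModuleGetFunction(&fn, mod, "add_one");

    void *args[] = { &d_data, &n };
    cuLaunchKernel(fn,
                   (n + 255) / 256, 1, 1,  // grid dimensions
                   256, 1, 1,              // block dimensions
                   0, NULL, args, NULL);   // shared memory, stream, parameters
}

// Runtime API equivalent: one line of execution configuration syntax,
// with module management handled implicitly by nvcc:
//
//     add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```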
There is no noticeable performance difference between the APIs. How your kernels use memory and how they are laid out on the GPU (in warps and blocks) will have a much more pronounced effect.
You need to ensure that your driver version matches or exceeds your CUDA Toolkit version.
For 2.3 you need a 190.x driver, for 3.0 you need 195.x and for 3.1 you need 256.x (actually anything up to the next multiple of five is ok, e.g. 258.x for 3.1).
You can check your driver version either by running the deviceQueryDrv SDK sample or by opening the NVIDIA Control Panel and choosing System Information.
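You can also check programmatically from a runtime API application; a minimal sketch using cudaDriverGetVersion and cudaRuntimeGetVersion (both report versions encoded as 1000*major + 10*minor, e.g. 3010 for CUDA 3.1):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion); // runtime version the application was built against

    printf("driver supports: %d, runtime: %d\n", driverVersion, runtimeVersion);
    if (driverVersion < runtimeVersion)
        printf("Driver is too old for this toolkit - update it.\n");
    return 0;
}
```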
Download an updated driver from www.nvidia.com/drivers.
Best Answer
Probably the best way to check for errors in runtime API code is to define an assert-style handler function and wrapper macro, like this:
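A minimal sketch of such a handler, reconstructed to match the gpuErrchk and gpuAssert names used below:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrapper macro: records the file and line of the call site.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

// Assert-style handler: print a textual description of the error and
// where it occurred, then (optionally) exit.
inline void gpuAssert(cudaError_t code, const char *file, int line,
                      bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
```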
You can then wrap each API call with the gpuErrchk macro, which will process the return status of the API call it wraps, for example:
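(a_d and size are placeholder variables)

```cuda
gpuErrchk( cudaMalloc(&a_d, size * sizeof(int)) );
```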
If there is an error in a call, a textual message describing the error and the file and line in your code where the error occurred will be emitted to stderr, and the application will exit. You could conceivably modify gpuAssert to raise an exception rather than call exit() in a more sophisticated application if it were required.

A second related question is how to check for errors in kernel launches, which can't be directly wrapped in a macro call like standard runtime API calls. For kernels, something like this:
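(kernel and a are placeholders)

```cuda
kernel<<<1, 1>>>(a);
gpuErrchk( cudaPeekAtLastError() );   // catches invalid launch arguments
gpuErrchk( cudaDeviceSynchronize() ); // catches errors during execution
```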
will first check for an invalid launch argument, then force the host to wait until the kernel stops and check for an execution error. The synchronisation can be eliminated if you have a subsequent blocking API call like this:
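(again a sketch with placeholder names; the copy back to a_h blocks until the kernel has finished)

```cuda
kernel<<<1, 1>>>(a_d);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaMemcpy(a_h, a_d, size * sizeof(int), cudaMemcpyDeviceToHost) );
```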
in which case the cudaMemcpy call can return either errors which occurred during the kernel execution or those from the memory copy itself. This can be confusing for the beginner, and I would recommend using explicit synchronisation after a kernel launch during debugging to make it easier to understand where problems might be arising.

Note that when using CUDA Dynamic Parallelism, a very similar methodology can and should be applied to any usage of the CUDA runtime API in device kernels, as well as after any device kernel launches:
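A device-side sketch of the same pattern (cdpErrchk and cdpAssert are illustrative names; exit() is not available in device code, so assert() is used to stop the kernel instead):

```cuda
#include <cassert>
#include <cstdio>

#define cdpErrchk(ans) { cdpAssert((ans), __FILE__, __LINE__); }

__device__ void cdpAssert(cudaError_t code, const char *file, int line,
                          bool abort = true)
{
    if (code != cudaSuccess) {
        printf("GPU kernel assert: %s %s %d\n",
               cudaGetErrorString(code), file, line);
        if (abort) assert(0);
    }
}

// Usage inside a __global__ function, after a child kernel launch:
//     childKernel<<<1, 1>>>(...);
//     cdpErrchk( cudaPeekAtLastError() );
```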