CUDA inlines all functions by default (although Fermi and newer architectures also support a proper ABI with function pointers and real function calls). So your example code gets compiled to something like this:
__global__ void Kernel(int *ptr)
{
    if (threadIdx.x < 2)
        if (ptr[threadIdx.x] == threadIdx.x)
            ptr[threadIdx.x]++;
}
Execution happens in parallel, just like normal code. If you engineer a memory race into a function, there is no serialization mechanism that can save you.
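For instance, here is a minimal sketch of such a race (the bump helper and kernel names are hypothetical; atomicAdd is the standard fix):
__device__ void bump(int *counter)
{
    (*counter)++;          // read-modify-write is not atomic: updates from
                           // concurrent threads can be lost
}

__global__ void RaceKernel(int *counter)
{
    bump(counter);         // inlined; every thread races on *counter
                           // atomicAdd(counter, 1) would serialize correctly
}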
As Jared mentions in a comment, from the command line:
nvcc --version
(or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version).
From application code, you can query the runtime API version with cudaRuntimeGetVersion() or the driver API version with cudaDriverGetVersion().
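For example, a minimal sketch querying both (the encoded value is 1000*major + 10*minor, so 9020 means CUDA 9.2):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);   // version this app was built against
    cudaDriverGetVersion(&driverVersion);     // latest version the installed driver supports
    printf("Runtime: %d, Driver: %d\n", runtimeVersion, driverVersion);
    return 0;
}
Build it with nvcc and run it on the target machine; the driver version may be newer than the runtime version.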
As Daniel points out, deviceQuery is an SDK sample app that queries the above, along with device capabilities.
As others note, you can also check the contents of version.txt (e.g., on Mac or Linux):
cat /usr/local/cuda/version.txt
However, if another version of the CUDA toolkit is installed besides the one symlinked from /usr/local/cuda, this may report an inaccurate version if that other version is earlier in your PATH than the above, so use with caution.
Best Answer
Yes, just mark the function with
__device__
and it will be callable only from the GPU. Check the CUDA Programming Guide, section B.1. Here is the direct link.
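For illustration, a minimal sketch (hypothetical function names):
__device__ int square(int x)   // callable only from GPU code
{
    return x * x;
}

__global__ void SquareKernel(int *ptr)
{
    ptr[threadIdx.x] = square(ptr[threadIdx.x]);  // fine: device-to-device call
}

// Calling square() from host code is a compile-time error;
// mark it __host__ __device__ if you need it on both sides.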