CUDA: Calling a device function from a kernel

cuda

I have a kernel that calls a device function inside an if statement. The code is as follows:

__device__ void SetValues(int *ptr,int id)
{
    if(ptr[threadIdx.x]==id) //question related to here
          ptr[threadIdx.x]++;
}

__global__ void Kernel(int *ptr)
{
    if(threadIdx.x<2)
         SetValues(ptr,threadIdx.x);
}

In the kernel threads 0-1 call SetValues concurrently. What happens after that? I mean there are now 2 concurrent calls to SetValues. Does every function call execute serially? So they behave like 2 kernel function calls?

Best Answer

CUDA actually inlines all functions by default (although Fermi and newer architectures do also support a proper ABI with function pointers and real function calls). So your example code gets compiled to something like this

__global__ void Kernel(int *ptr)
{
    if(threadIdx.x<2)
        if(ptr[threadIdx.x]==threadIdx.x)
            ptr[threadIdx.x]++;
}

Execution happens in parallel, just like normal code. If you engineer a memory race into a function, there is no serialization mechanism that can save you.

Related Solutions

CUDA allocate memory in device function

According to http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf you should be able to use malloc() and free() in a device function.

Page 122

B.15 Dynamic Global Memory Allocation void* malloc(size_t size); void free(void* ptr); allocate and free memory dynamically from a fixed-size heap in global memory.

The example given in the manual.

__global__ void mallocTest()
{
    char* ptr = (char*)malloc(123);
    printf(“Thread %d got pointer: %p\n”, threadIdx.x, ptr);
    free(ptr);
}

void main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaThreadSynchronize();
}

You need the compiler paramter -arch=sm_20 and a card that supports >2x architecture.

How to get the CUDA version

As Jared mentions in a comment, from the command line:

nvcc --version

(or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version).

From application code, you can query the runtime API version with

cudaRuntimeGetVersion()

or the driver API version with

cudaDriverGetVersion()

As Daniel points out, deviceQuery is an SDK sample app that queries the above, along with device capabilities.

As others note, you can also check the contents of the version.txt using (e.g., on Mac or Linux)

cat /usr/local/cuda/version.txt

However, if there is another version of the CUDA toolkit installed other than the one symlinked from /usr/local/cuda, this may report an inaccurate version if another version is earlier in your PATH than the above, so use with caution.

Best Answer

Related Solutions

CUDA allocate memory in __device__ function

How to get the CUDA version

Related Topic

CUDA allocate memory in device function