Calling a kernel from a kernel – Valuable Tech Notes

A follow up Q from: CUDA: Calling a __device__ function from a kernel

I'm trying to speed up a sort operation. A simplified pseudo version follows:

// some costly swap operation
__device__ void swap(float* ptrA, float* ptrB){
  float saveData;         // swap some
  saveData= *ptrA;        //   big complex
  *ptrA= *ptrB;           //     data chunk
  *ptrB= saveData;
}

// a rather simple sort operation
__global__ void sort(float data[]){
  for (int i=0; i<limit; i++){
    // find left swap point
    // find right swap point
    swap<<<1,1>>>(left, right);
  }
}

(Note: This simple version doesn't show the reduction techniques in the blocks.)
The idea is that it is easy (fast) to identify the swap points, while the swap operation itself is costly (slow). So use one block to find the swap points and other blocks to do the swap operations, i.e. do the actual swapping in parallel.
This sounds like a decent plan. But if the compiler in-lines the device calls, then there is no parallel swapping taking place.
Is there a way to tell the compiler to NOT in-line a device call?

Best Answer

This question was asked a long time ago, but I landed on this page when I googled the same problem, and I think I have found a solution.

Solution:

I came across [this tech brief][1] and saw a neat approach: launching a kernel from within another kernel.

__global__ void kernel_child(float *var1, int N){
    //do data operations here
}


__global__ void kernel_parent(float *var1, int N)
{
    kernel_child<<<1,2>>>(var1,N);
}

Dynamic parallelism, available in CUDA 5.0 and later, makes this possible. It requires a device of compute capability 3.5 or higher, so make sure you compile for the compute_35 architecture or above.

From the terminal: you can build and run the parent kernel above (which in turn launches the child kernel) as follows. Verified on a Linux machine.

$ nvcc -arch=sm_35 -rdc=true yourFile.cu
$ ./a.out
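To tie this back to the swap in the question, here is a minimal sketch of a complete file that could be built with the nvcc command above. The names `swap_chunks`, `parent` and `run_swap_demo`, the fixed swap points and the block size of 128 are all assumptions made for illustration; they are not from the original question or the tech brief. A single parent thread picks the two chunks and launches a child grid that swaps them element by element in parallel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: each thread swaps one element of two equal-length chunks.
__global__ void swap_chunks(float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}

// Parent kernel: one thread chooses the swap points (hard-coded here,
// a real sort would compute them) and launches the child grid.
__global__ void parent(float *data, int chunk) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float *left  = data;          // hypothetical left swap point
        float *right = data + chunk;  // hypothetical right swap point
        int threads = 128;            // assumed block size
        int blocks  = (chunk + threads - 1) / threads;
        swap_chunks<<<blocks, threads>>>(left, right, chunk);
    }
}

// Host helper: fills two adjacent chunks, runs the parent kernel and
// checks that the chunks were exchanged.
bool run_swap_demo(int chunk) {
    int n = 2 * chunk;
    size_t bytes = n * sizeof(float);
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = nullptr;
    bool ok = cudaMalloc(&d, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        parent<<<1, 1>>>(d, chunk);
        // The host-side sync returns only after the parent grid and the
        // child grid it launched have both finished.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }
    ok = ok && cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    // After the swap, the first chunk should hold the old second chunk.
    for (int i = 0; ok && i < chunk; ++i)
        ok = (h[i] == (float)(i + chunk)) && (h[i + chunk] == (float)i);

    cudaFree(d);
    delete[] h;
    return ok;
}

int main() {
    bool ok = run_swap_demo(1024);
    printf("%s\n", ok ? "swap ok" : "swap failed");
    return ok ? 0 : 1;
}
```

The sketch avoids calling `cudaDeviceSynchronize()` from device code, since recent toolkits no longer support that; the parent simply launches the child and returns. If the linker complains about missing device runtime symbols, adding `-lcudadevrt` to the nvcc command is the usual fix. On newer toolkits you may also need to replace `sm_35` with your GPU's architecture.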

Hope it helps. Thank you!

[1]: http://developer.download.nvidia.com/assets/cuda/docs/TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf
