A follow-up question to: CUDA: Calling a __device__ function from a kernel
I'm trying to speed up a sort operation. A simplified pseudocode version follows:
// some costly swap operation
__device__ void swap(float* ptrA, float* ptrB){
    float saveData;      // swap some
    saveData = *ptrA;    // big complex
    *ptrA = *ptrB;       // data chunk
    *ptrB = saveData;
}
// a rather simple sort operation
__global__ void sort(float data[]){
    for (int i=0; i<limit; i++){
        // find left swap point
        // find right swap point
        swap<<<1,1>>>(left, right);
    }
}
(Note: This simple version doesn't show the reduction techniques in the blocks.)
The idea is that identifying the swap points is easy (fast), while the swap operation itself is costly (slow). So use one block to identify the swap points and use other blocks to perform the swap operations, i.e. do the actual swapping in parallel.
This sounds like a decent plan. But if the compiler inlines the device calls, then no parallel swapping takes place.
Is there a way to tell the compiler NOT to inline a device call?
Best Answer
It has been a long time since this question was asked. I landed on this page while googling the same problem, and I think I have found a solution.
Solution:
I came across [this tech brief][1], which describes a neat approach: launching a kernel from within another kernel.
Dynamic parallelism, available in CUDA 5.0 and later, makes this possible. Make sure you compile for, and run on, the compute_35 architecture or above.
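As a rough illustration of that parent/child idea, here is a minimal sketch, not the asker's actual sort. The names `swapChunks`, `sortStep`, and `runSwapDemo`, the chunk size, and the fixed swap points are all made up for the example. The parent kernel stands in for "find the swap points", and the child kernel does the costly swap with one thread per element.

```cuda
// Assumed build line (XX >= 35; dynamic parallelism needs relocatable
// device code and the device runtime library):
//   nvcc -arch=sm_XX -rdc=true demo.cu -o demo -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel (hypothetical name): each thread swaps one element
// of two equally sized chunks, so the swap itself runs in parallel.
__global__ void swapChunks(float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}

// Parent kernel (hypothetical name): a single thread picks the swap
// points and launches the child grid from the device.
__global__ void sortStep(float* data, int chunkLen) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Placeholder swap points; a real sort would compute these.
        float* left  = data;
        float* right = data + chunkLen;
        int threads = 128;
        int blocks  = (chunkLen + threads - 1) / threads;
        swapChunks<<<blocks, threads>>>(left, right, chunkLen);
    }
}

// Host helper: returns true if the two halves of the array were exchanged.
bool runSwapDemo() {
    const int chunkLen = 256;
    const int n = 2 * chunkLen;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    sortStep<<<1, 1>>>(d, chunkLen);
    // The parent grid only completes after its child grids have finished.
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < chunkLen; ++i) {
        if (h[i] != static_cast<float>(i + chunkLen)) return false;
        if (h[i + chunkLen] != static_cast<float>(i)) return false;
    }
    return true;
}

int main() {
    bool ok = runSwapDemo();
    printf("%s\n", ok ? "halves swapped" : "swap failed");
    return ok ? 0 : 1;
}
```

Note that the child is a `__global__` kernel, not a `__device__` function: only kernels can be launched with `<<<...>>>`, so the costly swap has to be written as a kernel for this to work. The pointers handed to the child must refer to global memory, as they do here.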
Terminal way: you can compile and run the parent kernel (which in turn launches the child kernel) from the terminal. Verified on a Linux machine.
Hope it helps. Thank you!

[1]: http://developer.download.nvidia.com/assets/cuda/docs/TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf