I'm trying to break apart and reshape the structure of an array asynchronously using a CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy(); I'm at a loss.
Can anyone tell me the preferred method for copying memory from within a CUDA kernel?
It is worth noting that cudaMemcpy(void *to, void *from, size, cudaMemcpyDeviceToDevice) will NOT work for what I am trying to do, because it can only be called from outside of the kernel and does not execute asynchronously.
Best Answer
Yes, there is an equivalent to memcpy that works inside CUDA kernels. It is called memcpy. A kernel that calls memcpy on device pointers compiles without error, and the compiler lowers the call directly in the emitted PTX.
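As a sketch of the idea (the kernel name, signature, and per-row layout here are my own illustrative assumptions, not code from the original answer), a kernel can call memcpy on device pointers just as host code would:

```cuda
// Hypothetical kernel: each thread copies one row of `len` ints
// from `in` to `out` using a plain memcpy call in device code.
__global__ void copy_rows(int **in, int **out, int len, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Grid-stride loop over the N rows to copy.
    for (; idx < N; idx += gridDim.x * blockDim.x)
        memcpy(out[idx], in[idx], sizeof(int) * len);
}
```

Compiling this with nvcc (for example, nvcc -arch=sm_30 -ptx example.cu) and inspecting the PTX shows the compiler-generated copy loop the answer refers to.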
The code block at BB0_3 in the emitted PTX is a byte-sized memcpy loop generated automagically by the compiler. It might not be a great idea from a performance point of view to use it, but it is fully supported (and has been for a long time on all architectures).

Edited four years later to add: since the device-side runtime API was released as part of the CUDA 6 release cycle, it is also possible to directly call something like cudaMemcpyAsync() in device code on all architectures which support it (Compute Capability 3.5 and newer hardware, using separate compilation and device linking).
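For illustration (a minimal sketch under my own assumptions: the kernel name and parameters are hypothetical, and the code assumes dynamic-parallelism compilation with nvcc -rdc=true and device linking against cudadevrt), a device-side copy via the device runtime API might look like:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: a single thread issues an asynchronous
// device-to-device copy through the device-side runtime API.
// Requires Compute Capability 3.5+ and compilation with
// -rdc=true plus device linking.
__global__ void device_copy(void *dst, const void *src, size_t bytes)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice);
    }
}
```

Unlike the compiler-generated memcpy loop, this hands the copy to the runtime, which can overlap it with other device work.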