I just need to clarify something very basic. Most of the computational examples use something like:
ID = blockIdx.x*blockDim.x+threadIdx.x;
// … then do computation on array[ID]
My question is: if I want to use the maximum number of threads in a block (1024), do I really need to 'construct' my 'threadID' taking all of threadIdx.x, threadIdx.y, and threadIdx.z into account?
If so, what is a recommended way to hash it into a single value?
If not, why do people use it that way in image-processing operations, such as in this post:
https://stackoverflow.com/questions/11503406/cuda-addressing-a-matrix
What about blockIdx.x and blockIdx.y: are they in the same situation as threadIdx in this regard?
Best Answer
Creating 2D or 3D threadblocks is usually done because the problem lends itself to a 2D or 3D interpretation of the data, and handling it using a 2D or 3D threadblock may make the code more readable. But there's no specific reason why it cannot be done with a 1D threadblock with appropriate indexing.
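If a 2D or 3D threadblock is used anyway, its threadIdx components can be collapsed into a single value. The following is a minimal sketch, assuming x varies fastest (the same order CUDA uses to number threads within a block). The names flat_thread_in_block, mark_threads, and count_hits are hypothetical and exist only for illustration:

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Collapse (threadIdx.x, threadIdx.y, threadIdx.z) into one index, x fastest.
__device__ int flat_thread_in_block()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// Each thread increments the counter at its flattened index.
__global__ void mark_threads(int *hits)
{
    atomicAdd(&hits[flat_thread_in_block()], 1);
}

// Hypothetical host helper: launches one bx-by-by-bz block and returns
// how many threads landed on each flattened index.
std::vector<int> count_hits(int bx, int by, int bz)
{
    int n = bx * by * bz;
    assert(n > 0 && n <= 1024 && bz <= 64);   // per-block limits

    std::vector<int> hits(n);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemset(dev, 0, n * sizeof(int));
    mark_threads<<<1, dim3(bx, by, bz)>>>(dev);
    cudaMemcpy(hits.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return hits;
}
```

If the flattening is correct, every index from 0 to n-1 is hit by exactly one thread, which is what makes the value usable as an array offset.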
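Creating a 2D or 3D grid (of blocks) is usually done for the reason described above and/or to get around the limit, on pre CC 3.0 devices, on the number of blocks in any one dimension of a grid (65535 blocks max per dimension). On CC 3.0 and later devices, the x dimension of the grid can hold up to 2^31-1 blocks, so a 1D grid covers far larger problems there.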
For the threadblock case, you can use 1024 threads in a single block in a single dimension, so you don't need to construct your ID variable with threadIdx.y or threadIdx.z if you don't want to.
If you have a pre CC 3.0 device, and your problem is large enough in terms of blocks, you may still want to construct a 2D grid. You can still use 1D threadblocks in that grid. In that case, a unique ID variable can be created like:
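A minimal sketch of one such ID, assuming 1D threadblocks launched in a 2D grid. The kernel name fill_ids and the host helper run_fill_ids are hypothetical names used only to make the snippet testable:

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes its flattened global ID into out[threadId].
// Assumes 1D threadblocks (only threadIdx.x varies) in a 1D or 2D grid.
__global__ void fill_ids(int *out, int n)
{
    int blockId  = blockIdx.y * gridDim.x + blockIdx.x;   // flatten the 2D grid
    int threadId = blockId * blockDim.x + threadIdx.x;    // add offset inside the block
    if (threadId < n)                                     // skip threads past the end
        out[threadId] = threadId;
}

// Hypothetical host helper: launches fill_ids on a gx-by-gy grid of 1D blocks.
std::vector<int> run_fill_ids(int n, int threadsPerBlock, int gx, int gy)
{
    // The grid must provide at least one thread per element.
    assert((long long)gx * gy * threadsPerBlock >= n);

    std::vector<int> host(n);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemset(dev, 0xFF, n * sizeof(int));               // every entry starts as -1
    fill_ids<<<dim3(gx, gy), threadsPerBlock>>>(dev, n);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```
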
The above construct should handle 1D threadblocks with any 2D grid.
There are other methods besides constructing a 2D grid to work with large problem sizes, such as having your blocks handle multiple chunks of data in a loop of some sort.
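One common form of such a loop is the grid-stride loop, sketched below under the assumption of a 1D grid of 1D blocks. The kernel scale and the helper run_scale are hypothetical names; the element-wise multiply stands in for whatever per-element work the real kernel does:

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global ID and then jumps
// ahead by the total number of threads in the grid until it runs past n.
__global__ void scale(float *data, int n, float factor)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Hypothetical host helper: the grid may be much smaller than the data.
std::vector<float> run_scale(const std::vector<float> &in, float factor,
                             int blocks, int threadsPerBlock)
{
    assert(blocks > 0 && threadsPerBlock > 0);

    std::vector<float> out(in.size());
    size_t bytes = in.size() * sizeof(float);
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, in.data(), bytes, cudaMemcpyHostToDevice);
    scale<<<blocks, threadsPerBlock>>>(dev, (int)in.size(), factor);
    cudaMemcpy(out.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return out;
}
```

With this shape the launch configuration no longer has to match the problem size: for example, 4 blocks of 256 threads can process 10000 elements, with each thread handling roughly ten of them.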