CUDA thread addressing (threadIdx.x, threadIdx.y, threadIdx.z) and block addressing (blockIdx.x, blockIdx.y)

cuda

I just need to clarify something very basic. Most of the computational examples use something like:

ID = blockIdx.x*blockDim.x+threadIdx.x;

// … then do computation on array[ID]

My question is: if I want to use the maximum number of threads in a block (1024), do I really need to 'construct' my thread ID from all of threadIdx.x, threadIdx.y, and threadIdx.z?

If so, what is a recommended way to hash it into a single value?

If not, why would someone use it in a similar fashion for image-processing operations, as in this post:

https://stackoverflow.com/questions/11503406/cuda-addressing-a-matrix

How about blockIdx.x and blockIdx.y: are they in the same boat as threadIdx in this regard?

Best Answer

Creating 2D or 3D threadblocks is usually done because the problem lends itself to a 2D or 3D interpretation of the data, and handling it with a 2D or 3D threadblock may make the code more readable. But there's no specific reason it cannot be done with a 1D threadblock and appropriate indexing.
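To make that concrete, here is a minimal sketch (the kernel names, the width/height parameters, and the launch shapes are my own illustrative assumptions, not from the original post) of the same row-major image operation written once with a 2D threadblock and once with a 1D threadblock:

    // 2D threadblock: the 2D layout of the image maps directly onto threadIdx.x/y.
    __global__ void scale2d(float *out, const float *in, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < height && col < width)
            out[row * width + col] = 2.0f * in[row * width + col];
    }

    // 1D threadblock: same work on the same data, just a flat index.
    __global__ void scale1d(float *out, const float *in, int width, int height)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < width * height)
            out[idx] = 2.0f * in[idx];
    }

    // Possible launches (assuming a 1920x1080 image):
    //   dim3 block2d(16, 16);
    //   dim3 grid2d((1920 + 15) / 16, (1080 + 15) / 16);
    //   scale2d<<<grid2d, block2d>>>(d_out, d_in, 1920, 1080);
    //
    //   int n = 1920 * 1080;
    //   scale1d<<<(n + 255) / 256, 256>>>(d_out, d_in, 1920, 1080);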

Creating a 2D or 3D grid (of blocks) is usually done for the reason described above, and/or to get around the limit on pre-CC 3.0 devices of 65535 blocks in any one dimension of the grid.

For the threadblock case, you can use 1024 threads in a single block in a single dimension, so you don't need to construct your ID variable with threadIdx.y or threadIdx.z if you don't want to.
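As a sketch (the kernel and variable names here are mine, not part of the question), a single 1D block of up to 1024 threads only ever needs threadIdx.x; and if you do choose a 2D or 3D block shape, the conventional way to flatten the thread index into a single value is shown as well:

    __global__ void kernel1d(float *data)
    {
        // 1D block: threadIdx.y and threadIdx.z are always 0, so they can be ignored.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        data[tid] = (float)tid;
    }

    // Launch with the maximum 1024 threads per block, all in x:
    //   kernel1d<<<numBlocks, 1024>>>(d_data);

    // If you did use a 2D or 3D block, the usual flattening is:
    //   int t = threadIdx.x
    //         + threadIdx.y * blockDim.x
    //         + threadIdx.z * blockDim.x * blockDim.y;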

If you have a pre-CC 3.0 device and your problem is large enough in terms of blocks, you may still want to construct a 2D grid. You can still use 1D threadblocks in that grid. In that case, a unique ID variable can be created like:

 int idx = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);  

The above construct should handle 1D threadblocks with any 2D grid.
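A launch configuration matching that indexing might look like the following (the problem size and block size are illustrative assumptions):

    // Cover n elements with 1D blocks of 256 threads, splitting the block count
    // across grid x and y so neither dimension exceeds 65535 blocks.
    int n = 100000000;                              // assumed problem size
    int threads = 256;
    int totalBlocks = (n + threads - 1) / threads;
    int gridX = totalBlocks < 65535 ? totalBlocks : 65535;
    int gridY = (totalBlocks + gridX - 1) / gridX;
    dim3 grid(gridX, gridY);
    // kernel<<<grid, threads>>>(...);   // inside the kernel, guard with: if (idx < n)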

There are other methods besides constructing a 2D grid to work with large problem sizes, such as having your blocks handle multiple chunks of data in a loop of some sort.
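One common form of that idea is a grid-stride loop; here is a minimal sketch (the kernel name and parameters are assumptions, not from the original answer):

    // A fixed-size grid sweeps over the whole array in strides, so the grid
    // never needs more blocks than any dimension limit allows.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }

    // Example launch: a modest, fixed grid handles any n.
    //   saxpy<<<1024, 256>>>(n, 2.0f, d_x, d_y);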