CUDA thread addressing (threadIdx.x, threadIdx.y, threadIdx.z) and block addressing (blockIdx.x, blockIdx.y)

cuda

I just need to clarify something very basic. Most of the computational examples use something like:

ID = blockIdx.x*blockDim.x+threadIdx.x;

// … then do computation on array[ID]

My question is: if I want to use the maximum number of threads in a block (1024), do I really need to 'construct' my thread ID from all of threadIdx.x, threadIdx.y, and threadIdx.z?

If so, what is a recommended way to hash it into a single value?

If not, why would someone use it in a similar fashion for image-processing operations, as in this post:

https://stackoverflow.com/questions/11503406/cuda-addressing-a-matrix

How about blockIdx.x and blockIdx.y: are they in the same boat as threadIdx in this regard?

Best Answer

Creating 2D or 3D threadblocks is usually done because the problem lends itself to a 2D or 3D interpretation of the data, and handling it with a 2D or 3D threadblock may make the code more readable. But there's no specific reason it cannot be done with a 1D threadblock and appropriate indexing.
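To make that concrete, here is a minimal sketch (the kernel names, the width/height parameters, and the launch shapes are my own illustrative assumptions, not from the original post) of the same row-major image operation written once with a 2D threadblock and once with a 1D threadblock:

    // 2D threadblock: the 2D layout of the image maps directly onto threadIdx.x/y.
    __global__ void scale2d(float *out, const float *in, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < height && col < width)
            out[row * width + col] = 2.0f * in[row * width + col];
    }

    // 1D threadblock: same work on the same data, just a flat index.
    __global__ void scale1d(float *out, const float *in, int width, int height)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < width * height)
            out[idx] = 2.0f * in[idx];
    }

    // Possible launches (assuming a 1920x1080 image):
    //   dim3 block2d(16, 16);
    //   dim3 grid2d((1920 + 15) / 16, (1080 + 15) / 16);
    //   scale2d<<<grid2d, block2d>>>(d_out, d_in, 1920, 1080);
    //
    //   int n = 1920 * 1080;
    //   scale1d<<<(n + 255) / 256, 256>>>(d_out, d_in, 1920, 1080);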

Creating a 2D or 3D grid (of blocks) is usually done for the reason described above, and/or to get around the limit on pre-CC 3.0 devices of 65535 blocks in any one dimension of the grid.

For the threadblock case, you can use 1024 threads in a single block in a single dimension, so you don't need to construct your ID variable with threadIdx.y or threadIdx.z if you don't want to.
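As a sketch (the kernel and variable names here are mine, not part of the question), a single 1D block of up to 1024 threads only ever needs threadIdx.x; and if you do choose a 2D or 3D block shape, the conventional way to flatten the thread index into a single value is shown as well:

    __global__ void kernel1d(float *data)
    {
        // 1D block: threadIdx.y and threadIdx.z are always 0, so they can be ignored.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        data[tid] = (float)tid;
    }

    // Launch with the maximum 1024 threads per block, all in x:
    //   kernel1d<<<numBlocks, 1024>>>(d_data);

    // If you did use a 2D or 3D block, the usual flattening is:
    //   int t = threadIdx.x
    //         + threadIdx.y * blockDim.x
    //         + threadIdx.z * blockDim.x * blockDim.y;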

If you have a pre-CC 3.0 device and your problem is large enough in terms of blocks, you may still want to construct a 2D grid. You can still use 1D threadblocks in that grid. In that case, a unique ID variable can be created like:

 int idx = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);  

The above construct should handle 1D threadblocks with any 2D grid.
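A launch configuration matching that indexing might look like the following (the problem size and block size are illustrative assumptions):

    // Cover n elements with 1D blocks of 256 threads, splitting the block count
    // across grid x and y so neither dimension exceeds 65535 blocks.
    int n = 100000000;                              // assumed problem size
    int threads = 256;
    int totalBlocks = (n + threads - 1) / threads;
    int gridX = totalBlocks < 65535 ? totalBlocks : 65535;
    int gridY = (totalBlocks + gridX - 1) / gridX;
    dim3 grid(gridX, gridY);
    // kernel<<<grid, threads>>>(...);   // inside the kernel, guard with: if (idx < n)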

There are other methods besides constructing a 2D grid to work with large problem sizes, such as having your blocks handle multiple chunks of data in a loop of some sort.
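One common form of that idea is a grid-stride loop; here is a minimal sketch (the kernel name and parameters are assumptions, not from the original answer):

    // A fixed-size grid sweeps over the whole array in strides, so the grid
    // never needs more blocks than any dimension limit allows.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }

    // Example launch: a modest, fixed grid handles any n.
    //   saxpy<<<1024, 256>>>(n, 2.0f, d_x, d_y);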