Cuda block/grid dimensions: when to use dim3

cudagpu

I need some clearing up regarding the use of dim3 to set the number of threads in my CUDA kernel.

I have an image in a 1D float array, which I'm copying to the device with:

checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));

Now I need to set the grid and block sizes to launch my kernel:

dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)

I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using

unsigned int num_blocks = ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<<num_blocks, 512>>>(...)

instead?

Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

And when I'm not using dim3, I'll just use one index?

Thank you very much,

Best Answer

The way you arrange the data in memory is independently on how you would configure the threads of your kernel.

The memory is always a 1D continuous space of bytes. However, the access pattern depends on how you are interpreting your data and also how you are accessing them by 1D, 2D and 3D blocks of threads.

dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.

The same happens for the blocks and the grid.

So, in both cases: dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...) you will always have access to threadIdx.y and threadIdx.z.

As the thread ids start at zero, you can calculate a memory position as a row major order using also the ydimension:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

int gid = img.col * y + x;

because blockIdx.y and threadIdx.y will be zero.

To sumup, it does it matter if you use a dim3 structure. I would be clear where the configuration of the threads has been defined, and the 1D, 2D and 3D access pattern depends on how you are interpreting your data and also how you are accessing them by 1D, 2D and 3D blocks of threads.

Hardware Constraints:

This is the easy to quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits which limit how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:

Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x and later respectively)
The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)
Each block cannot consume more than 8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total (Compute 1.0,1.1/1.2,1.3/2.x-/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)
Each block cannot consume more than 16kb/48kb/96kb of shared memory (Compute 1.x/2.x-6.2/7.0)

If you stay within those limits, any kernel you can successfully compile will launch without error.

Performance Tuning:

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does effect the performance of code running on the hardware. How each code behaves will be different and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:

The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput. The orthodox approach here is to try achieving optimal hardware occupancy (what Roger Dahl's answer is referring to).

The second point is a huge topic which I doubt anyone is going to try and cover it in a single StackOverflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is).

At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have a impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where that is. The good news is that because you are working in multiples of the warp size, the search space is very finite and the best configuration for a given piece of code relatively easy to find.

CUDA Thread Addressing ((threadIdx.x, threadIdx.y, threadIdx.z) and block addressing (blockidx.x, blockidx.y)

Creating 2D or 3D threadblocks is usually done because the problem lends itself to a 2D or 3D interpretation of the data, and handling it using a 2D or 3D threadblock may make the code more readable. But there's no specific reason why it cannot be done with a 1D threadblock with appropriate indexing.

Creating a 2D or 3D grid (of blocks) is usually done for the reason described above and/or to get around the limitation on pre CC 3.0 devices of the number of blocks in any one dimension of a grid (65535 max blocks in any dimension).

For the threadblock case, you can use 1024 threads in a single block in a single dimension, so you don't need to construct your ID variable with threadIdx.y or threadIdx.z if you don't want to.

If you have a pre CC 3.0 device, and your problem is large enough in terms of blocks, you may still want to construct a 2D grid. You can still use 1D threadblocks in that grid. In that case, a unique ID variable can be created like:

 int idx = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);

The above construct should handle 1D threadblocks with any 2D grid.

There are other methods besides constructing a 2D grid to work with large problem sizes, such as having your blocks handle multiple chunks of data in a loop of some sort.

Best Answer

Related Solutions

How to choose grid and block dimensions for CUDA kernels

Hardware Constraints:

Performance Tuning:

CUDA Thread Addressing ((threadIdx.x, threadIdx.y, threadIdx.z) and block addressing (blockidx.x, blockidx.y)

Related Topic