How to choose grid and block dimensions for CUDA kernels

cudagpunvidiaoptimizationperformance

This is a question about how to determine the CUDA grid, block and thread sizes. This is an additional question to the one posted here.

Following this link, the answer from talonmies contains a code snippet (see below). I don't understand the comment "value usually chosen by tuning and hardware constraints".

I haven't found a good explanation or clarification that explains this in the CUDA documentation. In summary, my question is how to determine the optimal blocksize (number of threads) given the following code:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / nthreads; // value determine by block size and total work
madd<<<nblocks,blocksize>>>mAdd(A,B,C,n);

Best Answer

There are two parts to that answer (I wrote it). One part is easy to quantify, the other is more empirical.

Hardware Constraints:

This is the easy to quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits which limit how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:

Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x and later respectively)
The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x or later)
Each block cannot consume more than 8k/16k/32k/64k/32k/64k/32k/64k/32k/64k registers total (Compute 1.0,1.1/1.2,1.3/2.x-/3.0/3.2/3.5-5.2/5.3/6-6.1/6.2/7.0)
Each block cannot consume more than 16kb/48kb/96kb of shared memory (Compute 1.x/2.x-6.2/7.0)

If you stay within those limits, any kernel you can successfully compile will launch without error.

Performance Tuning:

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does effect the performance of code running on the hardware. How each code behaves will be different and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:

The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput. The orthodox approach here is to try achieving optimal hardware occupancy (what Roger Dahl's answer is referring to).

The second point is a huge topic which I doubt anyone is going to try and cover it in a single StackOverflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is).

At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have a impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where that is. The good news is that because you are working in multiples of the warp size, the search space is very finite and the best configuration for a given piece of code relatively easy to find.

Related Solutions

How to get the CUDA version

As Jared mentions in a comment, from the command line:

nvcc --version

(or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version).

From application code, you can query the runtime API version with

cudaRuntimeGetVersion()

or the driver API version with

cudaDriverGetVersion()

As Daniel points out, deviceQuery is an SDK sample app that queries the above, along with device capabilities.

As others note, you can also check the contents of the version.txt using (e.g., on Mac or Linux)

cat /usr/local/cuda/version.txt

However, if there is another version of the CUDA toolkit installed other than the one symlinked from /usr/local/cuda, this may report an inaccurate version if another version is earlier in your PATH than the above, so use with caution.

CUDA – Multiprocessors, Warp size and Maximum Threads Per Block: What is the exact relationship

In my workplace I am working with a GTX 590, which contains 512 CUDA cores, 16 multiprocessors and which has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor which works exactly on the same code in the same warp. And finally the maximum threads per block size is 1024.

A GTX590 contains 2x the numbers you mentioned, since there are 2 GPUs on the card. Below, I focus on a single chip.

Let me tell my understanding of the situation: For example I allocate N blocks with the maximum threadPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are firstly enumerated by the hardware. In this case 16 from the N blocks are assigned to different multiprocessors.

Block are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while a few of them go idle. I don't know why.

Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor.

The relationship between threads and cores is not that direct. There are 32 "basic" ALUs in each SM. The ones that handle such things as single precision floating point and most 32 bit integer and logic instructions. But there are only 16 load/store units, so if the warp instruction that is currently being processed is a load/store, it must be scheduled twice. And there are only 4 special function units, that do things such as trigonometry. So these instructions must be scheduled 32 / 4 = 8 times.

The threads in the same multiprocessor (warp) process the same line of the code and use shared memory of the current multiproccessor.

No, there can be many more than 32 threads "in flight" at the same time in a single SM.

If the current 32 threads encounter an off-chip operation like memory read-writes, they are replaced with an another group of 32 threads from the current block. So, there are actually 32 threads in a single block which are exactly running in parallel on a multiprocessor in any given time, not the whole of the 1024.

No, it is not only memory operations that cause warps to be replaced. The ALUs are also deeply pipelined, so new warps will be swapped in as data dependencies occur for values that are still in the pipeline. So, if the code contains two instructions where the second one uses the output from the first, the warp will be put on hold while the value from the first instruction makes its way through the pipeline.

Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor.

A multiprocessor can process more than one block at a time but a block cannot move to another MP once processing on it has started. The number of threads in a block that are currently in flight depends on how many resources the block uses. The CUDA Occupancy Calculator will tell you how many blocks will be in flight at the same time based on the resource usage of your specific kernel.

And finally there are a total of 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than available on a single multiprocessor then it is divided to work on two multiprocessors but lets assume that each block can fit into a single multiprocessor in our case.)

No, a block cannot be divided to work on two multiprocessors. A whole block is always processed by a single multiprocessor. If the given multiprocessor does not have enough resources to process at least one block with your kernel, you will get a kernel launch error and your program won't run at all.

It depends on how you define a thread as "running". The GPU will typically have many more than 512 threads consuming various resources on the chip at the same time.

See @harrism's answer in this question: CUDA: How many concurrent threads in total?