CUDA – Multiprocessors, Warp size and Maximum Threads Per Block: What is the exact relationship

Tags: caching, cuda, memory, textures

I know that there are multiprocessors on a CUDA GPU which contain CUDA cores in them. In my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors, and which has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, all of which work on exactly the same code in the same warp. And finally, the maximum number of threads per block is 1024.

My question is how block size, multiprocessor count and warp size are exactly related. Let me describe my understanding of the situation: for example, I launch N blocks with the maximum threadsPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case 16 of the N blocks are assigned to different multiprocessors. Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor. The threads in the same multiprocessor (warp) process the same line of code and use the shared memory of the current multiprocessor. If the current 32 threads encounter an off-chip operation like memory reads or writes, they are replaced with another group of 32 threads from the current block. So there are actually only 32 threads in a single block running exactly in parallel on a multiprocessor at any given time, not the whole 1024. Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor. And finally, there are a total of 512 threads running in parallel on the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor then it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)
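
For concreteness, here is a minimal sketch of the kind of launch I mean; the kernel name, its body and the data pointer are just placeholders:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: each of the N * 1024 threads touches one element.
__global__ void myKernel(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;
}

int main()
{
    const int N = 64;                      // number of blocks, chosen arbitrarily here
    float *d_data;
    cudaMalloc(&d_data, N * 1024 * sizeof(float));
    myKernel<<<N, 1024>>>(d_data);         // N blocks, 1024 threads per block
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```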

So, is my model of CUDA parallel execution correct? If not, what is wrong or missing? I want to fine-tune the current project I am working on, so I need the most accurate working model of the whole thing.

Best Answer

In my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors, and which has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, all of which work on exactly the same code in the same warp. And finally, the maximum number of threads per block is 1024.

A GTX 590 contains twice the numbers you mentioned, since there are two GPUs on the card. Below, I focus on a single chip.
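
You can see this directly by enumerating the devices: a GTX 590 shows up as two separate CUDA devices, each with its own set of SMs. A minimal sketch using standard runtime API calls:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);                 // a GTX 590 reports 2 devices
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, warp size %d, max %d threads/block\n",
               i, prop.name, prop.multiProcessorCount,
               prop.warpSize, prop.maxThreadsPerBlock);
    }
    return 0;
}
```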

Let me describe my understanding of the situation: for example, I launch N blocks with the maximum threadsPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case 16 of the N blocks are assigned to different multiprocessors.

Blocks are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while others go idle. I don't know why.
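
If you want to see the actual block-to-SM assignment for a particular launch, each block can record the ID of the SM it ran on by reading the %smid special register through inline PTX. This is a quick diagnostic sketch, not something a program should depend on:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Read the ID of the SM this thread is running on (diagnostic only).
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block records which SM the block landed on.
__global__ void whichSM(unsigned int *smids)
{
    if (threadIdx.x == 0)
        smids[blockIdx.x] = get_smid();
}

int main()
{
    const int numBlocks = 16;
    unsigned int *d_smids, h_smids[numBlocks];
    cudaMalloc(&d_smids, numBlocks * sizeof(unsigned int));
    whichSM<<<numBlocks, 1024>>>(d_smids);
    cudaMemcpy(h_smids, d_smids, sizeof(h_smids), cudaMemcpyDeviceToHost);
    for (int b = 0; b < numBlocks; ++b)
        printf("block %d ran on SM %u\n", b, h_smids[b]);
    cudaFree(d_smids);
    return 0;
}
```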

Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor.

The relationship between threads and cores is not that direct. There are 32 "basic" ALUs in each SM, the ones that handle things such as single-precision floating point and most 32-bit integer and logic instructions. But there are only 16 load/store units, so if the warp instruction that is currently being processed is a load/store, it must be scheduled twice. And there are only 4 special function units, which handle things such as trigonometry, so those instructions must be scheduled 32 / 4 = 8 times.
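
As a rough illustration, the instructions in a kernel like this hypothetical one map onto those different units:

```cpp
// Hypothetical kernel showing one instruction of each class.
__global__ void unitsDemo(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];            // load/store unit: 16 per SM, so a warp needs 2 issue slots
    float y = x * 2.0f + 1.0f;  // "basic" ALU (fused multiply-add): 32 per SM, 1 slot per warp
    out[i] = __sinf(y);         // special function unit: 4 per SM, so 8 slots per warp
}
```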

The threads in the same multiprocessor (warp) process the same line of code and use the shared memory of the current multiprocessor.

No, there can be many more than 32 threads "in flight" at the same time in a single SM.

If the current 32 threads encounter an off-chip operation like memory reads or writes, they are replaced with another group of 32 threads from the current block. So there are actually only 32 threads in a single block running exactly in parallel on a multiprocessor at any given time, not the whole 1024.

No, it is not only memory operations that cause warps to be replaced. The ALUs are also deeply pipelined, so new warps will be swapped in as data dependencies occur for values that are still in the pipeline. So, if the code contains two instructions where the second one uses the output from the first, the warp will be put on hold while the value from the first instruction makes its way through the pipeline.
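
Here is a sketch of the kind of dependency chain I mean: the second arithmetic instruction cannot issue until the result of the first has left the pipeline, so the warp stalls and the scheduler issues other resident warps in the meantime.

```cpp
// Hypothetical kernel with a read-after-write dependency between two ALU ops.
__global__ void dependentOps(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = data[i] * 1.5f;   // first instruction
    float b = a + 3.0f;         // needs 'a': the warp waits until the multiply clears the pipeline
    data[i] = b;
}
```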

Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor.

A multiprocessor can process more than one block at a time but a block cannot move to another MP once processing on it has started. The number of threads in a block that are currently in flight depends on how many resources the block uses. The CUDA Occupancy Calculator will tell you how many blocks will be in flight at the same time based on the resource usage of your specific kernel.
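
Besides the spreadsheet, newer CUDA toolkits can compute the same number programmatically with cudaOccupancyMaxActiveBlocksPerMultiprocessor. A minimal host-side sketch, assuming a kernel called myKernel launched with 1024-thread blocks and no dynamic shared memory:

```cpp
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                              1024 /* block size */,
                                              0    /* dynamic smem bytes */);
printf("Resident blocks per SM for this kernel: %d\n", blocksPerSM);
```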

And finally, there are a total of 512 threads running in parallel on the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor then it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)

No, a block cannot be divided to work on two multiprocessors. A whole block is always processed by a single multiprocessor. If the given multiprocessor does not have enough resources to process at least one block with your kernel, you will get a kernel launch error and your program won't run at all.
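
You can catch this case by checking the error status immediately after the launch. A minimal host-side sketch, with the kernel and launch configuration as placeholders:

```cpp
myKernel<<<numBlocks, 1024>>>(d_data);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    // e.g. "too many resources requested for launch" when a 1024-thread
    // block does not fit on one SM with this kernel's register/smem usage
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
}
```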

It depends on how you define a thread as "running". The GPU will typically have many more than 512 threads consuming various resources on the chip at the same time.
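
An upper bound on the number of simultaneously resident threads can be computed from the device properties; a Fermi SM can hold 1536 resident threads, so one GTX 590 chip can have up to 16 x 1536 = 24576 threads resident, even though only a fraction of them issue instructions in any given cycle. A minimal host-side sketch:

```cpp
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int maxResident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
printf("Up to %d threads can be resident on this GPU at once\n", maxResident);
```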

See @harrism's answer in this question: CUDA: How many concurrent threads in total?
