CUDA model – what is warp size

cudagpgpu

What's the relationship between maximum work group size and warp size? Let’s say my device has 240 CUDA streaming processors (SP) and returns the following information –

CL_DEVICE_MAX_COMPUTE_UNITS: 30

CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64

CL_DEVICE_MAX_WORK_GROUP_SIZE: 512

CL_NV_DEVICE_WARP_SIZE: 32

This means it has eight SPs per streaming multiprocessor (that is, compute unit). Now how is warp size = 32 related to these numbers?

Best Answer

Direct Answer: Warp size is the number of threads in a warp, which is a sub-division used in the hardware implementation to coalesce memory access and instruction dispatch.

Suggested Reading:

As @Matias mentioned, I'd go read the CUDA C Best Practices Guide (you'll have to scroll to the bottom where it's listed). It might help for you to stare at the table in Appendix G.1 on page 164.

Explanation:

CUDA is language which provides parallelism at two levels. You have threads and you have blocks of threads. This is most evident when you execute a kernel; you need to specify the size of each thread block and the number of thread blocks in between the <<< >>> which precede the kernel parameters.

What CUDA doesn't tell you is things are actually happening at four levels, not two. In the background, your block of threads is actually divided into sub-blocks called "warps". Here's a brief metaphor to help explain what's really going on:

Brief Metaphor:

Pretend you're an educator/researcher/politician who's interested in the current mathematical ability of high school seniors. Your plan is to give a test to 10,240 students, but you can't just put them all in a football stadium or something and give them the test. It is easiest to subdivide (parallelize) your data collection -- so you go to 20 different high school and ask that 512 of their seniors each take the math test.

The number of high schools, 20, is analagous to the number of "blocks" / "number of blocks of threads". The number of seniors, 512, is analagous to the number of threads in each block aka "threads per block".

You collect your data and that is all you care about. What you didn't know (and didn't really care about) is that each school is actually subdivided into classrooms. So your 512 seniors are actually divided into 16 groups of 32. And further, none of these schools really has the resources required -- each classroom only has sixteen calculators. Hence, at any one time only half of each classroom can take your math test.

The number of seniors, 512, represents the number of threads per block requested when launching a CUDA Kernel. The implementation hardware may further divide this into 16 sequential blocks of 32 threads to process the full number of requested threads, which is 512. The number 32 is the warp size, but this may vary on different hardware generations.

I could go on to stretch silly rules like only eight classrooms in any one school can take the test at one time because they only have eight teachers. You can't sample more than 30 schools simultaneously because you only have 30 proctors...

Back to your question:

Using the metaphor, your program wants to compute results as fast as possible (you want to collect math tests). You issue a kernel with a certain number of blocks (schools) each of which has a certain number of threads (students). You can only have so many blocks running at one time (collecting your survey responses requires one proctor per school). In CUDA, thread blocks run on a streaming multiprocessor (SM). The variable: CL_DEVICE_MAX_COMPUTE_UNITS tells you how many SMs, 30, a specific card has. This varies drastically based on the hardware -- check out the table in Appendix A of the CUDA C Best Practices Guide. Note that each SM can run only eight blocks simultaneously regardless of the compute capability (1.X or 2.X).

Thread blocks have maximum dimensions: CL_DEVICE_MAX_WORK_ITEM_SIZES. Think of laying out your threads in a grid; you can't have a row with more than 512 threads. You can't have a column with more than 512 threads. And you can't stack more than 64 threads high. Next, there is a maximum: CL_DEVICE_MAX_WORK_GROUP_SIZE number of threads, 512, that can be grouped together in a block. So your thread blocks' dimensions could be:

512 x 1 x 1

1 x 512 x 1

4 x 2 x 64

64 x 8 x 1

etc...

Note that as of Compute Capability 2.X, your blocks can have at most 1024 threads. Lastly, the variable CL_NV_DEVICE_WARP_SIZE specifies the warp size, 32 (number of students per classroom). In Compute Capability 1.X devices, memory transfers and instruction dispatch occur at the Half-Warp granularity (you only have 16 calculators per classroom). In Compute Capability 2.0, memory transfers are grouped by Warp, so 32 fetches simultaneously, but instruction dispatch is still only grouped by Half-Warp. For Compute Capability 2.1, both memory transfers and instruction dispatch occur by Warp, 32 threads. These things can and will change in future hardware.

So, my word! Let's get to the point:

In Summary:

I have described the nuances of warp/thread layout and other such stuff, but here are a couple of things to keep in mind. First, your memory access should be "groupable" in sets of 16 or 32. So keep the X dimension of your blocks a multiple of 32. Second, and most important to get the most from a specific gpu, you need to maximize occupancy. Don't have 5 blocks of 512 threads. And don't have 1,000 blocks of 10 threads. I would strongly recommend checking out the Excel-based spreadsheet (works in OpenOffice too?? I think??) which will tell you what the GPU occupancy will be for a specific kernel call (thread layout and shared memory requirements). I hope this explanation helps!

Related Solutions

How to get the CUDA version

As Jared mentions in a comment, from the command line:

nvcc --version

(or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version).

From application code, you can query the runtime API version with

cudaRuntimeGetVersion()

or the driver API version with

cudaDriverGetVersion()

As Daniel points out, deviceQuery is an SDK sample app that queries the above, along with device capabilities.

As others note, you can also check the contents of the version.txt using (e.g., on Mac or Linux)

cat /usr/local/cuda/version.txt

However, if there is another version of the CUDA toolkit installed other than the one symlinked from /usr/local/cuda, this may report an inaccurate version if another version is earlier in your PATH than the above, so use with caution.

How do CUDA blocks/warps/threads map onto CUDA cores

Two of the best references are

I'll try to answer each of your questions.

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see 1 p.7-10 and 2.

4'. There is a mapping between laneid (threads index in a warp) and a core.

5'. If a warp contains less than 32 threads it will in most cases be executed the same as if it has 32 threads. Warps can have less than 32 active threads for several reasons: number of threads per block is not divisible by 32, the program execute a divergent block so threads that did not take the current path are marked inactive, or a thread in the warp exited.

6'. A thread block will be divided into WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize There is no requirement for the warp schedulers to select two warps from the same thread block.

7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.

See reference 2 for differences between a GTX480 and GTX560.

If you read the reference material (few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.

1'. If you launch kernel<<<8, 48>>> you will get 8 blocks each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to a SM then it is possible that each warp scheduler can select a warp and execute the warp. You will only use 32 of the 48 cores.

2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.

8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions

In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.

3'. A GTX560 can have 8 SM * 8 blocks = 64 blocks at a time or 8 SM * 48 warps = 512 warps if the kernel does not max out registers or shared memory. At any given time on a portion of the work will be active on SMs. Each SM has multiple execution units (more than CUDA cores). Which resources are in use at any given time is dependent on the warp schedulers and instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do a special floating point operation the SUFU units will idle.

4'. Parallel Nsight and the Visual Profiler show

a. executed IPC

b. issued IPC

c. active warps per active cycle

d. eligible warps per active cycle (Nsight only)

e. warp stall reasons (Nsight only)

f. active threads per instruction executed

The profiler do not show the utilization percentage of any of the execution units. For GTX560 a rough estimate would be IssuedIPC / MaxIPC. For MaxIPC assume GF100 (GTX480) is 2 GF10x (GTX560) is 4 but target is 3 is a better target.

Best Answer

Related Solutions

How to get the CUDA version

How do CUDA blocks/warps/threads map onto CUDA cores

Related Topic