Direct Answer: Warp size is the number of threads in a warp, which is a sub-division used in the hardware implementation to coalesce memory access and instruction dispatch.
Suggested Reading:
As @Matias mentioned, I'd go read the CUDA C Best Practices Guide (you'll have to scroll to the bottom where it's listed). It might help for you to stare at the table in Appendix G.1 on page 164.
Explanation:
CUDA is a language that provides parallelism at two levels. You have threads and you have blocks of threads. This is most evident when you execute a kernel; you need to specify the size of each thread block and the number of thread blocks in between the <<< >>> which precede the kernel parameters.
What CUDA doesn't tell you is that things are actually happening at four levels, not two. In the background, your block of threads is actually divided into sub-blocks called "warps". Here's a brief metaphor to help explain what's really going on:
Brief Metaphor:
Pretend you're an educator/researcher/politician who's interested in the current mathematical ability of high school seniors. Your plan is to give a test to 10,240 students, but you can't just put them all in a football stadium or something and give them the test. It is easiest to subdivide (parallelize) your data collection -- so you go to 20 different high schools and ask that 512 of their seniors each take the math test.
The number of high schools, 20, is analogous to the number of "blocks" / "number of blocks of threads". The number of seniors, 512, is analogous to the number of threads in each block aka "threads per block".
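To make the school arithmetic concrete, here is a minimal sketch of how the block count falls out of the total work and the threads per block. The function name `blocks_needed` is made up for illustration; it is not part of CUDA, just the usual ceiling division people do before a kernel launch.

```python
def blocks_needed(total_items: int, threads_per_block: int) -> int:
    """Smallest number of blocks that gives every item its own thread."""
    if total_items < 0 or threads_per_block <= 0:
        raise ValueError("total_items must be >= 0 and threads_per_block > 0")
    # Ceiling division: round up so a partial last block is still counted.
    return (total_items + threads_per_block - 1) // threads_per_block


students = 10_240          # total tests to collect (total threads)
seniors_per_school = 512   # threads per block

print(blocks_needed(students, seniors_per_school))      # 20 schools (blocks)
print(blocks_needed(students + 1, seniors_per_school))  # 21: one extra student needs one more school
```

If the total is not a multiple of the block size, the last block has spare threads, which is why kernels usually guard with a bounds check on the thread index.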
You collect your data and that is all you care about. What you didn't know (and didn't really care about) is that each school is actually subdivided into classrooms. So your 512 seniors are actually divided into 16 groups of 32. And further, none of these schools really has the resources required -- each classroom only has sixteen calculators. Hence, at any one time only half of each classroom can take your math test.
The number of seniors, 512, represents the number of threads per block requested when launching a CUDA kernel. The implementation hardware may further divide this into 16 sequential groups of 32 threads (the warps) to process the full number of requested threads, which is 512. The number 32 is the warp size, but this may vary on different hardware generations.
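The classroom split can be sketched as follows. The helper names are hypothetical and the warp size of 32 is just the value quoted above, so treat this as back-of-the-envelope arithmetic rather than anything the hardware exposes.

```python
WARP_SIZE = 32  # value quoted in the text; may differ on other hardware


def warps_per_block(threads_per_block: int, warp_size: int = WARP_SIZE) -> int:
    """Number of warps (classrooms) a block is split into, rounding up."""
    return -(-threads_per_block // warp_size)


def idle_lanes_in_last_warp(threads_per_block: int, warp_size: int = WARP_SIZE) -> int:
    """Empty seats in the last classroom when the block size is not a multiple of the warp size."""
    remainder = threads_per_block % warp_size
    return 0 if remainder == 0 else warp_size - remainder


print(warps_per_block(512), idle_lanes_in_last_warp(512))  # 16 0
print(warps_per_block(100), idle_lanes_in_last_warp(100))  # 4 28
```

A block of 100 threads still takes up four full warps, with 28 of the 128 slots doing nothing, which is one reason block sizes that are multiples of 32 are usually preferred.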
I could go on to stretch silly rules like only eight classrooms in any one school can take the test at one time because they only have eight teachers. You can't sample more than 30 schools simultaneously because you only have 30 proctors...
Back to your question:
Using the metaphor, your program wants to compute results as fast as possible (you want to collect math tests). You issue a kernel with a certain number of blocks (schools) each of which has a certain number of threads (students). You can only have so many blocks running at one time (collecting your survey responses requires one proctor per school). In CUDA, thread blocks run on a streaming multiprocessor (SM). The variable CL_DEVICE_MAX_COMPUTE_UNITS tells you how many SMs, 30, a specific card has. This varies drastically based on the hardware -- check out the table in Appendix A of the CUDA C Best Practices Guide. Note that each SM can run only eight blocks simultaneously regardless of the compute capability (1.X or 2.X).
Thread blocks have maximum dimensions (CL_DEVICE_MAX_WORK_ITEM_SIZES). Think of laying out your threads in a grid; you can't have a row with more than 512 threads. You can't have a column with more than 512 threads. And you can't stack more than 64 threads high. Next, there is a maximum (CL_DEVICE_MAX_WORK_GROUP_SIZE) number of threads, 512, that can be grouped together in a block. So your thread blocks' dimensions could be:
512 x 1 x 1
1 x 512 x 1
4 x 2 x 64
64 x 8 x 1
etc...
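As a rough check of the two rules above (per-axis limits and the total thread limit), here is a small sketch. The constants are the numbers quoted in this answer for one particular card, and `block_shape_is_valid` is a made-up helper, not a CUDA or OpenCL call; on a real device you would query the limits instead of hard-coding them.

```python
from math import prod

MAX_BLOCK_DIMS = (512, 512, 64)  # per-axis limits (x, y, z) quoted in the text
MAX_THREADS_PER_BLOCK = 512      # total limit quoted in the text


def block_shape_is_valid(shape, max_dims=MAX_BLOCK_DIMS, max_threads=MAX_THREADS_PER_BLOCK):
    """True if every axis fits its own limit and the product fits the total limit."""
    if len(shape) != 3 or any(d < 1 for d in shape):
        return False
    within_axes = all(d <= limit for d, limit in zip(shape, max_dims))
    return within_axes and prod(shape) <= max_threads


for shape in [(512, 1, 1), (1, 512, 1), (4, 2, 64), (64, 8, 1), (32, 32, 1), (1, 1, 128)]:
    print(shape, block_shape_is_valid(shape))
```

The first four shapes are the ones listed above and pass. (32, 32, 1) fails because 1024 threads exceeds the 512 total, and (1, 1, 128) fails because the z axis is capped at 64 even though the total is small.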
Note that as of Compute Capability 2.X, your blocks can have at most 1024 threads. Lastly, the variable CL_NV_DEVICE_WARP_SIZE specifies the warp size, 32 (number of students per classroom). In Compute Capability 1.X devices, memory transfers and instruction dispatch occur at the Half-Warp granularity (you only have 16 calculators per classroom). In Compute Capability 2.0, memory transfers are grouped by Warp, so 32 fetches simultaneously, but instruction dispatch is still only grouped by Half-Warp. For Compute Capability 2.1, both memory transfers and instruction dispatch occur by Warp, 32 threads. These things can and will change in future hardware.
So, my word! Let's get to the point:
In Summary:
I have described the nuances of warp/thread layout and other such stuff, but here are a couple of things to keep in mind. First, your memory access should be "groupable" in sets of 16 or 32. So keep the X dimension of your blocks a multiple of 32. Second, and most important to get the most from a specific GPU, you need to maximize occupancy. Don't have 5 blocks of 512 threads. And don't have 1,000 blocks of 10 threads. I would strongly recommend checking out the Excel-based spreadsheet, the CUDA Occupancy Calculator (works in OpenOffice too?? I think??), which will tell you what the GPU occupancy will be for a specific kernel call (thread layout and shared memory requirements). I hope this explanation helps!
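To see why those two launch shapes are bad, here is a deliberately crude sketch using only the numbers quoted in this answer (30 SMs, at most eight resident blocks per SM, warp size 32). It assumes blocks spread evenly over the SMs and ignores registers, shared memory, and the per-SM thread limit, so it is no substitute for the occupancy spreadsheet. `launch_summary` is a hypothetical helper.

```python
NUM_SMS = 30           # SM count quoted in the text for one specific card
MAX_BLOCKS_PER_SM = 8  # resident-block limit quoted in the text
WARP_SIZE = 32


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def launch_summary(num_blocks: int, threads_per_block: int) -> dict:
    """Crude picture of a launch; assumes an even spread of blocks over SMs."""
    warps = ceil_div(threads_per_block, WARP_SIZE)
    busy_sms = min(num_blocks, NUM_SMS)
    blocks_on_busy_sm = min(MAX_BLOCKS_PER_SM, ceil_div(num_blocks, NUM_SMS))
    return {
        "busy_sms": busy_sms,
        "idle_sms": NUM_SMS - busy_sms,
        # Fraction of warp lanes that hold a real thread.
        "lane_utilization": threads_per_block / (warps * WARP_SIZE),
        # Upper bound from the block limit alone, not from other resources.
        "resident_threads_per_busy_sm": blocks_on_busy_sm * threads_per_block,
    }


print(launch_summary(5, 512))    # 25 of 30 SMs sit idle
print(launch_summary(1000, 10))  # every SM busy, but only 80 threads resident and 31% of lanes used
```

With 5 blocks of 512 threads most of the chip has nothing to do; with 1,000 blocks of 10 threads every SM is busy but the eight-block limit caps each one at 80 threads, each in a mostly empty warp.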
Best Answer
A GTX590 contains 2x the numbers you mentioned, since there are 2 GPUs on the card. Below, I focus on a single chip.
Blocks are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while a few of them go idle. I don't know why.
The relationship between threads and cores is not that direct. There are 32 "basic" ALUs in each SM, the ones that handle such things as single-precision floating point and most 32-bit integer and logic instructions. But there are only 16 load/store units, so if the warp instruction that is currently being processed is a load/store, it must be scheduled twice. And there are only 4 special function units, which do things such as trigonometry. So these instructions must be scheduled 32 / 4 = 8 times.
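The scheduling counts above are just the warp size divided by the number of units of each kind. A minimal sketch, using the per-SM unit counts quoted in this answer (the function name `issue_passes` is made up for illustration):

```python
WARP_SIZE = 32

# Functional units per SM as quoted in the text.
UNITS_PER_SM = {"basic ALU": 32, "load/store": 16, "special function": 4}


def issue_passes(units: int, warp_size: int = WARP_SIZE) -> int:
    """How many times one warp instruction must be scheduled to cover all of its threads."""
    return -(-warp_size // units)  # ceiling division


for name, units in UNITS_PER_SM.items():
    print(f"{name}: {issue_passes(units)} pass(es)")
# basic ALU: 1, load/store: 2, special function: 8
```

So a kernel dominated by trigonometry can keep a warp on the special function units eight times longer per instruction than one doing plain arithmetic.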
No, there can be many more than 32 threads "in flight" at the same time in a single SM.
No, it is not only memory operations that cause warps to be replaced. The ALUs are also deeply pipelined, so new warps will be swapped in as data dependencies occur for values that are still in the pipeline. So, if the code contains two instructions where the second one uses the output from the first, the warp will be put on hold while the value from the first instruction makes its way through the pipeline.
A multiprocessor can process more than one block at a time but a block cannot move to another MP once processing on it has started. The number of threads in a block that are currently in flight depends on how many resources the block uses. The CUDA Occupancy Calculator will tell you how many blocks will be in flight at the same time based on the resource usage of your specific kernel.
No, a block cannot be divided to work on two multiprocessors. A whole block is always processed by a single multiprocessor. If the given multiprocessor does not have enough resources to process at least one block with your kernel, you will get a kernel launch error and your program won't run at all.
It depends on how you define a thread as "running". The GPU will typically have many more than 512 threads consuming various resources on the chip at the same time.
See @harrism's answer in this question: CUDA: How many concurrent threads in total?