CUDA block/grid dimensions: when to use dim3


I need some clarification on using dim3 to set the number of threads in my CUDA kernel.

I have an image in a 1D float array, which I'm copying to the device with:

checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));

Now I need to set the grid and block sizes to launch my kernel:

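dim3 blockDims(512);
// Round up with integer arithmetic: assuming img.row and img.col are integers,
// ceil() applied to an already-truncated integer quotient has no effect.
dim3 gridDims((img.row * img.col * 3 + blockDims.x - 1) / blockDims.x);
myKernel<<<gridDims, blockDims>>>(...);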

I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using

unsigned int num_blocks = (img.row * img.col * 3 + 512 - 1) / 512;
myKernel<<<num_blocks, 512>>>(...);

instead?

Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

And when I'm not using dim3, I'll just use one index?

Thank you very much,

Best Answer

The way you arrange the data in memory is independent of how you configure the threads of your kernel.

Memory is always a 1D contiguous space of bytes. The access pattern, however, depends on how you interpret your data and on whether you access it with 1D, 2D, or 3D blocks of threads.

dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.

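The same default applies to both the block and the grid dimensions, and a plain integer in the launch configuration is implicitly converted to a dim3 with y and z set to 1.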

Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3

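So in both cases, dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...), the kernel still has access to threadIdx.y and threadIdx.z (and to blockIdx.y and blockIdx.z). They are simply always zero, because those dimensions have size 1.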
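To see this for yourself, here is a minimal sketch that records the launch configuration from inside a kernel. The names recordConfig and queryConfig are made up for this illustration, and the error handling is deliberately bare.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper kernel: the first thread of the first block along x
// records the grid and block dimensions the launch actually used.
__global__ void recordConfig(unsigned int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[0] = gridDim.x;  out[1] = gridDim.y;  out[2] = gridDim.z;
        out[3] = blockDim.x; out[4] = blockDim.y; out[5] = blockDim.z;
    }
}

// Hypothetical host helper: launches recordConfig with either dim3 arguments
// or plain integers and copies the six recorded values into result.
cudaError_t queryConfig(bool useDim3, unsigned int numBlocks,
                        unsigned int threadsPerBlock, unsigned int result[6])
{
    unsigned int *out_d = nullptr;
    cudaError_t err = cudaMalloc(&out_d, 6 * sizeof(unsigned int));
    if (err != cudaSuccess) return err;

    if (useDim3) {
        dim3 gridDims(numBlocks);         // y and z default to 1
        dim3 blockDims(threadsPerBlock);  // y and z default to 1
        recordConfig<<<gridDims, blockDims>>>(out_d);
    } else {
        // Plain integers are implicitly converted to dim3(n, 1, 1).
        recordConfig<<<numBlocks, threadsPerBlock>>>(out_d);
    }

    err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy on the default stream waits for the kernel to finish.
        err = cudaMemcpy(result, out_d, 6 * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(out_d);
    return err;
}
```

Calling queryConfig(true, 4, 512, a) and queryConfig(false, 4, 512, b) should fill both arrays with the same six values: 4, 1, 1 for the grid and 512, 1, 1 for the block.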

Since thread IDs start at zero, you can compute a memory position in row-major order that also uses the y dimension:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

int gid = img.col * y + x; 

With a 1D launch this still works, because blockIdx.y and threadIdx.y are both zero, so y is 0 and gid equals x.
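As a rough sketch of the two launch shapes side by side, the code below scales an image stored as a 1D float array, once with a 1D launch and once with a 2D launch. The names scale1D, scale2D and scaleImage, the block sizes, and the width/height parameters are assumptions made for this example. The dimensions are passed as kernel arguments because a host-side struct such as img is not visible on the device.

```cuda
#include <cuda_runtime.h>

// 1D launch: one thread per element; y is implicitly 0, so gid is just x.
__global__ void scale1D(float *data, int n, float factor)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) data[gid] *= factor;  // guard the partially filled last block
}

// 2D launch: one thread per pixel, mapped to memory in row-major order.
__global__ void scale2D(float *data, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) data[width * y + x] *= factor;
}

// Hypothetical host helper: scales a non-empty width x height image in place.
cudaError_t scaleImage(float *img, int width, int height, float factor,
                       bool use2D)
{
    int n = width * height;
    float *img_d = nullptr;
    cudaError_t err = cudaMalloc(&img_d, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(img_d, img, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        if (use2D) {
            dim3 blockDims(16, 16);  // z defaults to 1
            dim3 gridDims((width + blockDims.x - 1) / blockDims.x,
                          (height + blockDims.y - 1) / blockDims.y);
            scale2D<<<gridDims, blockDims>>>(img_d, width, height, factor);
        } else {
            int threads = 256;
            int blocks = (n + threads - 1) / threads;  // integer round-up
            scale1D<<<blocks, threads>>>(img_d, n, factor);
        }
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(img, img_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(img_d);
    return err;
}
```

Both paths should leave the same values in the array. The difference is only in how thread indices map to elements, which is why the choice usually comes down to which indexing reads more naturally for your data.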

To sum up, it does not matter whether you use a dim3 structure. Using one makes it clearer where the thread configuration is defined. The 1D, 2D, or 3D access pattern depends on how you interpret your data and on how you access it with 1D, 2D, or 3D blocks of threads.
