2D convolution on a 32×32 grayscale image on an FPGA using Verilog for CNN inference

convolution · fpga · verilog · zynq

Hi, I am new to the world of convolutional neural networks and would like to implement a 2D convolution operation using the sliding-window approach on a Xilinx FPGA. The input is a 32×32 single-channel (grayscale) image, on which six 5×5 kernels are convolved to produce six output feature maps. Now, assuming I have sufficient DSPs on the FPGA, how would I parallelise the problem? After doing some research, I understand that we can parallelise over the input feature maps, the output feature maps, the kernel, or some combination of these. For a 5×5 kernel, I would need 25 multiplications and 25 additions (including the bias). If I have 25 DSPs operating in parallel, I can perform all the multiplies in one clock cycle. Is my understanding of the parallelisation problem correct up to now?
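As a quick sanity check on the arithmetic in the question, here is a small Python back-of-the-envelope calculation. The shapes (32×32 input, six 5×5 kernels, stride 1, no padding) come from the question; everything else is just counting.

```python
# Back-of-the-envelope check of the parallelisation arithmetic.
# Shapes assumed from the question: 32x32 single-channel input,
# six 5x5 kernels, stride 1, "valid" (no padding) convolution.
K = 5                       # kernel is K x K
N_KERNELS = 6
OUT = 32 - K + 1            # each output feature map is 28 x 28

mults_per_pixel = K * K     # 25 multiplies per output pixel per kernel
adds_per_pixel = K * K      # 24 to sum the products + 1 for the bias = 25

# 25 DSPs do all multiplies for one kernel in one clock; fully
# unrolling across the six kernels as well multiplies that out:
dsps_fully_parallel = mults_per_pixel * N_KERNELS

print(OUT, mults_per_pixel, dsps_fully_parallel)   # 28 25 150
```

So with 25 DSPs you parallelise one kernel's multiplies per clock, and with 150 you could additionally unroll across all six output feature maps.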

Now, considering that the input is stored in buffers and streamed to my convolution module, and the weights are preloaded into the module, how is the sliding-window computation performed? I realise I would have to use counters to keep track of the window position until it reaches the end of the image width N_W and height N_H respectively. There is quite a lot of literature about implementing this using systolic arrays of multipliers, but I am not sure I understand those.
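The counter scheme mentioned above can be modelled in a few lines of Python: a vertical counter and a horizontal counter enumerate every valid window position in raster-scan order. The function name and signature are illustrative, not from any real HDL.

```python
# Software model of the counter-based sliding window: two nested counters
# enumerate the top-left corner of every valid K x K window, in
# raster-scan order. N_W / N_H are the image width and height.
def window_positions(N_W, N_H, K):
    for row in range(N_H - K + 1):       # vertical counter
        for col in range(N_W - K + 1):   # horizontal counter
            yield row, col

positions = list(window_positions(32, 32, 5))
print(len(positions))                # 784 = 28 * 28 output pixels
print(positions[0], positions[-1])   # (0, 0) (27, 27)
```

In hardware these are simply the row/column counters that terminate at N_H − K and N_W − K; the systolic-array literature replaces the explicit window fetch with data marching through a grid of multiply-accumulate cells, but the iteration space is the same.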

Could someone help me understand a dataflow for the convolution operation?

Any help would be greatly appreciated.
An eager student.

Best Answer

The problem of processing a 2-D kernel of data over a large dataset (not just convolution) comes up so regularly in HD video processing that I settled on a generic way of handling it that I use all the time.

I developed a generic "kernel generator" that uses line buffers and registers to present all of the input data for a given output pixel in parallel. An N×N kernel requires N-1 line buffers and N-1 registers. It assumes that the data is arriving in "raster-scan" order, like a TV signal.
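Here is a behavioural Python model of that structure, to make the dataflow concrete: a chain of N−1 line buffers (each one image-line deep) delays the incoming raster stream, and N short shift registers hold the horizontal taps, so every clock the full N×N neighbourhood is available in parallel. The function name is mine; the structure follows the description above.

```python
from collections import deque

def kernel_generator(stream, W, N):
    """Behavioural model of the line-buffer kernel generator.

    Consumes pixels in raster-scan order (image width W) and yields the
    N x N window every clock. Uses N-1 line buffers (each W deep) plus
    N row shift-registers of N pixels, mirroring the hardware structure.
    Windows near image edges wrap and must be fixed by a later edge stage.
    """
    line_delays = [deque([0] * W, maxlen=W) for _ in range(N - 1)]
    window = [deque([0] * N, maxlen=N) for _ in range(N)]
    for px in stream:
        # Tap the input plus the output of each line delay in the chain.
        taps = [px] + [d[0] for d in line_delays]
        # Shift the new pixel through the line-delay chain.
        v = px
        for d in line_delays:
            nxt = d[0]
            d.append(v)
            v = nxt
        # Shift each tap into its row register (oldest line = top row).
        for row, tap in zip(window, reversed(taps)):
            row.append(tap)
        yield [list(row) for row in window]

# Tiny demo: 3x3 image (pixel value = raster index), 2x2 kernel.
wins = list(kernel_generator(range(9), W=3, N=2))
print(wins[4])   # [[0, 1], [3, 4]] -- the first fully valid 2x2 window
```

Note that the generator itself has no notion of edges: until it is primed ((N−1)·W + N−1 pixels in), and whenever the window straddles a line boundary, the output wraps; that is exactly what the edge-processing stage described below deals with.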

[schematic: the kernel generator, drawn in CircuitLab — the input stream feeding a chain of N−1 line buffers, each tap feeding a row of registers that presents the N×N window in parallel]

The next stage could be your multipliers, but more often than not I need to handle the edges in some special way, such as zeroing out the values that fall outside the input data, reflecting the data across the edge, or whatever. Therefore, I have some standard modules (N² pixels in, N² pixels out) that consist of counters and multiplexers which do this edge processing before passing the data to the actual data-processing module.
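The two edge policies mentioned can be sketched as pure index logic, which is what those counters and multiplexers implement in hardware. The function names and the particular reflection convention below are my own, for illustration only.

```python
# Sketch of the edge-handling stage: before the multipliers, remap or
# zero any window tap whose coordinate falls outside the image.
def reflect_idx(i, n):
    """Symmetric reflection across the border: -1 -> 0, -2 -> 1, n -> n-1."""
    if i < 0:
        i = -i - 1
    if i >= n:
        i = 2 * n - i - 1
    return i

def fetch(img, r, c, mode):
    H, W = len(img), len(img[0])
    if mode == "zero":                    # zero out-of-range taps
        return img[r][c] if 0 <= r < H and 0 <= c < W else 0
    return img[reflect_idx(r, H)][reflect_idx(c, W)]   # reflect across edge

img = [[1, 2], [3, 4]]
print(fetch(img, -1, 0, "zero"))      # 0
print(fetch(img, -1, 0, "reflect"))   # 1  (row -1 reflects to row 0)
```

In hardware the "index" never goes negative, of course: the row/column counters compare the window position against the image bounds and a multiplexer selects either the real tap, a zero, or the mirrored tap.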

For a convolution, you can do all of the multiplies in parallel in the same clock period, but then the adds will have to be pipelined. For example, if you can only add two numbers in one clock period, you'll need a "tree" of adders that's 5 levels deep for a 5×5 kernel. If you can add three or four numbers at a time, you will only need 3 levels of pipeline.
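The pipeline-depth claim is easy to verify: each stage divides the number of remaining terms by the adder fan-in, rounding up.

```python
import math

# Depth of a pipelined adder tree that reduces n terms when each
# pipeline stage can sum fan_in operands per clock.
def adder_tree_depth(n, fan_in=2):
    depth = 0
    while n > 1:
        n = math.ceil(n / fan_in)
        depth += 1
    return depth

print(adder_tree_depth(25, 2))  # 5 levels: 25 -> 13 -> 7 -> 4 -> 2 -> 1
print(adder_tree_depth(25, 3))  # 3 levels: 25 -> 9 -> 3 -> 1
print(adder_tree_depth(25, 4))  # 3 levels: 25 -> 7 -> 2 -> 1
```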

Obviously, the same kernel generator can feed multiple convolutions in parallel if that's what you're doing, but I second Harry Svensson's notion of using FFT techniques if you're doing more than a few of them.