2D convolution on a 32×32 grayscale image on an FPGA using Verilog for CNN inference

convolution · fpga · verilog · zynq

Hi, I am new to the world of convolutional neural networks and would like to implement a 2D convolution operation using the sliding-window approach on a Xilinx FPGA. The input is a 32×32 single-channel (grayscale) image, on which six 5×5 kernels are convolved to produce six output feature maps. Now, assuming I have sufficient DSPs on the FPGA, how would I parallelise the problem? After doing some research, I understand that we can parallelise over the input feature maps, the output feature maps, the kernel, or some combination of these. For a 5×5 kernel, I would need 25 multiplications and 25 additions (including the bias). If I have 25 DSPs operating in parallel, I can perform all the multiplies in one clock cycle. Is my understanding of the parallelisation problem correct up to now?
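As a quick sanity check on the arithmetic in the question, here is a small Python back-of-the-envelope calculation. The shapes (32×32 input, six 5×5 kernels, stride 1, no padding) come from the question; everything else is just counting.

```python
# Back-of-the-envelope check of the parallelisation arithmetic.
# Shapes assumed from the question: 32x32 single-channel input,
# six 5x5 kernels, stride 1, "valid" (no padding) convolution.
K = 5                       # kernel is K x K
N_KERNELS = 6
OUT = 32 - K + 1            # each output feature map is 28 x 28

mults_per_pixel = K * K     # 25 multiplies per output pixel per kernel
adds_per_pixel = K * K      # 24 to sum the products + 1 for the bias = 25

# 25 DSPs do all multiplies for one kernel in one clock; fully
# unrolling across the six kernels as well multiplies that out:
dsps_fully_parallel = mults_per_pixel * N_KERNELS

print(OUT, mults_per_pixel, dsps_fully_parallel)   # 28 25 150
```

So with 25 DSPs you parallelise one kernel's multiplies per clock, and with 150 you could additionally unroll across all six output feature maps.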

Now, considering that the input is stored in buffers and streamed to my convolution module, and the weights are preloaded into the module, how is the sliding-window computation performed? I realise I would have to use counters to keep track of the window position until it reaches the end of the image width N_W and height N_H respectively. There is quite a lot of literature about implementing this using systolic arrays of multipliers, but I am not sure I understand those.
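The counter scheme mentioned above can be modelled in a few lines of Python: a vertical counter and a horizontal counter enumerate every valid window position in raster-scan order. The function name and signature are illustrative, not from any real HDL.

```python
# Software model of the counter-based sliding window: two nested counters
# enumerate the top-left corner of every valid K x K window, in
# raster-scan order. N_W / N_H are the image width and height.
def window_positions(N_W, N_H, K):
    for row in range(N_H - K + 1):       # vertical counter
        for col in range(N_W - K + 1):   # horizontal counter
            yield row, col

positions = list(window_positions(32, 32, 5))
print(len(positions))                # 784 = 28 * 28 output pixels
print(positions[0], positions[-1])   # (0, 0) (27, 27)
```

In hardware these are simply the row/column counters that terminate at N_H − K and N_W − K; the systolic-array literature replaces the explicit window fetch with data marching through a grid of multiply-accumulate cells, but the iteration space is the same.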

Could someone help me understand a dataflow for the convolution operation?

Any help would be greatly appreciated.
An eager student.

Best Answer

The problem of processing a 2-D kernel of data over a large dataset (not just convolution) comes up so regularly in HD video processing that I settled on a generic way of handling it that I use all the time.

I developed a generic "kernel generator" that uses line buffers and registers to present all of the input data for a given output pixel in parallel. An N×N kernel requires N-1 line buffers and N-1 registers. It assumes that the data is arriving in "raster-scan" order, like a TV signal.
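Here is a behavioural Python model of that structure, to make the dataflow concrete: a chain of N−1 line buffers (each one image-line deep) delays the incoming raster stream, and N short shift registers hold the horizontal taps, so every clock the full N×N neighbourhood is available in parallel. The function name is mine; the structure follows the description above.

```python
from collections import deque

def kernel_generator(stream, W, N):
    """Behavioural model of the line-buffer kernel generator.

    Consumes pixels in raster-scan order (image width W) and yields the
    N x N window every clock. Uses N-1 line buffers (each W deep) plus
    N row shift-registers of N pixels, mirroring the hardware structure.
    Windows near image edges wrap and must be fixed by a later edge stage.
    """
    line_delays = [deque([0] * W, maxlen=W) for _ in range(N - 1)]
    window = [deque([0] * N, maxlen=N) for _ in range(N)]
    for px in stream:
        # Tap the input plus the output of each line delay in the chain.
        taps = [px] + [d[0] for d in line_delays]
        # Shift the new pixel through the line-delay chain.
        v = px
        for d in line_delays:
            nxt = d[0]
            d.append(v)
            v = nxt
        # Shift each tap into its row register (oldest line = top row).
        for row, tap in zip(window, reversed(taps)):
            row.append(tap)
        yield [list(row) for row in window]

# Tiny demo: 3x3 image (pixel value = raster index), 2x2 kernel.
wins = list(kernel_generator(range(9), W=3, N=2))
print(wins[4])   # [[0, 1], [3, 4]] -- the first fully valid 2x2 window
```

Note that the generator itself has no notion of edges: until it is primed ((N−1)·W + N−1 pixels in), and whenever the window straddles a line boundary, the output wraps; that is exactly what the edge-processing stage described below deals with.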

[schematic: the kernel generator, drawn in CircuitLab — the input stream feeding a chain of N−1 line buffers, each tap feeding a row of registers that presents the N×N window in parallel]

The next stage could be your multipliers, but more often than not I need to handle the edges in some special way, such as zeroing out the values that fall outside the input data, reflecting the data across the edge, or whatever. Therefore, I have some standard modules (N² pixels in, N² pixels out) that consist of counters and multiplexers which do this edge processing before passing the data to the actual data-processing module.
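The two edge policies mentioned can be sketched as pure index logic, which is what those counters and multiplexers implement in hardware. The function names and the particular reflection convention below are my own, for illustration only.

```python
# Sketch of the edge-handling stage: before the multipliers, remap or
# zero any window tap whose coordinate falls outside the image.
def reflect_idx(i, n):
    """Symmetric reflection across the border: -1 -> 0, -2 -> 1, n -> n-1."""
    if i < 0:
        i = -i - 1
    if i >= n:
        i = 2 * n - i - 1
    return i

def fetch(img, r, c, mode):
    H, W = len(img), len(img[0])
    if mode == "zero":                    # zero out-of-range taps
        return img[r][c] if 0 <= r < H and 0 <= c < W else 0
    return img[reflect_idx(r, H)][reflect_idx(c, W)]   # reflect across edge

img = [[1, 2], [3, 4]]
print(fetch(img, -1, 0, "zero"))      # 0
print(fetch(img, -1, 0, "reflect"))   # 1  (row -1 reflects to row 0)
```

In hardware the "index" never goes negative, of course: the row/column counters compare the window position against the image bounds and a multiplexer selects either the real tap, a zero, or the mirrored tap.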

For a convolution, you can do all of the multiplies in parallel in the same clock period, but then the adds will have to be pipelined. For example, if you can only add two numbers in one clock period, you'll need a "tree" of adders that's 5 levels deep for a 5×5 kernel. If you can add three or four numbers at a time, you will only need 3 levels of pipeline.
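The pipeline-depth claim is easy to verify: each stage divides the number of remaining terms by the adder fan-in, rounding up.

```python
import math

# Depth of a pipelined adder tree that reduces n terms when each
# pipeline stage can sum fan_in operands per clock.
def adder_tree_depth(n, fan_in=2):
    depth = 0
    while n > 1:
        n = math.ceil(n / fan_in)
        depth += 1
    return depth

print(adder_tree_depth(25, 2))  # 5 levels: 25 -> 13 -> 7 -> 4 -> 2 -> 1
print(adder_tree_depth(25, 3))  # 3 levels: 25 -> 9 -> 3 -> 1
print(adder_tree_depth(25, 4))  # 3 levels: 25 -> 7 -> 2 -> 1
```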

Obviously, the same kernel generator can feed multiple convolutions in parallel if that's what you're doing, but I second Harry Svensson's notion of using FFT techniques if you're doing more than a few of them.