Verilog – Connecting Multipliers and Adders Elegantly

mathparallelsystem-verilogverilogvivado

I am attempting to write a synthesizable verilog (or Systemverilog) module. I also want to make the modul parameterizable, which has presented a problem when trying to connect the following structure of multipliers and adders:
4 column vector multiplier

The problem with this is the following. If I want to make it parameterizable, I don't know how to connect it in verilog. So to be more specific. I want to know which verilog construct to use to achieve this interconnections in a elegant manner.
I have tried with the code bellow, but I get:
"Single value range is not allowed in packed dimension".

The code bellow is Systemverilog:

module vector_multiply #(parameter  N = 8, parameter M = 3)(
    input   clk,
    input   ce,
    input   rst,
    input   [N-1:0] a [M-1:0],
    input   [N-1:0] b [M-1:0],
    output reg [N-1:0] res // result of vector dot product
    );

 localparam ADDER_DEPTH = $clog2(M);
 wire [N-1:0] w [1<<ADDER_DEPTH][ADDER_DEPTH+1]; // multiplier/adder interconnect wires
 reg [N-1:0] mult_res [M-1:0];
 genvar i,j;

// MULT_MACRO: Multiply Function implemented in a DSP48E
//             Artix-7
// Xilinx HDL Language Template, version 2018.2

for (i = 0; i < M; i = i + 1)
begin
    MULT_MACRO #(
    .DEVICE("7SERIES"), // Target Device: "7SERIES" 
    .LATENCY(3),        // Desired clock cycle latency, 0-4
    .WIDTH_A(N),        // Multiplier A-input bus width, 1-25
    .WIDTH_B(N),        // Multiplier B-input bus width, 1-18
    .WIDTH_P(N)
) MULT_MACRO_inst (
   .P(mult_res[i]),     // Multiplier output bus, width determined by WIDTH_P parameter
   .A(a[i]),     // Multiplier input A bus, width determined by WIDTH_A parameter
   .B(b[i]),     // Multiplier input B bus, width determined by WIDTH_B parameter
   .CE(ce),   // 1-bit active high input clock enable
   .CLK(clk), // 1-bit positive edge clock input
   .RST(rst)  // 1-bit input active high reset
    );
end


for (i = 0; i < M; i = i + 1) begin
    w[0][i] = multi_res[i];
end

// If there is a odd number of elements in a vector (multiplications) then, add one so that all adders have a defined input
if ((M % 2) == 1) begin
    w[0][M] = 0; 
end


// The multiplicated results need to be adder (dot product).
for (i = 0; i > ADDER_DEPTH; i = i + 1) begin : layer_loop
   for (j = 0; j < (1<<(i-1)); j = j + 2) begin : inner_loop
      full_adder_Nb #(N) FA_Nb(w[i][j], w[i][j+1], w[i+1][j/2],0);
   end : inner_loop
end :layer_loop

endmodule

In this semi akward atempt I get errors in the following lines:
– w[0][i] = multi_res[i];
– w[0][M] = 0;

Again I am looking for either suggestions how to fix this code, or some better way to write it.

NOTE: N is the number of bits a number has, M is the number of elements in a vector (the pic bellow is for M = 4).

Best Answer

Given that you are building a tree which doubles each row, it's quite a simple structure to build. @awjlogan has given an explanation of the error messages, but I wanted to show an alternative structure.

M is the number of multipliers, which will also be the total number of adders plus one.

If we assign an index to each adder, starting with 1 as the final stage (I'm skipping 0), we can actually build the structure with a for loop for the multipliers, and single for loop for the adders.

// We need M-1 for adder outputs (hence starting at 1), and another M for the multiplier outputs
wire [N-1:0] w [(2*M)-1:1];

// Generate Block
genvar i;
generate

// Connect all the multipliers to the higher numberd wires
for (i = 0; i < M; i = i + 1) begin : multiplier_loop
    localparam j = M + i;   //This is the index in the w bus.
    MULT_MACRO #(
        .DEVICE("7SERIES"), // Target Device: "7SERIES" 
        .LATENCY(3),        // Desired clock cycle latency, 0-4
        .WIDTH_A(N),        // Multiplier A-input bus width, 1-25
        .WIDTH_B(N),        // Multiplier B-input bus width, 1-18
        .WIDTH_P(N)
    ) MULT_MACRO_inst (
       .P(w[j]),            // Multiplier output bus, width determined by WIDTH_P parameter
       .A(a[i]),            // Multiplier input A bus, width determined by WIDTH_A parameter
       .B(b[i]),            // Multiplier input B bus, width determined by WIDTH_B parameter
       .CE(ce),             // 1-bit active high input clock enable
       .CLK(clk),           // 1-bit positive edge clock input
       .RST(rst)            // 1-bit input active high reset
    );
end

// Connect all the adder chains together
for (i = 1; i < M; i = i + 1) begin : adder_loop
      full_adder_Nb #(N) FA_Nb(w[i], w[{i,1'b0}], w[{i,1'b1}],0);
      //For the adder inputs, we effective multiply the current adder index by 2, then access adders
      //with LSBs of 0 and 1. This is thanks to the binary tree structure. It wouldn't work if we
      //started with an index of 0.
end

// Finish generating
endgenerate

//The output is then 
assign res = w[1];

(Note: Code is untested, but should work in theory).

This works due to the indexing of the adders being a standard binary tree. If you take the index of the current adder, and multiply it by 2, you end up with the base index of the two adder outputs up a level in the tree. Simply concatenating 1'b0 and 1'b1 therefore does the address calculation as required.

Here is an example of the indexing for 4 levels (M = 8):

                   __
[x]- 01111 w[15] -| +|               __
[x]- 01110 w[14] -|__|- 00111 w[7] -| +|
                   __               |  |               __
[x]- 01101 w[13] -| +|              |  |- 00011 w[3] -| +|
[x]- 01100 w[12] -|__|- 00110 w[6] -|__|              |  |
                   __                __               |  |- 00001 w[1]
[x]- 01011 w[11] -| +|- 00101 w[5] -| +|              |  |
[x]- 01010 w[10] -|__|             -|  |- 00010 w[2] -|__|
                   __               |  |
[x]- 01001 w[9 ] -| +|- 00100 w[4] -|__|
[x]- 01000 w[8 ] -|__| 
                            Each adder connects to {idx,1} and {idx,0}

This should work for any size chain. However be aware it does not account for overflows. Really for each level you need one extra bit in the width of the adder to cope with overflow in the summation. You can achieve this by basically making another parameter used for the output width of all adders and multipliers equal to N + $clog2(M). Any extra bits mid chain should optimise away.