Electronic – Arithmetic Division in Verilog

fpgaverilogxilinx

module averager(
    clk,
    rst,
    n,
    sum,
    cnt,
    out,
    avg
 );

input  [9:0] n;
input clk;
input rst;
output reg [19:0] out;
output reg [9:0] cnt;
output reg [19:0] sum;
output reg [9:0] avg;

integer i = 0;

always @(posedge clk ) 
    if (rst == 1) begin 
        sum = 20'b0;
        cnt = 10'b0;
        out = 20'b0; 
        avg = 10'b0;
    end else if (rst == 0) begin
        sum = sum + n;
        out = sum;
        cnt = cnt + 1;
        avg = 0;

        for (i=0; i<641; i=i+1) begin
            if(out >= cnt) begin
                out = out - cnt;
                avg = avg + 1;
            end
        end
    end
endmodule

The above is the code to implement a cumulative moving average filter. The for loop is used for division to find the average and involves repeated subtraction. However I am getting the following warning and error:

WARNING:Xst:2254 - Area constraint could not be met for block <averager>, final ratio is 509.
WARNING:Xst:1336 -  (*) More than 100% of Device resources are used
ERROR:Pack:18 - The design is too large for the given device and package. 
  Please check the Design Summary section to see which resource requirement for
  your design exceeds the resources available in the device.

This must be because I am using large values in the for loop and thus am getting a large circuit that can't be implemented. I am looking for an alternative of the for loop, which could find the average for me. I just need the quotient value.

Design Properties:
Family: Spartan3E
Device: XC3S500E

Best Answer

You are correct with the guessing the for loop.

The for-loop logic is huge when it after it static unrolls. With your current code, you cannot handle the worst case scenario where n=1023. To cover this with your current code you'd need a for loop with 1024 iterations.

Instead of a up counter, you can use a down counter and only examine a slice of the array, where the index represents lsb of the array slice. For example:

for (i=9; i>=0; i=i-1) begin // lsb index of the slice
  if (out[i+:11] >= cnt) begin // 11-bit slice compare
    out[i+:11] = out[i+:11] - cnt; // 11-bit slice subtraction
    avg[i] = 1'b1; // 1-bit assign
  end
end

This for loop unravels to 10 iterations (9 to 0), each iteration only looks at a 11-bit slice of out and only one bit of avg. You might not be familiar with the +: operator. It is a bit-slice operator introduced in IEEE Std 1364-2001. Left side if the start index (dynamic is allowed) and the right side is the bit with offset (must be a static constant). You can read more about it here.

Since it is a count down, we can can safely assume (proven mathematically) the upper bits of the slice are zeros and we will never have underflow with the guarding if condition. So we now have ten 11-bit subtracters each with 1-bit assigners which is much smaller logic then the original 642 (should be 1024) 20-bit subtracters each with 10-bit adder.

Few other suggestions:

  1. Use ANSI style header (supported since IEEE Std 1364-2001). It is few lines of code and thereby less prone to typos and copy-paste mistakes.
  2. Separate the synchronous logic and combination logic. This does mean declaring more signnals, but generate gives you better control on what is a flop and what is comb logic. It also follows best practices. Your for loop should exits in the comb logic.
  3. Use non-blocking assignments (<=) in your synchronous logic block. This follows best coding practices and prevents flop-to-flop race conditions in simulation.
  4. out can be simplified to a 10-bit register, assuming you did suggestions 2 & 3. This is because we know the upper bits will always be zeros, so we can save 10 flops.

proof of concept: with a simple SystemVerilog testbench here:

module averager( // ANSI style header
  input              clk, rst,
  input [9:0]        n,
  output reg [19:0]  sum,
  output reg [9:0]   cnt,
  output reg [9:0]  out, // was [19:0]
  output reg [9:0]   avg
  );

  reg [19:0] next_sum, next_out;
  reg [9:0]  next_cnt, next_avg;

  integer i;

  always @* begin
    next_sum = sum + n;
    next_out = next_sum;
    next_cnt = cnt + 1'b1;
    next_avg = 10'b0;

    for (i=9; i>=0; i=i-1) begin // lsb index for slice
      if (next_out[i+:11] >= next_cnt) begin // 11-bit slice compare
        next_out[i+:11] = next_out[i+:11] - next_cnt; // 11-bit slice subtract
        next_avg[i] = 1'b1;  // 1-bit assign
      end
    end
  end

  always @(posedge clk) begin
    if (rst == 1'b1) begin
      sum <= 20'b0;
      cnt <= 10'b0;
      out <= 10'b0; // was 20'b0
      avg <= 10'b0;
    end
    else begin
      sum <= next_sum;
      cnt <= next_cnt;
      out <= next_out[9:0]; // only assign lsb
      avg <= next_avg;
    end
  end

endmodule

If you still experience area issues or performance is not good enough, then you should pipeline your design and/or seeing if there are there is dedicated divider+remainder module defined in your FPGA data sheet that you can instantiate.