Electronic – FPGA maximum frequency : limiting factor

clock-speed, fir, fpga, intel-fpga, quartus

I would like to know what, in general, limits the maximum clock frequency of a circuit implemented in an FPGA.
Specifically, I am building some FIR filters in Quartus and simulating them on an FPGA of the Cyclone II family.

My simulation shows that a second-order FIR in direct form can be clocked at a higher frequency than a second-order FIR in transposed form (420 MHz vs 387 MHz).
I did not expect this, given that the critical path of the direct form is longer (two adders + one multiplier) than that of the transposed form (one adder + one multiplier).

Is this because the direct form has a more parallel architecture than the transposed form, and the FPGA 'likes' this?
[Figure 1: direct form]
[Figure 2: transposed form]

Best Answer

I suspect the difference is due to the negative coefficient in the 2nd case (according to the order of your diagrams).

Because your multiplying coefficients are all powers of 2, your multiplies can all be done by simple bit selects. For example, assuming you're doing 16-bit math, x*0.25 can be calculated as simply {2'b0, x[15:2]} (using Verilog notation).
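As a minimal sketch of that bit-select multiply (the module and port names are hypothetical, and the 16-bit default width is only the assumption used above), the unsigned case zero-fills the top bits, while a signed input would need sign extension instead:

```verilog
// Sketch only: x * 0.25 implemented as wiring, with no adders or multipliers.
// Module and port names are illustrative, not taken from the question.
module quarter #(
    parameter W = 16
) (
    input  wire        [W-1:0] x_u,  // unsigned input
    input  wire signed [W-1:0] x_s,  // signed (two's-complement) input
    output wire        [W-1:0] q_u,  // x_u / 4, truncated
    output wire signed [W-1:0] q_s   // x_s / 4, rounded toward minus infinity
);
    // Unsigned: drop the two LSBs and zero-fill the two MSBs.
    assign q_u = {2'b00, x_u[W-1:2]};

    // Signed: replicate the sign bit instead; equivalent to x_s >>> 2.
    assign q_s = {{2{x_s[W-1]}}, x_s[W-1:2]};
endmodule
```

Neither assignment contains any logic, so a synthesizer should reduce both to routing.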

This means your multiplications with positive coefficients are essentially free, and require no time at all.

Multiplying by a negative coefficient, however, means making a 2's-complement calculation, requiring inverting the bits and adding 1. That "adding 1" step implies a carry chain with delay equivalent to an adder of the same width.
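For comparison, here is a sketch of a negative power-of-two coefficient, assuming a hypothetical coefficient of -0.25 and signed 16-bit data (the actual value in the question's diagram is not known here). The shift is still free, but the negation introduces an increment across the full word width:

```verilog
// Sketch only: x * (-0.25) for signed data. The coefficient is an assumed example.
module neg_quarter #(
    parameter W = 16
) (
    input  wire signed [W-1:0] x,
    output wire signed [W-1:0] y
);
    // Arithmetic shift right by 2: wiring only.
    wire signed [W-1:0] q = x >>> 2;

    // Two's-complement negation: invert every bit, then add 1.
    // The "+ 1" is what creates the W-bit carry chain.
    assign y = ~q + 1'b1;
endmodule
```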

So now you're effectively comparing two systems that both have a critical path equivalent to two adders, and it's down to luck which one happens to synthesize with less delay.

If you're using SystemVerilog or some other higher-level synthesis tool, the tool might even notice that one of the sums in the first version can be pipelined (calculated one clock cycle in advance) and thus reduce the critical path to a single adder.
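Written out by hand, that kind of retiming might look like the sketch below. The coefficients (1/2, 1/4, 1/4) are hypothetical placeholders rather than the values from the question, and whether a given tool performs this transformation automatically depends on the tool and its settings. The idea is that the terms involving x[n-1] and x[n-2] do not depend on the current sample, so their sum can be registered one cycle early:

```verilog
// Sketch only: second-order direct-form FIR with the delayed-tap sum precomputed.
// y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2], with assumed b0 = 1/2, b1 = b2 = 1/4.
module fir2_direct_retimed #(
    parameter W = 16
) (
    input  wire                clk,
    input  wire signed [W-1:0] x,
    output reg  signed [W-1:0] y   // registered output, one cycle of latency
);
    reg signed [W-1:0] d1;       // holds x[n-1]
    reg signed [W-1:0] partial;  // holds b1*x[n-1] + b2*x[n-2]

    always @(posedge clk) begin
        d1 <= x;

        // Computed one cycle ahead: when the input is x[n-1], d1 holds x[n-2].
        partial <= (x >>> 2) + (d1 >>> 2);

        // Only one adder sits between registers on this path.
        y <= (x >>> 1) + partial;
    end
endmodule
```

Each register-to-register path now contains a single adder, which is the same critical-path length the transposed form has.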

Related Solutions

Electronic – LUT vs. hard IP based multipliers on Spartan-3 FPGA for constant coefficient multiplication

According to the datasheet, the hard multiplier takes between 4 and 5 ns to propagate from inputs to outputs in combinational mode. You'll lose a few hundred more ps getting between the multiplier and the rest of your logic. If that's fast enough, then just make use of it.

If not, build your LUT-based multiplier by just writing some code with the * operator in it, synthesise it, place and route, and see if that's fast enough. You may need an attribute to force it not to use the hard multipliers (see the MULT_STYLE attribute in the XST manual). You could even try just forcing a single LUT-based (non-constant) multiplier with that constraint and see what the result is - that's a very quick test.

Only if those fail should you go down the route of hand-building a LUT-based structure - and even then only if you've looked at the output of the synthesiser and are pretty sure you can beat it for some reason. The synthesisers have been tuned to work out constant coefficient multipliers very well in my experience - I doubt coregen will gain much.

Wet finger estimate: A LUT delay is ~0.7ns. Assuming routing delays are of a similar magnitude, you can afford a chain of only 3-4 LUTs in the delay of the hard multiplier. It seems unlikely to me that you'll achieve what you need in that depth of logic.

Using a mif file in Quartus

You can use the memory IP cores to create a memory with initial mif content. You can check the IP core user guide for more information.

Another solution is to use VHDL attributes to initialize the content of your memory signal. You have to be confident that your code is indeed inferred as a ROM by Altera's tools, otherwise the attribute will be ignored. This is the example usage from Altera's documentation:

type mem_t is array(0 to 255) of unsigned(7 downto 0);
signal ram : mem_t;
attribute ram_init_file : string;
attribute ram_init_file of ram : signal is "my_init_file.mif";

A third way, which I prefer, is to initialize the memory in VHDL code, for example:

type rom_t is array(0 to 1023) of signed(15 downto 0);

function fill_sin_rom return rom_t is
    variable ret : rom_t;
begin
    for i in ret'range loop
        ret(i) := to_signed(integer((2.0**15 - 1.0)*sin(2.0*MATH_PI*real(i)/1024.0)), 16);
    end loop;
    return ret;
end function fill_sin_rom;

constant sin_rom : rom_t := fill_sin_rom;

This code would go in a package or in an architecture's declarative section. It requires use ieee.math_real.all and use ieee.numeric_std.all to work. The advantage of this solution is that it also works in simulation, while the attribute would not.
