Electronic – FPGA firmware design: How big is too big

fpgavhdlxilinx

I have a particularly large signal processing transform that needs to be ported from matlab to VHDL. It definitely requires some kind of resource sharing. A bit of calculation gave me the following:

512 ffts of 64-points
41210 multiply-add operations

Considering the largest Virtex 6 FPGA has ~2000 DSP48E blocks, I know that I can resource-share in order to re-use the resources multiple times. Execution time isn't really an issue, the processing time can take relatively long in FPGA terms.

Looking at resource usage, using radix-2 lite architecture gets me 4dsp blocks/FFT operation = 2048 DSP blocks, a total of ~43k. biggest Virtex FPGA has 2k blocks, or 20 operations/mux.

Obviously including such large muxes into the fabric is also going to take up slices. Where do I find the upper end of this limit? I can't infinitely share the FPGA resources. Is 41210 multipliers too big? How do I calculate what is too big?

I've also looked at other resources (Slices, Brams, etc). Radix-2 Lite also gives 4 x 18k brams/fft = 2048 brams biggest Xilinx FPGA contains 2128 Brams. very borderline. I'm concerned that my design is just too big.

UPDATE:

Some more info on the design itself. I can't go into detail, but here is what I can give:

Initial conditions -> 512 ffts -> 40k multipliers ---------|----> output data to host 

                 ^------re-calculate initial conditions----|

output datarate spec: "faster than the matlab simulation"

calculations wise, this is where I am:

FFT stage: easy. I can implement 1/2/4/8 FFTs, store the results in SDRAM and access later. Relatively small, even if it takes long, it's ok. using radix-2 lite I can get 2 DSP48Es and 2 18k BRAMS/FFT. streaming gives 6 DSP48Es 0BRAMS/FFT. in either case, 64 point FFT is small in FPGA resource terms.

Multipliers: this is my problem. The multiplication inputs are taken from either lookup tables or FFT data. It really is just a whole bunch of multiply-adds. There isn't much to optimise. Not a filter, but has characteristics similar to a filter.

Considering resource sharing on the FPGA, maths works out as follows:
One LUT-6 can be used as a 4-way mux. The formula for an N-way, M bit mux is as follows:

N*M/3 = number of luts, or N*M/12 = slices (4 LUTS/slice).

crunching the numbers for my implementation does not give good results. 90% of the virtix-6 family doesn't have enough slices to resource-share their DSPs in order to perform 40k operations.

Best Answer

I wonder if there is another way of looking at the problem?

Playing off your estimation of 512 FFT operations (64 point each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?

Now you have found an FFT core using 4 DSP units ... but how many clock cycles does it take per FFT? (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k Mac operations in 64 cycles - perhaps 1k MACs per cycle, with each MAC handling 42 operations.

Now it is time to look at the rest of the algorithm in more detail : identify not MACs but higher level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations, with reusability (e.g. filters with different selectable coefficient sets) and soon you may find relatively few multiplexers are required between relatively large cores...

Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher). Unrolling them, I could iteratively generate them without multiplication : I was quite pleased with myself the day I built a Difference Engine on FPGA!

Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.

Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...

(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)

Edit : re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs, or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input to each multiplier connects to an FFT output, the other input to a coefficient ROM (BlockRam implemented as a constant array).

Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other MPY input is now "merely" a matter of organising the correct ROM addresses at the correct time : an organisational problem, rather than a huge headache of MUXes.

On performance : I think Dave Tweed was being needlessly pessimistic - the FFT taking n*log(n) operations, but you get to choose O(n) butterfly units and O(logN) cycles, or O(logN) units and O(n) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...

Related Solutions

Electronic – Pipelining and clock frequency issue

Your design will not function correctly if it runs at 100 MHz but is only spec'd (by the tools) to run at 50 MHz. If it does, then it's a one-off miracle that wouldn't work when you make a change and rerun the tools. Don't do it. Don't even do it if your clock is 100 MHz and the tools tell you the design can run at 99.5 MHz.

To solve your problem you can either write a simple 'divide by power of 2' clock divider to reduce the clock frequency (something like this in Verilog):

reg [n:0] count; 

always @(posedge CLK_100) begin
  count <= count + 1;
end

BUFG bg_0 (.I(count[m]), .O(CLK_DIV));

(where 'm' <= 'n' and 'bufg' is a global clock buffer, and must be used for synchronous designs) or use a Digital Clock Manager (DCM).

Hopefully that solves your pipelining issues as well unless you absolutely have to run the entire design at 100 MHz. Other than pipelining you can consider using FIFOs if you have part of the design running at 50 MHz and the other at 100 MHz, but you'll have to say a bit more about what you're doing to get more meaningful help here.

Electronic – How to identify areas of a FPGA design that use the most resources and area

I cross posted this question on the Xilinx Forum here: http://forums.xilinx.com/t5/Implementation/How-to-determine-what-part-of-the-design-consumes-the-most/td-p/393247

This answer is largely based on the comments there. Thanks to Deepika, Sikta and Gabor.

First, enable 'Generate Detailed MAP Report' in the map process properties (-detail).

Then, open the Design Summary, and navigate to Module Level Utilization. Here is the complete hierarchy, showing exclusive and inclusive design utilization.

Each line will show a number pair like 0/5392. This means that that module contains zero of that specific element, but that module and all of its sub-modules contain a total of 5392 elements.

Here is my output (partially expanded)

When working on reducing the size, Gabor recommends switching to a larger FPGA in the synth tools so that it can completely map even when it's too large to fit in your current FPGA, and it will make the tools run faster.

Best Answer

Related Solutions

Electronic – Pipelining and clock frequency issue

Electronic – How to identify areas of a FPGA design that use the most resources and area

Related Topic