Electronic – FPGA firmware design: How big is too big

fpgavhdlxilinx

I have a particularly large signal processing transform that needs to be ported from matlab to VHDL. It definitely requires some kind of resource sharing. A bit of calculation gave me the following:

  • 512 ffts of 64-points
  • 41210 multiply-add operations

Considering the largest Virtex 6 FPGA has ~2000 DSP48E blocks, I know that I can resource-share in order to re-use the resources multiple times. Execution time isn't really an issue, the processing time can take relatively long in FPGA terms.

Looking at resource usage, using radix-2 lite architecture gets me 4dsp blocks/FFT operation = 2048 DSP blocks, a total of ~43k. biggest Virtex FPGA has 2k blocks, or 20 operations/mux.

Obviously including such large muxes into the fabric is also going to take up slices. Where do I find the upper end of this limit? I can't infinitely share the FPGA resources. Is 41210 multipliers too big? How do I calculate what is too big?

I've also looked at other resources (Slices, Brams, etc). Radix-2 Lite also gives 4 x 18k brams/fft = 2048 brams biggest Xilinx FPGA contains 2128 Brams. very borderline. I'm concerned that my design is just too big.


UPDATE:

Some more info on the design itself. I can't go into detail, but here is what I can give:

Initial conditions -> 512 ffts -> 40k multipliers ---------|----> output data to host 

                 ^------re-calculate initial conditions----|

output datarate spec: "faster than the matlab simulation"

calculations wise, this is where I am:

FFT stage: easy. I can implement 1/2/4/8 FFTs, store the results in SDRAM and access later. Relatively small, even if it takes long, it's ok. using radix-2 lite I can get 2 DSP48Es and 2 18k BRAMS/FFT. streaming gives 6 DSP48Es 0BRAMS/FFT. in either case, 64 point FFT is small in FPGA resource terms.

Multipliers: this is my problem. The multiplication inputs are taken from either lookup tables or FFT data. It really is just a whole bunch of multiply-adds. There isn't much to optimise. Not a filter, but has characteristics similar to a filter.

Considering resource sharing on the FPGA, maths works out as follows:
One LUT-6 can be used as a 4-way mux. The formula for an N-way, M bit mux is as follows:

N*M/3 = number of luts, or N*M/12 = slices (4 LUTS/slice).

crunching the numbers for my implementation does not give good results. 90% of the virtix-6 family doesn't have enough slices to resource-share their DSPs in order to perform 40k operations.

Best Answer

I wonder if there is another way of looking at the problem?

Playing off your estimation of 512 FFT operations (64 point each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?

Now you have found an FFT core using 4 DSP units ... but how many clock cycles does it take per FFT? (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k Mac operations in 64 cycles - perhaps 1k MACs per cycle, with each MAC handling 42 operations.

Now it is time to look at the rest of the algorithm in more detail : identify not MACs but higher level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations, with reusability (e.g. filters with different selectable coefficient sets) and soon you may find relatively few multiplexers are required between relatively large cores...

Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher). Unrolling them, I could iteratively generate them without multiplication : I was quite pleased with myself the day I built a Difference Engine on FPGA!

Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.

Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...

(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)

Edit : re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs, or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input to each multiplier connects to an FFT output, the other input to a coefficient ROM (BlockRam implemented as a constant array).

Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other MPY input is now "merely" a matter of organising the correct ROM addresses at the correct time : an organisational problem, rather than a huge headache of MUXes.

On performance : I think Dave Tweed was being needlessly pessimistic - the FFT taking n*log(n) operations, but you get to choose O(n) butterfly units and O(logN) cycles, or O(logN) units and O(n) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...