I wonder if there is another way of looking at the problem?
Playing off your estimate of 512 FFT operations (64 points each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?
Now you have found an FFT core using 4 DSP units ... but how many clock cycles does it take per FFT (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k MAC operations in 64 cycles - roughly 656 MACs per cycle; or, put another way, about 1k MAC units with each unit handling 42 operations over those 64 cycles.
Now it is time to look at the rest of the algorithm in more detail: identify not MACs but higher-level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations with reusability in mind (e.g. filters with selectable coefficient sets), and soon you may find relatively few multiplexers are required between relatively large cores...
Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher powers). Unrolling them, I could generate the values iteratively without any multiplication: I was quite pleased with myself the day I built a Difference Engine on an FPGA!
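To make the strength-reduction idea concrete, here is a minimal sketch of the quadratic case (entity and signal names are my own, not from the original design): because the second difference of n² is the constant 2, each new square needs only two additions.

```vhdl
-- Hypothetical sketch: generate y = n*n for n = 0, 1, 2, ...
-- with no multiplier, using forward differences.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity square_gen is
  port (
    clk : in  std_logic;
    rst : in  std_logic;
    y   : out unsigned(31 downto 0)  -- current value of n*n
  );
end entity;

architecture rtl of square_gen is
  signal acc  : unsigned(31 downto 0) := (others => '0');    -- n*n
  signal diff : unsigned(31 downto 0) := to_unsigned(1, 32); -- first difference 2n+1
begin
  y <= acc;

  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        acc  <= (others => '0');
        diff <= to_unsigned(1, 32);
      else
        acc  <= acc + diff;  -- (n+1)^2 = n^2 + (2n+1)
        diff <= diff + 2;    -- second difference of n^2 is constant: 2
      end if;
    end if;
  end process;
end architecture;
```

The sequence comes out 0, 1, 4, 9, 16, ... one square per clock, using two adders and no DSP slices; higher-order polynomials just need one more accumulator per degree.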
Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.
Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...
(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)
Edit: re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input of each multiplier connects to an FFT output, the other to a coefficient ROM (a BlockRAM inferred from a constant array).
Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other multiplier input is now "merely" a matter of presenting the correct ROM addresses at the correct time: an organisational problem, rather than a huge headache of MUXes.
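A hedged sketch of that coefficient ROM (depth, widths and names are assumptions for illustration): a constant array read synchronously is the usual way to get the tools to infer a BlockRAM.

```vhdl
-- Sketch: coefficient ROM feeding the second multiplier input.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity coeff_rom is
  port (
    clk  : in  std_logic;
    addr : in  unsigned(9 downto 0);          -- sequenced by the same control as the FFT
    dout : out std_logic_vector(17 downto 0)  -- goes to one multiplier input
  );
end entity;

architecture rtl of coeff_rom is
  type rom_t is array (0 to 1023) of std_logic_vector(17 downto 0);
  -- Placeholder contents; a real design would fill these from the coefficient sets.
  constant ROM : rom_t := (others => (others => '0'));
begin
  process(clk)
  begin
    if rising_edge(clk) then
      dout <= ROM(to_integer(addr));  -- registered read lets synthesis infer BlockRAM
    end if;
  end process;
end architecture;
```

Switching between coefficient sets then becomes part of the address sequencing, not extra multiplexers.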
On performance: I think Dave Tweed was being needlessly pessimistic - the FFT takes n*log(n) operations, but you get to choose O(n) butterfly units and O(log n) cycles, or O(log n) units and O(n) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...
If a, b and c are of type std_logic_vector(31 downto 0), then c := a + b; will give the 32-bit answer in c (without carry), as you required.
If you want the 33-bit answer in c (where c is std_logic_vector(32 downto 0)), then c := ('0' & a) + ('0' & b); will give the 33-bit answer.
But you will need the ieee.std_logic_unsigned package to add std_logic_vector operands with the + operator.
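Pulled together as a self-contained sketch (entity and port names are my own, and note that ieee.std_logic_unsigned is a non-standard package; ieee.numeric_std with unsigned types is the usual modern alternative):

```vhdl
-- Sketch of the 33-bit case: zero-extend both operands so the carry survives.
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;  -- provides "+" on std_logic_vector (non-standard)

entity adder33 is
  port (
    a, b : in  std_logic_vector(31 downto 0);
    sum  : out std_logic_vector(32 downto 0)  -- 33 bits: carry kept in bit 32
  );
end entity;

architecture rtl of adder33 is
begin
  sum <= ('0' & a) + ('0' & b);  -- concatenation widens each operand to 33 bits
end architecture;
```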
Best Answer
I've done this a few times myself.
Generally, the design tools will choose between a fabric implementation and a DSP slice based on the synthesis settings.
For instance, for Xilinx ISE, in the synthesis process settings under HDL Options, there is a setting "-use_dsp48" with the options Auto, AutoMax, Yes and No. As you can imagine, this controls how hard the tools try to place DSP slices. I once had a problem where I multiplied an integer by 3, which inferred a DSP slice - except I was already using every DSP slice in the chip through manual instantiation, so synthesis failed! I had to change the setting to No.
Here is a rule of thumb (that I just made up): if your design is clocked at less than 50 MHz, and you're probably going to use less than 50% of the DSP slices in the chip, then just use the *, + and - operators. This will infer DSP slices with no pipeline registers, which really limits the top speed. (I have no idea what happens when you use division.)
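For the low-speed case, inference really is just the operator. A minimal sketch (widths and names assumed; 18x18 matches the DSP slice's native multiplier width):

```vhdl
-- Sketch: combinational multiply, typically mapped to one DSP slice
-- but with no pipeline registers, so Fmax is limited.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mul_simple is
  port (
    a, b : in  signed(17 downto 0);
    p    : out signed(35 downto 0)  -- full-width product
  );
end entity;

architecture rtl of mul_simple is
begin
  p <= a * b;  -- numeric_std "*" returns a'length + b'length bits
end architecture;
```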
However, if it looks like you're going to run the slices closer to the maximum speed of the DSP slice (333 MHz for Spartan 6, normal speed grade), or you're going to use all of the slices, you should infer them manually.
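Before reaching for the instantiation template, one middle ground worth trying (my addition, a hedged sketch with assumed names): register the operands and the product around the operator, which usually lets the tools absorb those registers into the DSP48's internal pipeline stages.

```vhdl
-- Sketch: pipelined multiply so synthesis can use the DSP48's
-- input, M and P registers, allowing much higher clock rates.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mul_pipelined is
  port (
    clk  : in  std_logic;
    a, b : in  signed(17 downto 0);
    p    : out signed(35 downto 0)
  );
end entity;

architecture rtl of mul_pipelined is
  signal a_r, b_r : signed(17 downto 0);
  signal p_r      : signed(35 downto 0);
begin
  process(clk)
  begin
    if rising_edge(clk) then
      a_r <= a;          -- input registers -> DSP48 A/B registers
      b_r <= b;
      p_r <= a_r * b_r;  -- registered product -> DSP48 M register
      p   <= p_r;        -- extra stage -> DSP48 P register
    end if;
  end process;
end architecture;
```

The result appears three cycles after the operands; whether all three stages actually land inside the slice depends on the tools, so check the post-map report.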
In this case, you have two options.
Option 1: manually use the raw DSP instantiation template. Option 2: use an IP block from Xilinx Core Generator. (I would use this option; at the same time, you will learn all about Core Generator, which will help in the future.)
Before you do either of these, read the first couple of pages of the DSP slice user guide. In the case of the Spartan 6, (DSP48A1), that would be Xilinx doc UG389: http://www.xilinx.com/support/documentation/user_guides/ug389.pdf
Consider the Core Generator option first. I usually create a testing project in Core Generator for the part I'm working with, where I create any number of IP blocks just to learn the system. Then, when I'm ready to add one to my design in ISE, I right click in the Design Hierarchy, click new source, and select "IP (CORE Generator & Architecture Wizard)" so that I can edit and regenerate the block directly from my project.
In Core gen, take a look at the different IP blocks you can choose from - there are a few dozen, most of which are pretty cool.
The Multiplier Core is what you should look at first. Check out every page, and click the datasheet button. The important parts are the integer bit widths, the pipeline stages (latency) and any control signals. This produces the simplest possible block by taking away all the ports you don't need.
When I was building a 5 by 3 order IIR filter last year, I had to use the manual instantiation template since I was building a very custom implementation, with 2 DSP slices clocked 4x faster than the sample rate. It was a total pain.