I wonder if there is another way of looking at the problem?
Playing off your estimation of 512 FFT operations (64 point each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?
Now you have found an FFT core using 4 DSP units ... but how many clock cycles does it take per FFT (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k MAC operations in 64 cycles - perhaps with 1k MAC units working in parallel, each handling 42 operations.
Now it is time to look at the rest of the algorithm in more detail : identify not MACs but higher level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations, with reusability (e.g. filters with different selectable coefficient sets) and soon you may find relatively few multiplexers are required between relatively large cores...
Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher). Unrolling them, I could iteratively generate them without multiplication : I was quite pleased with myself the day I built a Difference Engine on FPGA!
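As a minimal sketch of that difference-engine idea (not the design mentioned above, just an illustration), a quadratic y(n) = a*n*n + b*n + c can be stepped with two adders. The entity name quad_gen, the generics coef_a, coef_b, coef_c and the 32-bit width are all assumptions made for the example:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: outputs y(n) = coef_a*n*n + coef_b*n + coef_c
-- for n = 0, 1, 2, ... with no multiplier in the datapath.
entity quad_gen is
  generic(
    coef_a : integer := 3;
    coef_b : integer := -5;
    coef_c : integer := 7;
    width  : integer := 32
  );
  port(
    clk : in  std_logic;
    rst : in  std_logic;  -- synchronous reset, restarts at n = 0
    y   : out signed(width-1 downto 0)
  );
end quad_gen;

architecture rtl of quad_gen is
  -- Forward differences of the quadratic:
  --   d1(n) = y(n+1) - y(n)   = coef_a*(2n+1) + coef_b, so d1(0) = coef_a + coef_b
  --   d2    = d1(n+1) - d1(n) = 2*coef_a (constant)
  -- The products below are elaboration-time constants, not hardware multipliers.
  constant Y_INIT  : signed(width-1 downto 0) := to_signed(coef_c, width);
  constant D1_INIT : signed(width-1 downto 0) := to_signed(coef_a + coef_b, width);
  constant D2      : signed(width-1 downto 0) := to_signed(2*coef_a, width);
  signal y_reg  : signed(width-1 downto 0) := Y_INIT;
  signal d1_reg : signed(width-1 downto 0) := D1_INIT;
begin
  process(clk) is
  begin
    if rising_edge(clk) then
      if rst = '1' then
        y_reg  <= Y_INIT;
        d1_reg <= D1_INIT;
      else
        y_reg  <= y_reg + d1_reg;  -- y(n+1)  = y(n)  + d1(n)
        d1_reg <= d1_reg + D2;     -- d1(n+1) = d1(n) + d2
      end if;
    end if;
  end process;
  y <= y_reg;
end rtl;
```

With the default generics the output sequence starts 7, 5, 9, 19, which matches 3n*n - 5n + 7 for n = 0 to 3. Higher-order polynomials need one more difference register and adder per degree.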
Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.
Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...
(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)
Edit : re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs, or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input to each multiplier connects to an FFT output, the other input to a coefficient ROM (BlockRam implemented as a constant array).
Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other MPY input is now "merely" a matter of organising the correct ROM addresses at the correct time : an organisational problem, rather than a huge headache of MUXes.
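A rough sketch of that structure, assuming 16-bit signed data and an 8-entry ROM with placeholder coefficients. The names coef_mult, fft_data and rom_addr are hypothetical, and whether the ROM lands in BlockRam or distributed RAM depends on its size and the tool settings:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity coef_mult is
  port(
    clk      : in  std_logic;
    fft_data : in  signed(15 downto 0);   -- one FFT output sample per cycle
    rom_addr : in  integer range 0 to 7;  -- supplied by the sequencer
    product  : out signed(31 downto 0)
  );
end coef_mult;

architecture rtl of coef_mult is
  type Trom is array(0 to 7) of signed(15 downto 0);
  -- Placeholder coefficients, for illustration only.
  constant rom : Trom := (
    to_signed(1, 16),  to_signed(-2, 16), to_signed(3, 16),  to_signed(-4, 16),
    to_signed(5, 16),  to_signed(-6, 16), to_signed(7, 16),  to_signed(-8, 16)
  );
  signal coef   : signed(15 downto 0) := (others => '0');
  signal data_d : signed(15 downto 0) := (others => '0');
begin
  process(clk) is
  begin
    if rising_edge(clk) then
      coef    <= rom(rom_addr);   -- synchronous ROM read
      data_d  <= fft_data;        -- delay the data to line up with the coefficient
      product <= data_d * coef;   -- 16 x 16 -> 32 bit signed product
    end if;
  end process;
end rtl;
```

The only thing that changes between the different multiply steps is the rom_addr sequence, so the sequencing logic reduces to an address generator rather than a tree of multiplexers.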
On performance : I think Dave Tweed was being needlessly pessimistic. The FFT takes n*log(n) operations, but you get to choose O(n) butterfly units and O(log n) cycles, or O(log n) units and O(n) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...
Problem with Design #1
I have noticed that you must describe the two ports in two separate processes for XST to infer dual-port RAM; if you don't, you won't get the two ports. Separate processes are also how Xilinx suggests inferring dual-port RAM in the XST User Guide. Hence your Design #1 will only infer single-port RAM.
You can see my general VHDL for inferring dual-port RAM with XST at the bottom of this post. (Details: http://www.fpga-dev.com/infering-dual-port-blockram-with-xst/)
Problem with Design #2
In your Design #2, you register the address twice, probably unintentionally. Signal assignments made with <= take effect at the end of the process, not immediately. This code is equivalent to yours, only with simpler signal names:
    -- sequential context (A, B, C are signals):
    if rising_edge(clk) then
        B <= A;
        C <= B;
    end if;
Here C <= B; will not assign to C what was assigned to B on the previous line, since that assignment only takes effect at the end of the process. If the signals are bits and the stimulus is a pulse on A, this would be the result of the above code:
    clk  _|"|_|"|_|"|_|"|_|"|_|"|
    A    ______|"""|_____________
    B    __________|"""|_________
    C    ______________|"""|_____
Declaring B as a variable instead and assigning with := will assign immediately:
    -- sequential context (A, C are signals; B is variable):
    if rising_edge(clk) then
        B := A;
        C <= B;
    end if;
yielding
    clk  _|"|_|"|_|"|_|"|_|"|_|"|
    A    ______|"""|_____________
    B    __________|"""|_________
    C    __________|"""|_________
Inferring dual-port BlockRam with XST
(More details on this at http://www.fpga-dev.com/infering-dual-port-blockram-with-xst/.)
Below is my parameterized module for generic dual-port RAM. It will successfully infer dual-port RAM, as desired, with XST.
(Remove the write enable-signals and write logic to get ROM instead of RAM.)
Specify width and depth with the width and highAddr (one less than desired depth) generics.
    library IEEE;
    use IEEE.STD_LOGIC_1164.all;

    entity genRAM is
        generic(
            width    : integer;
            highAddr : integer -- highest address (= size-1)
        );
        port(
            -- Two sets of ports (A and B), each set having ports Address, Data in,
            -- Data out and Write enable:
            Aaddr : in  integer range 0 to highAddr := 0;
            ADI   : in  std_logic_vector(width-1 downto 0) := (others => '0');
            ADO   : out std_logic_vector(width-1 downto 0) := (others => '0');
            AWE   : in  std_logic := '0';
            Baddr : in  integer range 0 to highAddr := 0;
            BDI   : in  std_logic_vector(width-1 downto 0) := (others => '0');
            BDO   : out std_logic_vector(width-1 downto 0) := (others => '0');
            BWE   : in  std_logic := '0';
            clk   : in  std_logic
        );
    end genRAM;

    architecture arch of genRAM is
        subtype TmemWord is bit_vector(width-1 downto 0);
        type Tmem is array(0 to highAddr) of TmemWord;
        shared variable memory: Tmem;
    begin
        -- Port A: one process per port, so that XST infers a true dual-port RAM
        process(clk) is
        begin
            if (rising_edge(clk)) then
                ADO <= To_StdLogicVector(memory(Aaddr));
                if (AWE = '1') then
                    memory(Aaddr) := To_bitvector(std_logic_vector(ADI));
                end if;
            end if;
        end process;

        -- Port B
        process(clk) is
        begin
            if (rising_edge(clk)) then
                BDO <= To_StdLogicVector(memory(Baddr));
                if (BWE = '1') then
                    memory(Baddr) := To_bitvector(std_logic_vector(BDI));
                end if;
            end if;
        end process;
    end arch;
The code above implements read-first behavior. That means that if address 0x00 contains 0xcafe and you write 0xbabe to 0x00, the cycle after the write will display 0xcafe on the data-out port ("data is read to output port before being written to memory").
If you desire write-first behaviour, swap the order of the reading and writing in both processes. Below is how it would look for port A:
    -- excerpt for write-first behaviour:
    if (AWE = '1') then
        memory(Aaddr) := To_bitvector(std_logic_vector(ADI));
    end if;
    ADO <= To_StdLogicVector(memory(Aaddr));
In the above case, data-out would display 0xbabe one cycle after the write ("data is written to memory before reading memory contents to output port").
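For completeness, here is a hedged sketch of how the genRAM module from this answer might be instantiated as a 1024 x 16 memory, with port A used for writing and port B for reading. The wrapper entity ram_user and its port names are made up for the example, and genRAM is assumed to be compiled into the work library:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity ram_user is
  port(
    clk   : in  std_logic;
    waddr : in  integer range 0 to 1023;
    wdata : in  std_logic_vector(15 downto 0);
    we    : in  std_logic;
    raddr : in  integer range 0 to 1023;
    rdata : out std_logic_vector(15 downto 0)
  );
end ram_user;

architecture rtl of ram_user is
begin
  ram : entity work.genRAM
    generic map(
      width    => 16,
      highAddr => 1023   -- depth 1024, so the highest address is 1023
    )
    port map(
      clk   => clk,
      -- Port A: write side; its data output is not used
      Aaddr => waddr,
      ADI   => wdata,
      AWE   => we,
      ADO   => open,
      -- Port B: read side; BDI and BWE are left unconnected
      -- and keep their declared defaults (no writes on port B)
      Baddr => raddr,
      BDO   => rdata
    );
end rtl;
```

Leaving BDI and BWE unassociated relies on the default values declared on those ports, so port B never writes.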
Best Answer
Inferring DSP slices is actually pretty straightforward. The Spartan 6 has DSP48A1 DSP slices, so take a look at Xilinx UG389. Page 15 has a block diagram of the DSP slice. XST is quite good about inferring DSP slices. Just make sure to get all of the pipeline registers in there for maximum performance, and make sure all of your bit widths are no wider than those shown on the block diagram. Here is a simple multiplier with AXI stream interfaces that infers a DSP slice on a Spartan 6: https://github.com/alexforencich/verilog-dsp/blob/master/rtl/dsp_mult.v .
Also take a look at the XST user guide, UG627, pages 98-121. One rather annoying thing to note: the pipelined multipliers in that section will not synthesize to completely pipelined DSP48 slices (they will probably infer slices, but you will get a performance penalty as the registers will not necessarily be in the correct locations). For example, the coding examples and block diagram on pages 104-108 all show a multiplier with one pipeline register before and three after. When I first looked at that, I assumed that XST would be smart enough to move the registers to match the actual DSP slice (it is possible to move registers "through" the multiplier without changing the operation). It isn't. You should add registers (with only synchronous resets!) exactly as shown in the DSP slice manual in order for XST to infer a DSP slice properly with the pipeline registers in the right places for maximum performance. Note that these registers are implemented internally in the DSP slice; adding all of the pipeline registers shown in the DSP slice user guide will only result in a latency penalty, as they will not consume fabric flip-flops. I would recommend printing out the DSP slice block diagram and tacking it up on the wall as a reference. And don't forget to look at the synthesis logs to make sure the DSP slices are pulling in the pipeline registers correctly.
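As an illustration of that register placement, here is a minimal VHDL sketch of an 18 x 18 signed multiplier with one input register stage, a multiplier output register and a final output register, all with synchronous reset. It is not the linked dsp_mult.v module; the entity name dsp_mult_sketch and the choice of a single input stage are assumptions, and whether XST absorbs every register into the slice should be confirmed in the synthesis log:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity dsp_mult_sketch is
  port(
    clk : in  std_logic;
    rst : in  std_logic;              -- synchronous reset only
    a   : in  signed(17 downto 0);    -- 18-bit operands fit the slice multiplier
    b   : in  signed(17 downto 0);
    p   : out signed(35 downto 0)
  );
end dsp_mult_sketch;

architecture rtl of dsp_mult_sketch is
  signal a_reg, b_reg : signed(17 downto 0) := (others => '0');  -- input registers
  signal m_reg        : signed(35 downto 0) := (others => '0');  -- multiplier output register
  signal p_reg        : signed(35 downto 0) := (others => '0');  -- final output register
begin
  process(clk) is
  begin
    if rising_edge(clk) then
      if rst = '1' then
        a_reg <= (others => '0');
        b_reg <= (others => '0');
        m_reg <= (others => '0');
        p_reg <= (others => '0');
      else
        a_reg <= a;
        b_reg <= b;
        m_reg <= a_reg * b_reg;  -- registered product
        p_reg <= m_reg;          -- extra stage after the multiplier
      end if;
    end if;
  end process;
  p <= p_reg;
end rtl;
```

The result appears three clock cycles after the operands are presented. Using an asynchronous reset here instead would likely keep the registers out of the slice, which is the situation the paragraph above warns about.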
As far as a listing of documentation, there isn't one good place for everything (FPGAs, IP cores, software, etc.). For just the features of a single FPGA, take a look at the product page. For example, http://www.xilinx.com/products/silicon-devices/fpga/spartan-6.html#documentation . Make sure to select 'user guides', not 'datasheets'. That should give you a pretty comprehensive list of the Spartan 6 documentation.