I wonder if there is another way of looking at the problem?
Playing off your estimation of 512 FFT operations (64 point each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?
Now you have found an FFT core using 4 DSP units ... but how many clock cycles does it take per FFT (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k MAC operations in 64 cycles: perhaps 1k MAC units working in parallel, with each MAC unit handling 42 operations.
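To make that budget concrete, here is a rough back-of-envelope sketch. It assumes "42k" means 42,000 and that every MAC unit completes one operation per clock cycle; the function names are mine, not from any library.

```python
import math

def mac_units_needed(total_macs, cycles_available):
    """Fewest parallel MAC units that can finish total_macs within
    cycles_available, assuming one operation per unit per cycle."""
    return math.ceil(total_macs / cycles_available)

def ops_per_unit(total_macs, units):
    """Operations each unit must perform if the work is spread evenly."""
    return math.ceil(total_macs / units)

TOTAL_MACS = 42_000   # assumed reading of "42k"
FFT_CYCLES = 64       # assumed throughput: one cycle per point

min_units = mac_units_needed(TOTAL_MACS, FFT_CYCLES)
per_unit_at_1k = ops_per_unit(TOTAL_MACS, 1000)
```

Rounding the minimum up to about 1k units leaves each unit with 42 operations, comfortably inside the 64 available cycles.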
Now it is time to look at the rest of the algorithm in more detail : identify not MACs but higher level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations, with reusability (e.g. filters with different selectable coefficient sets) and soon you may find relatively few multiplexers are required between relatively large cores...
Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher). Unrolling them, I could iteratively generate them without multiplication : I was quite pleased with myself the day I built a Difference Engine on FPGA!
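As an illustration of the idea (not my original FPGA code), here is a minimal Python sketch of generating a quadratic by forward differences. The function name and coefficients are made up for the example; the point is that the loop body contains only additions, which is what maps onto two adders and two registers in hardware.

```python
def quadratic_by_differences(a, b, c, n):
    """Return [a*x*x + b*x + c for x in 0..n-1] using only additions
    inside the loop (forward differences, Difference Engine style)."""
    y = c            # f(0)
    d1 = a + b       # first difference f(1) - f(0)
    d2 = 2 * a       # second difference, constant for a quadratic
    out = []
    for _ in range(n):
        out.append(y)
        y += d1      # advance the function value
        d1 += d2     # advance the first difference
    return out

values = quadratic_by_differences(3, -2, 5, 6)
```

The setup constants (`a + b`, `2 * a`) are computed once outside the loop, so on an FPGA they would simply be register initial values. Higher-order polynomials work the same way with one extra adder and register per order.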
Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.
Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...
(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)
Edit : re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs, or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input to each multiplier connects to an FFT output, the other input to a coefficient ROM (BlockRam implemented as a constant array).
Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other MPY input is now "merely" a matter of organising the correct ROM addresses at the correct time : an organisational problem, rather than a huge headache of MUXes.
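A toy software model of that arrangement, under my own assumptions about the layout: the coefficient sets are stored back to back in one ROM, and a small schedule says which set each FFT pass should be multiplied by. All names here are hypothetical.

```python
def sequence_multiplier(fft_passes, coeff_rom, set_for_pass):
    """Model one multiplier fed by a shared FFT unit and a coefficient ROM.

    fft_passes   : list of FFT output blocks, in the order they leave the FFT
    coeff_rom    : flat list holding the coefficient sets back to back
    set_for_pass : which coefficient set each pass should use
    """
    products = []
    for p, outputs in enumerate(fft_passes):
        base = set_for_pass[p] * len(outputs)   # ROM base address for this pass
        for k, x in enumerate(outputs):
            # ROM address = base + bin index; no data-path multiplexer needed
            products.append(x * coeff_rom[base + k])
    return products

# Toy 4-point blocks; the thread is about 64-point FFTs.
rom = [1, 1, 1, 1,   2, 0, 2, 0]          # set 0, then set 1
result = sequence_multiplier([[1, 2, 3, 4], [5, 6, 7, 8]], rom, [1, 0])
```

Changing which coefficients meet which FFT outputs only changes `set_for_pass`, i.e. the address generator, not the wiring.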
On performance : I think Dave Tweed was being needlessly pessimistic. The FFT takes n*log(n) operations, but you get to choose O(n) butterfly units and O(log n) cycles, or O(log n) units and O(n) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...
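A quick sketch of that trade-off for a radix-2 FFT, which has (n/2)*log2(n) butterflies in total. It assumes an idealised schedule where every butterfly unit completes one butterfly per cycle and is never idle; real cores will differ.

```python
import math

def butterflies(n):
    """Total butterflies in a radix-2 FFT of length n (n a power of two)."""
    stages = n.bit_length() - 1          # log2(n)
    return (n // 2) * stages

def cycles(n, units):
    """Idealised cycle count with `units` butterfly units kept fully busy."""
    return math.ceil(butterflies(n) / units)

N = 64
wide = cycles(N, N // 2)     # one unit per butterfly in a stage
narrow = cycles(N, 6)        # one unit per stage (log2(64) = 6)
```

For a 64-point FFT this gives 6 cycles with 32 units, or 32 cycles with 6 units, the two ends of the range described above.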
Best Answer
No, you have misunderstood the concept of "aspect ratio".
Each chip has a fixed number of block RAMs in it; you can find this number in the datasheet.
Each of those block RAMs holds 18K bits, and can be configured to access those bits in a number of different ways, from 16K words of 1 bit each, to 512 words of 36 bits each. In the first case, you can read or write one bit at a time; in the latter case, you can read or write 36 bits at a time.
For your filter coefficients, you'll probably want to configure the block RAM as 1024 words of 18 bits each. You'll be able to have one filter for each block RAM in your device, and you'll be "wasting" most of the capacity of each block RAM.
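To put numbers on the aspect ratios and the "wasted" capacity, here is a small sketch. Only the 16K x 1, 1024 x 18 and 512 x 36 shapes come from the description above; the intermediate entries are an assumption based on typical 18 Kbit block RAMs (check your datasheet), and the 64-tap filter is purely hypothetical.

```python
# (depth in words, width in bits); intermediate entries are assumed
ASPECT_RATIOS = [(16384, 1), (8192, 2), (4096, 4),
                 (2048, 9), (1024, 18), (512, 36)]

def pick_config(words_needed, width_needed):
    """Narrowest configuration that is wide and deep enough, or None."""
    for depth, width in ASPECT_RATIOS:          # sorted narrow to wide
        if width >= width_needed and depth >= words_needed:
            return depth, width
    return None

def utilisation(words_needed, width_needed):
    """Fraction of the chosen configuration's bits actually used."""
    depth, width = pick_config(words_needed, width_needed)
    return (words_needed * width_needed) / (depth * width)

config = pick_config(64, 18)      # hypothetical 64-tap, 18-bit filter
used = utilisation(64, 18)
```

With these assumed numbers the filter lands in the 1024 x 18 shape and uses about 6% of the block, which is the waste the tricks below try to recover.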
There are a few tricks you can try to support more than one filter per block RAM:
If the addressing for all of the filters runs in lock-step, you could configure the block RAM as 512 words of 36 bits, and then split the 36-bit data bus into two 18-bit busses, sending the data to two separate filters.
If you don't otherwise need the dual-port feature of the block RAMs, you can use the second port to feed a second filter. Tie the address MSB low on one port and high on the other.
If both of the above conditions apply, you could conceivably use both techniques together to hold coefficients for four filters in each block RAM.
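The combined trick can be modelled in a few lines. This is only a software sketch of the data layout, assuming raw 18-bit coefficient patterns and at most 256 coefficients per filter; the function names are invented for the example.

```python
MASK18 = (1 << 18) - 1

def pack(lo, hi):
    """Pack two 18-bit values into one 36-bit word."""
    return ((hi & MASK18) << 18) | (lo & MASK18)

def build_rom(f0, f1, f2, f3):
    """512 x 36 contents: filters 0/1 in the lower half, 2/3 in the upper."""
    def coeff(f, i):
        return f[i] if i < len(f) else 0
    rom = [0] * 512
    for i in range(256):
        rom[i] = pack(coeff(f0, i), coeff(f1, i))
        rom[256 + i] = pack(coeff(f2, i), coeff(f3, i))
    return rom

def read_port(rom, msb, index):
    """One port read: address MSB is tied to `msb`, low 8 bits are the index.
    The 36-bit word is split into two 18-bit buses."""
    word = rom[(msb << 8) | (index & 0xFF)]
    return word & MASK18, word >> 18

def read_four(rom, index_a, index_b):
    """Port A (MSB low) feeds filters 0 and 1; port B (MSB high) feeds 2 and 3."""
    return read_port(rom, 0, index_a) + read_port(rom, 1, index_b)

rom = build_rom([1, 2], [3, 4], [5, 6], [7, 8])
taps = read_four(rom, 1, 0)
```

Note that the two filters sharing a port must step through their coefficients together, while the two ports can be addressed independently of each other.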