This file is commonly called a "bitstream". Bitstreams for Xilinx devices have the extension '.bit' and are generated by a program called 'bitgen'. By default bitgen produces a binary '.bit' file; if you want an ASCII representation of the bitstream, run
bitgen -b <your design>.ncd
and then a '.rbt' file will be generated in addition to the '.bit' file.
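If you then want to turn the ASCII form back into raw bytes for inspection, something like the following Python sketch may help. It assumes the '.rbt' file is a short text header followed by lines made up only of the characters 0 and 1 (a whole number of bytes per line); check that against your own file. The function name and the sample header lines are made up for illustration.

```python
def rbt_lines_to_bytes(lines):
    """Convert the bit lines of an ASCII bitstream into raw bytes.

    Assumption: every line that is not purely '0'/'1' characters is
    header text and can be skipped.
    """
    data = bytearray()
    for line in lines:
        bits = line.strip()
        if not bits or set(bits) - {"0", "1"}:
            continue  # header or blank line
        if len(bits) % 8:
            raise ValueError("bit line is not a whole number of bytes")
        data += int(bits, 2).to_bytes(len(bits) // 8, "big")
    return bytes(data)


# Made-up sample: two header lines, then two 32-bit words.
sample = [
    "Xilinx ASCII Bitstream",
    "Bits:\t64",
    "11111111111111111111111111111111",
    "10101010100110010101010101100110",
]
print(rbt_lines_to_bytes(sample).hex())
```

For a real file you would pass the open file object (for example `open("design.rbt")`, where the filename is a placeholder) instead of the sample list.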
For further information about generating Xilinx bitstreams, see Chapter 15 in this user guide for ISE version 12.4:
http://www.xilinx.com/support/documentation/sw_manuals/xilinx12_4/devref.pdf
For more information about Virtex 4 bitstreams specifically, see the configuration guide for that device:
http://www.xilinx.com/support/documentation/user_guides/ug071.pdf
I wonder if there is another way of looking at the problem?
Playing off your estimation of 512 FFT operations (64 point each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?
Now you have found an FFT core using 4 DSP units ... but how many clock cycles does it take per FFT (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k MAC operations in 64 cycles - perhaps with 1k MAC units, each unit handling 42 operations.
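As a rough check on that budget, here is a small Python sketch. It assumes "42k" means 42 × 1024 operations and that the FFT core really does deliver one point per cycle; both are guesses for illustration, and the function name is hypothetical.

```python
import math


def mac_budget(total_macs, cycles_available, mac_units):
    """Return (minimum MAC units to finish in time, operations per unit).

    Assumes every MAC unit completes one operation per clock cycle.
    """
    min_units = math.ceil(total_macs / cycles_available)
    ops_per_unit = math.ceil(total_macs / mac_units)
    return min_units, ops_per_unit


total = 42 * 1024  # assumed meaning of "42k"
min_units, ops_per_unit = mac_budget(total, cycles_available=64, mac_units=1024)
print(min_units, ops_per_unit)
```

With these numbers, 1k units each do 42 operations, which fits inside the 64-cycle window with some slack; the bare minimum would be 672 units kept busy on every cycle.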
Now it is time to look at the rest of the algorithm in more detail : identify not MACs but higher level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations, with reusability (e.g. filters with different selectable coefficient sets) and soon you may find relatively few multiplexers are required between relatively large cores...
Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher). Unrolling them, I could iteratively generate them without multiplication : I was quite pleased with myself the day I built a Difference Engine on FPGA!
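To illustrate the kind of strength reduction meant here, this is a minimal Python sketch of the forward-difference trick for a quadratic. The function name and the coefficients are made up; in hardware the loop body would be two adders and two registers.

```python
def quadratic_by_differences(a, b, c, count):
    """Yield a*n*n + b*n + c for n = 0 .. count-1.

    The loop uses additions only; the one multiplication by 2 is a
    setup constant (a shift in hardware).
    """
    value = c            # f(0)
    first_diff = a + b   # f(1) - f(0)
    second_diff = 2 * a  # constant second difference of a quadratic
    for _ in range(count):
        yield value
        value += first_diff
        first_diff += second_diff


print(list(quadratic_by_differences(3, -2, 5, 6)))
```

Higher-order polynomials work the same way with one extra difference register per degree.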
Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.
Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...
(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)
Edit : re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs, or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input to each multiplier connects to an FFT output, the other input to a coefficient ROM (BlockRam implemented as a constant array).
Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other MPY input is now "merely" a matter of organising the correct ROM addresses at the correct time : an organisational problem, rather than a huge headache of MUXes.
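A behavioural sketch of that arrangement in Python, assuming one coefficient set per FFT pass stored back to back in the ROM (the address layout and all names are assumptions for illustration, not taken from your design):

```python
def sequence_multiplies(fft_frames, coeff_rom, points=64):
    """Multiply each FFT output by the coefficient at a computed ROM address.

    fft_frames: outputs of successive passes through the one shared FFT unit.
    coeff_rom:  flat list, `points` coefficients per pass (assumed layout).
    """
    products = []
    for frame_index, frame in enumerate(fft_frames):
        for point_index, sample in enumerate(frame):
            # The only "routing" is this address calculation.
            rom_address = frame_index * points + point_index
            products.append(sample * coeff_rom[rom_address])
    return products


# Tiny example with 2-point "FFTs" so the numbers are easy to follow.
frames = [[1, 2], [3, 4]]
rom = [10, 20, 30, 40]
print(sequence_multiplies(frames, rom, points=2))
```

The single multiplier never needs an input MUX: which coefficient it sees is decided entirely by the address counter.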
On performance : I think Dave Tweed was being needlessly pessimistic. The FFT takes n*log(n) operations, but you get to choose O(n) butterfly units and O(log n) cycles, or O(log n) units and O(n) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...
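To put numbers on that trade-off for a 64-point FFT, here is a rough Python sketch. It assumes a radix-2 FFT, a power-of-two size, and one butterfly per unit per cycle, and it ignores pipeline latency and memory bandwidth; the function name is hypothetical.

```python
import math


def fft_schedule(n, butterfly_units):
    """Return (total radix-2 butterflies, cycles needed with the given units)."""
    stages = n.bit_length() - 1      # log2(n) for a power-of-two n
    butterflies = (n // 2) * stages  # n/2 butterflies in each stage
    cycles = math.ceil(butterflies / butterfly_units)
    return butterflies, cycles


for units in (32, 6, 1):
    print(units, fft_schedule(64, units))
```

So 32 units finish in 6 cycles, 6 units in 32 cycles, and a single unit in 192 cycles: the same 192 butterflies, just spread differently over area and time.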
Best Answer
A half-latch is a gate with positive feedback implemented with a weak pull-up transistor:
[Schematic: half-latch circuit, created using CircuitLab]
When the input is actively driven, it overrides the signal coming from the weak pullup. When the input is in Z-state, the weak pullup can keep the logical "1" at the input (and "0" at the output) indefinitely. It will not keep the opposite state reliably, hence "half-latch".
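The behaviour described above can be captured in a small behavioural model. This Python sketch assumes an inverter whose output switches a weak pull-up on its own input, and it treats the unheld state as "unknown" (`None`); both choices, and the function name, are assumptions made for illustration.

```python
def half_latch_step(driven, node):
    """Advance the half-latch by one step.

    driven: 0 or 1 when the input is actively driven, None for high-Z.
    node:   previous value of the input node (0, 1, or None for unknown).
    Returns (new node value, inverter output).
    """
    if driven is not None:
        node = driven  # an active driver overrides the weak pull-up
    elif node != 1:
        node = None    # pull-up is off, so the floating node is not held
    # Otherwise the node is 1, the output is 0, and the pull-up keeps it there.
    output = None if node is None else 1 - node
    return node, output


node, out = half_latch_step(1, None)     # drive high
node, out = half_latch_step(None, node)  # release: the 1 is held
print(node, out)
node, out = half_latch_step(0, node)     # drive low
node, out = half_latch_step(None, node)  # release: the 0 is not held
print(node, out)
```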
Why would someone want a half-latch instead of a full latch? For some signals it doesn't make sense to be able to store both constants. For example, a D flip-flop can have its enable input only latched high and its reset input only latched low; otherwise it will simply be eliminated during synthesis. That's the kind of signal for which half-latches are used: they are either latched to a default value or driven by interconnect.