Newly written data at the rising edge is available directly after this edge only at the same port: the data input is internally forwarded to the data output of the same RAM port. This is also called WRITE_FIRST mode.
But it is never forwarded to the output of the other RAM port, regardless of the specified WRITE_MODE. There it becomes available for reading (at a later rising edge, of course) after the internal write to the memory has completed. In your example that is simply the next rising clock edge, because the internal write time is always shorter (faster) than the minimum allowed clock period.
This behavior is described in XAPP 463, Using Block RAM in Spartan-3 Generation FPGAs, in the section "Dual-Port RAM Conflicts and Resolution". The example given there uses different clocks, but it also applies when the same clock is used for both ports.
This behaviour is still the same in current FPGAs from Xilinx and Altera.
The forwarding to the other RAM port has to be implemented on your own with surrounding logic.
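The semantics above can be sketched with a small behavioural model in Python (the class and signal names are made up for illustration; this follows the answer's description of WRITE_FIRST with a shared clock, not vendor code):

```python
# Behavioural sketch of a true dual-port RAM in WRITE_FIRST mode,
# both ports clocked by the same rising edge.

class DualPortRamWriteFirst:
    def __init__(self, depth):
        self.mem = [0] * depth
        self.dout_a = 0
        self.dout_b = 0

    def clock(self, addr_a, we_a, din_a, addr_b, we_b, din_b):
        """One shared rising clock edge for both ports."""
        # Each port sees the *old* memory content first ...
        read_a = self.mem[addr_a]
        read_b = self.mem[addr_b]
        # ... then the writes take effect.
        if we_a:
            self.mem[addr_a] = din_a
        if we_b:
            self.mem[addr_b] = din_b
        # WRITE_FIRST: the writing port forwards its own data input
        # to its own data output ...
        self.dout_a = din_a if we_a else read_a
        # ... but the other port never sees the new data in the same
        # cycle, regardless of WRITE_MODE.
        self.dout_b = din_b if we_b else read_b

ram = DualPortRamWriteFirst(16)
# Port A writes 0xAB to address 3; port B reads address 3 in the same cycle.
ram.clock(addr_a=3, we_a=True, din_a=0xAB, addr_b=3, we_b=False, din_b=0)
assert ram.dout_a == 0xAB   # same port: new data forwarded
assert ram.dout_b == 0x00   # other port: still the old content
# One clock edge later the internal write has completed, so port B
# now reads the new data.
ram.clock(addr_a=0, we_a=False, din_a=0, addr_b=3, we_b=False, din_b=0)
assert ram.dout_b == 0xAB
```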
You have a module which must process the data multiple times in a row to complete the conversion. In the 'pipelined' example, you simply feed the data through each stage in turn. This has some latency, as the data takes some time to travel from the input to the output. However, if the blocks are truly pipelined, this is just latency - you can feed in a new data word on each cycle, and there will be multiple samples in the pipeline at any given point, giving the same throughput in the end.
Your 'parallel' case isn't really parallel. It is basically the same pipelined case, but instead of sticking the blocks one after another, you end up with extra logic to distribute the incoming data to each block, and then presumably each block has more logic to feed its output back through itself enough times to complete the conversion. At the end you then have to recombine them all. It is basically an ugly way of doing a pipelined calculation.
I am not sure where you get the idea that your pipeline must have 2, 4, 8, or 16 units. If you need to process the data through the module 7 times, you simply stick 7 in a row in the pipeline - each one operates on the output of the last, so it doesn't matter that the pipeline isn't a power of two in length.
A truly parallel version would be one where the calculation can be split into partial operations. Say, for example, you wanted to multiply two 16-bit numbers, but only had an 8x8 multiplier block which takes one clock cycle to complete. You could stick 4 in series with some accumulation (this would be a pipelined operation), or you could add multiple instances of the multiplier and put them in parallel. In parallel the result would have 1 clock cycle of latency; pipelined (in series) it would have 4 cycles of latency. This comes at the cost of using 4 times as much logic.
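The decomposition behind that example can be checked in a few lines of Python (a sketch of the arithmetic only; function names are made up):

```python
# Decomposing a 16x16 multiplication into four 8x8 partial products.
# In hardware the four multiplies could run in four parallel 8x8
# multiplier blocks (one cycle) or be fed through multipliers in
# sequence (more cycles); the arithmetic is the same either way.

def mul16_from_8x8(a, b):
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    # The four independent 8x8 partial products:
    p_hh = a_hi * b_hi
    p_hl = a_hi * b_lo
    p_lh = a_lo * b_hi
    p_ll = a_lo * b_lo
    # Shift and accumulate the partial products into the 32-bit result.
    return (p_hh << 16) + ((p_hl + p_lh) << 8) + p_ll

assert mul16_from_8x8(0x1234, 0xABCD) == 0x1234 * 0xABCD
```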
Another example of true parallel behaviour is if you need to process multiple words at once, and faster than one block could handle. Say you had a block which took a data word and encrypted it. The block can only handle one word at its input in each clock cycle. Now what if your incoming data stream consists of four words which arrive all in the same clock cycle, but your encryption block can only handle one at a time. The throughput of the encryption module is 1/4 of what is required. Now if you put four blocks in parallel, you can now process each of the four words at the same time allowing for the required throughput - again at the expense of requiring four times as much logic.
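A minimal sketch of that situation, assuming a hypothetical `encrypt_word` stand-in for the real encryption core (not an actual cipher):

```python
# Four parallel instances of a block that can only accept one word
# per clock cycle. 'encrypt_word' is a placeholder transform.

def encrypt_word(word, key):
    return (word ^ key) & 0xFFFF  # stand-in for one encryption pass

def cycle(words, key):
    # All four words that arrive in the same clock cycle are handed to
    # four separate block instances, restoring the required throughput
    # at the cost of four times the logic.
    return [encrypt_word(w, key) for w in words]

assert cycle([1, 2, 3, 4], 0xFFFF) == [0xFFFE, 0xFFFD, 0xFFFC, 0xFFFB]
```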
There is one case where the second approach is actually justified. Say you need to process each word through the calculation stage 8 times, but because of the size of the calculation you only have room in the FPGA for, say, 3 passes; then you need a way to reuse resources. You are trying to break up the calculation to use fewer blocks.
In this situation, yes, having logic to feed back through the same block multiple times is quite advantageous. It allows you to reuse the same module and thus use far fewer logic resources for the calculation. However, this comes at the expense of throughput. If you need to feed the same data word through the same block 8 times, your throughput is reduced to one eighth - because while you are doing that, no new words can enter the block.
Having room for additional blocks (say 3) would allow you to perform the calculation in parallel for three data words at a time. You instantiate three copies of your single block circuit and add some additional logic to determine when it is time for a new word to enter each of the blocks. This in turn gains back some performance - it is now 3/8ths of what it was.
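The throughput arithmetic from the last two paragraphs can be written down directly (a sketch; the function name is made up):

```python
# Throughput of N block instances, each reusing itself for P feedback
# passes per word: each block accepts a new word every P cycles, so
# N blocks together accept N words every P cycles.

def throughput(n_blocks, passes_per_word):
    return n_blocks / passes_per_word  # in words per clock cycle

assert throughput(1, 8) == 1 / 8   # single block, 8 feedback passes
assert throughput(3, 8) == 3 / 8   # three blocks working in parallel
```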
I can update the answer with some diagrams if needed, but hopefully the explanation is fairly clear.
Best Answer
Well, there are at least two scenarios where I would opt for manual retiming:
Where I know there is a specific optimal geometry - for example, a logic tree - and I don't want to leave it to the synthesizer, since it could make a suboptimal choice.
Synthesis run times can be long. I may prefer to make these decisions myself instead of letting the tool make them, since otherwise I may have to check what it did and possibly rerun synthesis.