CPUs are sequential processing devices: they break an algorithm into a sequence of operations and execute them one at a time.
FPGAs are (or can be configured as) parallel processing devices. An entire algorithm might execute in a single clock tick or, at worst, in far fewer clock ticks than a sequential processor needs. One cost of the increased logic complexity is typically a lower maximum clock rate.
Bearing the above in mind, FPGAs can outperform CPUs at certain tasks because they can do the same task in fewer clock ticks, albeit at a lower overall clock rate. The achievable gains depend strongly on the algorithm, but at least an order of magnitude is not atypical for something like an FFT.
Further, because you can build multiple parallel execution units into an FPGA, if you have a large volume of data to pass through the same algorithm, you can distribute the data across those units and obtain further orders of magnitude higher throughput than even a multi-core CPU can achieve.
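As a back-of-envelope illustration of the throughput argument above (a sketch in Python; all figures are invented for illustration, not measurements of any real device):

```python
def throughput(clock_hz, cycles_per_result, parallel_units=1):
    """Results per second for a device with the given parameters."""
    return clock_hz / cycles_per_result * parallel_units

# Hypothetical numbers: a 3 GHz CPU needing 1000 cycles per result
# vs. a 200 MHz FPGA producing one result per cycle in 4 pipelines.
cpu  = throughput(3_000_000_000, cycles_per_result=1000)
fpga = throughput(200_000_000, cycles_per_result=1, parallel_units=4)

print(f"speedup: {fpga / cpu:.0f}x")  # ~267x despite the 15x slower clock
```

The point is not the specific numbers but the shape of the trade: fewer cycles per result and more parallel units can outweigh a much slower clock.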
The price you pay for these advantages is power consumption and $$$.
So I think I found some answers to the problem and want to share them.
I started to simulate the GTXE2_CHANNEL hard macro. The simulation misbehaves in exactly the same way as the hardware. So I tried to simulate the MGT in Verilog, using an instance template from here:
http://forums.xilinx.com/t5/7-Series-FPGAs/Using-v7gtx-as-sata-host-PHY-and-there-is-issue-bout-ALIGN/td-p/374203
This template simulates ElectricalIdle conditions and OOB sequences nearly correctly. So I started to diff the two solutions:
TXPDELECIDLEMODE, the port that selects the behavior of TXElectricalIDLE, is not working as expected. So now I'm using the synchronous mode.
PCS_RSVD_ATTR is an unconstrained bit_vector generic of 48 bits. If you look into the wrapper code of the secureip GTXE2_CHANNEL component, you will find a conversion chain bit_vector => std_logic_vector => string. Internally, all generics are treated as DOWNTO-ranged, so it's important to pass DOWNTO-ranged constants to the GTXE2 generics!
So now you could ask: why was I using TO-ranged constants and generics in the first place?
Xilinx ISE, up to the latest version 14.7, has a major bug in handling vectors of user-defined types in unconstrained generics. The default direction of vectors is TO. If you pass DOWNTO-ranged vectors of enums to an unconstrained generic of a component, ISE reverses the vector elements and "emits" a TO-ranged vector inside the component!!
This is especially "funny" if the design hierarchy that uses this generic is not a balanced tree...
If you use enums of two elements, the problem does not occur -> maybe such an enum is mapped to a boolean.
Which bugs are left?
- TXComFinish still does not acknowledge the sent OOB sequences.
- I have to verify these two bug fixes in synthesis and measure the OOB sequences with a scope - this may take some days :)
Edit 1 - more bugs:
There is another bug in the reset behavior of the GTXE2. If the GTXE2 is used with the output clock dividers set to 1 (TX_ and RX_RateSelection = "000"), then the GTXE2 boots up and emits only 3 clock cycles (with a wrong clock period) on TX_OutClock. After that, TX_OutClock is 'X'. If you reset the GTXE2 after that wrong output, it boots up a second time with no error and a correct clock on TX_OutClock.
In addition to this bug, the GTXE2 ignores all asserted resets (CPLL as well as TX/RX_RESETs) until 'X' can be seen on TX_OutClock. So you MUST wait circa 2.5 µs before issuing the reset.
If you use clock dividers of 2 or 4 (8 and 16 are not tested yet), this problem does not occur.
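The workaround for the ignored resets amounts to a small power-up supervisor that counts cycles and only issues the reset after the guard time has elapsed. A minimal sketch of the idea in Python (the 2.5 µs figure is from above; the 125 MHz free-running clock is an assumption):

```python
GUARD_TIME_NS   = 2500  # ~2.5 us: GTXE2 ignores resets before this point
CLOCK_PERIOD_NS = 8     # assumed 125 MHz free-running clock

def guard_cycles():
    """Clock cycles to wait after power-up before the reset may be issued."""
    return -(-GUARD_TIME_NS // CLOCK_PERIOD_NS)  # ceiling division

class ResetSupervisor:
    """Counts cycles after power-up and asserts the GTXE2 reset for one
    cycle once the guard time has passed."""
    def __init__(self):
        self.count = 0
        self.done = False

    def tick(self):
        """Advance one clock cycle; returns True on the cycle the reset fires."""
        if self.done:
            return False
        self.count += 1
        if self.count >= guard_cycles():
            self.done = True
            return True
        return False
```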
Edit 2 - problems solved:
Solution for Bug 1:
I added a timeout counter whose timeout depends on the current generation (clock frequency) and on the COM sequence currently being sent. If the timeout is reached, I generate my own TXComFinished strobe. Don't OR the timeout signal with the original TXComFinished signal from the GTX, because sometimes that signal is high while COMWAKE is being sent, although the finished strobe still belongs to the previous COMRESET sequence!
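The timeout logic can be modeled roughly as follows (a Python sketch, not the actual VHDL; the OOB durations and the safety margin are placeholder assumptions, not figures from the SATA spec):

```python
# Placeholder OOB sequence durations in nanoseconds (assumptions).
OOB_DURATION_NS = {"COMRESET": 2240, "COMWAKE": 1120}

def timeout_cycles(sequence, clock_period_ns, margin=1.5):
    """Cycles until we force our own TXComFinished strobe. The value
    scales with the clock (i.e. the SATA generation) and with which
    COM sequence is being sent; the margin keeps the timeout from
    firing before a correctly reported finish would."""
    return int(OOB_DURATION_NS[sequence] * margin / clock_period_ns)

class ComFinishTimeout:
    """Generates the finished strobe from the timeout alone. The GTX's
    own TXComFinished is deliberately NOT OR'd in, because it can
    still be a stale strobe from the previous sequence."""
    def __init__(self, sequence, clock_period_ns):
        self.remaining = timeout_cycles(sequence, clock_period_ns)

    def tick(self):
        """One clock cycle; returns True once the timeout has expired."""
        self.remaining -= 1
        return self.remaining <= 0
```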
Solution for another bug:
RXElectricalIDLE is not glitch-free! To solve this problem I added a filter element on this wire, which suppresses spikes on that line.
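The filter element can be sketched as a stability filter: the output only follows RXElectricalIDLE after the new value has been seen for several consecutive cycles. A Python model of the concept (the depth of 3 cycles is my assumption; the answer does not state the actual filter length):

```python
class GlitchFilter:
    """Suppresses spikes shorter than `stable_cycles` clock cycles."""
    def __init__(self, initial=0, stable_cycles=3):
        self.out = initial
        self.stable_cycles = stable_cycles
        self.count = 0

    def tick(self, value):
        """Clocked update; returns the filtered output."""
        if value == self.out:
            self.count = 0                 # input agrees: nothing pending
        else:
            self.count += 1                # candidate change: count stability
            if self.count >= self.stable_cycles:
                self.out = value           # stable long enough: accept it
                self.count = 0
        return self.out
```

A one-cycle spike on the input never reaches the output, at the price of a few cycles of extra latency on genuine transitions.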
So currently my controller is running at SATA Gen1 (1.5 Gb/s) on a KC705 board with an SFP2SATA adapter, and I think this question is solved.
Best Answer
Depending on what you want to learn, there are many approaches possible.
You say the fully parallel combinatorial design works and fits into your FPGA. Result! Many students would stop there and write it up. However, it sounds as if you feel that this is not in the spirit of the project.
Creating your own processor design from scratch would be a project 100x the size of what you are attempting, for a general-purpose core at least. Using an existing VHDL processor core would perhaps be too easy? Designing an ALU with just the instructions needed for these calculations still sounds like quite a large detour.
The first place I would look to start serialising the design is that divide by c squared. Division is an operation that is very expensive, or impossible, to implement as full-width look-up tables. Bit-wise shift-and-subtract is perhaps the mainstream way. Look up CORDIC as an alternative way of mechanising it. You may also want to consider byte- or nybble-wide shift-and-subtract as a middle ground, with latency and resource use somewhere between the two previous methods.
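The bit-wise shift-and-subtract suggestion can be made concrete as restoring division; in hardware, each loop iteration below would be one clock cycle operating on the same remainder/quotient registers. A Python sketch of the algorithm (not HDL):

```python
def restoring_divide(dividend, divisor, width=16):
    """Unsigned restoring division: one quotient bit per iteration,
    MSB first -- the serialised equivalent of a big combinatorial
    divider."""
    assert divisor > 0 and 0 <= dividend < (1 << width)
    remainder = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        if remainder >= divisor:     # trial subtraction succeeds...
            remainder -= divisor
            quotient |= 1 << i       # ...so this quotient bit is 1
    return quotient, remainder
```

A byte- or nybble-wide variant would retire 8 or 4 bits per cycle by trying several subtraction multiples at once, trading logic for latency.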
Maybe you could look at implementing serial arithmetic as an exercise, on the grounds of saving space: hold the variables in shift registers and cycle them through a one-bit ALU with carry, LSB first. There are all sorts of interesting state-machine issues to solve.
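The serial-arithmetic idea can be sketched as a one-bit full adder with a carry flip-flop: both operands are shifted out of their registers LSB first, and one result bit is produced per clock cycle. A Python model of the concept (not HDL):

```python
def serial_add(a, b, width=8):
    """Add two unsigned numbers through a one-bit ALU, LSB first."""
    carry = 0
    result = 0
    for i in range(width):                 # one loop pass = one clock cycle
        abit = (a >> i) & 1                # bit shifted out of register A
        bbit = (b >> i) & 1                # bit shifted out of register B
        s = abit ^ bbit ^ carry            # full-adder sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # carry flip-flop
        result |= s << i                   # bit shifted into the result
    return result                          # carry out of the MSB is dropped
```

An n-bit add takes n cycles but needs only a single full adder plus the shift registers, which is exactly the space-versus-time trade the exercise is about.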