I'm trying to learn FPGA programming; my test project is a 5-stage pipelined MIPS CPU, which works.
Up until now I have been optimising for area utilisation, but this has resulted in a very slow clock speed (~50 MHz).
I have been looking at the post-map static timing report generated by ISE, but I can't make much sense of it. Below is the section for a single path (the slowest); I can't understand why this path would be so slow.
My questions:
1) If the timing delay is 80% routing (as this report seems to indicate), can I improve this? If so, how?
2) How can I reduce the logic component of the delay?
3) What is meant by "source" and "destination"? In the example below, opcode_out[1] is the source and finished[0] is the destination, but in my design these are never directly connected. One is set on the negative edge in the decode stage, the other on the positive edge in the execute stage.
4) In some places I have experimented with non-blocking assignments, but this is not possible everywhere. What effect does this have on performance? I've found mixed reports on this.
5) Finally, how likely am I to get my clock speed to 200 MHz, given that it is currently struggling to reach 50 MHz?
Paths for end point XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 (OLOGIC_X2Y2.D1), 131 paths
--------------------------------------------------------------------------------
Slack (setup path): -6.906ns (requirement - (data path - clock path skew + uncertainty))
Source: XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 (FF)
Destination: XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 (FF)
Requirement: 5.000ns
Data Path Delay: 11.871ns (Levels of Logic = 6)
Clock Path Skew: 0.000ns
Source Clock: XLXN_200 falling at 5.000ns
Destination Clock: XLXN_200 rising at 10.000ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.070ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Maximum Data Path at Slow Process Corner: XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 to XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1
Location              Delay type          Delay(ns)  Physical Resource
                                                     Logical Resource(s)
---------------------------------------------------  -------------------
SLICE_X40Y70.DQ       Tcko                    0.408  XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out<2>
                                                     XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2
SLICE_X41Y70.D5       net (fanout=8) e        0.759  XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out<2>
SLICE_X41Y70.DMUX     Tilo                    0.313  XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0402<5>1_FRB
                                                     XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22_SW0
SLICE_X39Y70.B5       net (fanout=1) e        0.377  N118
SLICE_X39Y70.B        Tilo                    0.259  XLXI_30/XLXI_5/XLXI_3/execute_inst/rd_value_out_wire<31>
                                                     XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22
SLICE_X41Y72.A6       net (fanout=3) e        0.520  XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22
SLICE_X41Y72.A        Tilo                    0.259  XLXI_30/XLXI_5/XLXI_3/decode_inst/immediate_out<4>
                                                     XLXI_30/XLXI_5/XLXI_3/execute_inst/GND_74_o_GND_74_o_equal_146_o<5>1
SLICE_X41Y72.C5       net (fanout=19) e       0.547  XLXI_30/XLXI_5/XLXI_3/execute_inst/GND_74_o_GND_74_o_equal_146_o<5>1
SLICE_X41Y72.C        Tilo                    0.259  XLXI_30/XLXI_5/XLXI_3/decode_inst/immediate_out<4>
                                                     XLXI_30/XLXI_5/XLXI_3/execute_inst/func_in[5]_PWR_69_o_equal_125_o<5>1
SLICE_X31Y70.A3       net (fanout=23) e       0.934  XLXI_30/XLXI_5/XLXI_3/execute_inst/func_in[5]_PWR_69_o_equal_125_o
SLICE_X31Y70.A        Tilo                    0.259  XLXI_30/XLXI_5/XLXI_3/decode_inst/rd_out<3>
                                                     XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0453
SLICE_X31Y70.B5       net (fanout=1) e        0.359  XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0453
SLICE_X31Y70.B        Tilo                    0.259  XLXI_30/XLXI_5/XLXI_3/decode_inst/rd_out<3>
                                                     XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_glue_set
OLOGIC_X2Y2.D1        net (fanout=2) e        5.556  XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_glue_set
OLOGIC_X2Y2.CLK0      Todck                   0.803  XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1
                                                     XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1
---------------------------------------------------  ---------------------------
Total                                      11.871ns  (2.819ns logic, 9.052ns route)
                                                     (23.7% logic, 76.3% route)
Best Answer
1) Routing is usually the dominant factor limiting timing. That is why a carry-lookahead adder is not really faster on an FPGA: the larger adder needs more routing, and the extra delay partly cancels the advantage. Your path has 6 levels of logic, which is OK; it would be very hard to get every path below 6. However, some nets have a high fanout (19 and 23 in this path), which leads to longer delays. You can try duplicating some registers to cut the fanout, or try the Xilinx tool options that do this for you, for example the fanout limit and register duplication settings in XST.
2) By changing the logic equations. The delay through a slice depends only on the path the signal takes through it, not on the logic function being computed, although of course the function dictates the path. You can see in the timing report that each level of logic costs around 0.25 to 0.3 ns (plus the routing delay to reach it).
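3) Source and destination are exactly what they say: the registers where the path starts and ends. There is a combinational path between opcode and finished, even though they are not directly connected in your code. At the falling edge of clk, opcode becomes valid and its new value propagates through your circuit. The path ends at the finished register, and the propagation has to settle before the next rising edge of clk to meet timing. That is why the requirement is 5.000 ns, half of the clock period.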
4) It has no influence if you use them appropriately. It should be a coding choice that makes the code easier to read and understand; blocking and non-blocking assignments can describe the same circuit. Problems arise when people use them without understanding the impact.
5) What's your FPGA? If it's a 7 series part, it will be hard but possible; otherwise, no way. Also, don't tighten the clock constraint drastically. When the constraints are too aggressive, the Xilinx tools freak out and the results are unreliable. A design that meets a 10 ns constraint with 1 ns of slack may fail with -2 ns of slack once the constraint is tightened only slightly. There is a breaking point where the synthesizer tries too hard to meet timing, produces a bigger design, and then fails badly when trying to place and route it.
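As a sanity check on points 1) to 3), the figures in the report can be reproduced by hand. The sketch below only re-does the report's own arithmetic in Python; every number is copied from the report in the question, and the variable names are mine.

```python
# All values are in ns and copied from the ISE timing report in the question.
requirement = 5.000   # falling edge at 5 ns to rising edge at 10 ns
data_path = 11.871
clock_skew = 0.000
uncertainty = 0.035

# Slack formula exactly as printed in the report header.
slack = requirement - (data_path - clock_skew + uncertainty)

# The seven "net" rows of the path, in report order.
net_delays = [0.759, 0.377, 0.520, 0.547, 0.934, 0.359, 5.556]
route_total = sum(net_delays)
worst_net = max(net_delays)

print(f"slack       = {slack:.3f} ns")
print(f"route total = {route_total:.3f} ns")
print(f"worst net   = {worst_net:.3f} ns "
      f"({worst_net / data_path:.1%} of the data path)")
```

The net delays add up to the 9.052 ns route figure in the report, and a single net, the 5.556 ns hop into OLOGIC_X2Y2, accounts for more than half of that. The high-fanout nets matter, but that one long route to the output register is the first thing to look at.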
I would also suggest you remove the DDR clocking (logic on both clock edges). There is no reason to have DDR logic in a processor; use a clock that is twice as fast instead. DDR adds unnecessary constraints on which slices can contain which logic, and probably inflates your routing delays. With a single clock edge, the placer will (hopefully) be free to use the optimal slice for every register.
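To put numbers on the mixed-edge clocking: a path launched on the falling edge and captured on the next rising edge gets only half a period. The sketch below uses the report's data path delay and uncertainty (with zero skew, as reported) to compute the shortest clock period this one path would tolerate when it gets half a period versus a full one. The helper `min_period_ns` is hypothetical, and the comparison assumes placement and routing stay exactly as they are, which they will not in practice.

```python
def min_period_ns(data_path_ns, uncertainty_ns, cycle_fraction):
    """Smallest clock period giving this path non-negative setup slack.

    cycle_fraction is the share of the period the path is allowed:
    0.5 for falling-to-rising (as in the report), 1.0 for rising-to-rising.
    Clock skew is assumed to be zero, as the report states for this path.
    """
    return (data_path_ns + uncertainty_ns) / cycle_fraction


DATA_PATH_NS = 11.871    # from the report
UNCERTAINTY_NS = 0.035   # from the report

half = min_period_ns(DATA_PATH_NS, UNCERTAINTY_NS, 0.5)
full = min_period_ns(DATA_PATH_NS, UNCERTAINTY_NS, 1.0)

print(f"half-cycle budget: period >= {half:.3f} ns ({1000 / half:.1f} MHz)")
print(f"full-cycle budget: period >= {full:.3f} ns ({1000 / full:.1f} MHz)")
```

Note that moving to a single-edge clock running twice as fast leaves each stage with the same time budget as before. The gain suggested above comes from the placer having more freedom, not from extra time per path.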