Electronics – How to analyse a timing report for a Xilinx FPGA

fpga, timing, timing-analysis, xilinx

I'm trying to learn FPGA programming. My test project is a 5-stage pipelined MIPS CPU, which works.

Up until now I have been optimising for area utilisation; however, this has resulted in a very slow clock speed (~50 MHz).

I have been looking at the post-map static timing report generated by ISE, but I can't make much sense of it. Below is the section for a single path (the slowest); I can't understand why this path would be so slow.

My questions:

1) If the timing delay is 80% routing (as this report seems to indicate), can I improve this? If so, how?

2) How can I reduce the logic component of the timing delay?

3) What is meant by "source" and "destination"? In the example below, opcode_out[1] is the source and finished[0] is the destination; however, in my design these are never directly connected. One is set on the negative edge in the decode stage, the other on the positive edge in the execute stage.

4) In some places I have experimented with non-blocking assignments, though this is not possible everywhere. What performance effects does this have? I've found mixed reports on this.

5) Finally, what is the likelihood of me getting my clock speed to 200 MHz, given that it is currently struggling to reach 50 MHz?

Paths for end point XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 (OLOGIC_X2Y2.D1), 131 paths 
 -------------------------------------------------------------------------------- 
 Slack (setup path):     -6.906ns (requirement - (data path - clock path skew + uncertainty)) 
   Source:               XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 (FF) 
   Destination:          XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 (FF) 
   Requirement:          5.000ns 
   Data Path Delay:      11.871ns (Levels of Logic = 6) 
   Clock Path Skew:      0.000ns 
   Source Clock:         XLXN_200 falling at 5.000ns 
   Destination Clock:    XLXN_200 rising at 10.000ns 
   Clock Uncertainty:    0.035ns 

   Clock Uncertainty:          0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE 
     Total System Jitter (TSJ):  0.070ns 
     Total Input Jitter (TIJ):   0.000ns 
     Discrete Jitter (DJ):       0.000ns 
     Phase Error (PE):           0.000ns 

   Maximum Data Path at Slow Process Corner: XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 to XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 
     Location             Delay type         Delay(ns)  Physical Resource 
                                                Logical Resource(s) 
     -------------------------------------------------  ------------------- 
     SLICE_X40Y70.DQ      Tcko                  0.408   XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out<2> 
                                                XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 
     SLICE_X41Y70.D5      net (fanout=8)     e  0.759   XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out<2> 
     SLICE_X41Y70.DMUX    Tilo                  0.313   XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0402<5>1_FRB 
                                                XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22_SW0 
     SLICE_X39Y70.B5      net (fanout=1)     e  0.377   N118 
     SLICE_X39Y70.B       Tilo                  0.259   XLXI_30/XLXI_5/XLXI_3/execute_inst/rd_value_out_wire<31> 
                                                XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22 
     SLICE_X41Y72.A6      net (fanout=3)     e  0.520   XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22 
     SLICE_X41Y72.A       Tilo                  0.259   XLXI_30/XLXI_5/XLXI_3/decode_inst/immediate_out<4> 
                                                XLXI_30/XLXI_5/XLXI_3/execute_inst/GND_74_o_GND_74_o_equal_146_o<5>1 
     SLICE_X41Y72.C5      net (fanout=19)    e  0.547   XLXI_30/XLXI_5/XLXI_3/execute_inst/GND_74_o_GND_74_o_equal_146_o<5>1 
     SLICE_X41Y72.C       Tilo                  0.259   XLXI_30/XLXI_5/XLXI_3/decode_inst/immediate_out<4> 
                                                XLXI_30/XLXI_5/XLXI_3/execute_inst/func_in[5]_PWR_69_o_equal_125_o<5>1 
     SLICE_X31Y70.A3      net (fanout=23)    e  0.934   XLXI_30/XLXI_5/XLXI_3/execute_inst/func_in[5]_PWR_69_o_equal_125_o 
     SLICE_X31Y70.A       Tilo                  0.259   XLXI_30/XLXI_5/XLXI_3/decode_inst/rd_out<3> 
                                                XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0453 
     SLICE_X31Y70.B5      net (fanout=1)     e  0.359   XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0453 
     SLICE_X31Y70.B       Tilo                  0.259   XLXI_30/XLXI_5/XLXI_3/decode_inst/rd_out<3> 
                                                XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_glue_set 
     OLOGIC_X2Y2.D1       net (fanout=2)     e  5.556   XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_glue_set 
     OLOGIC_X2Y2.CLK0     Todck                 0.803   XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 
                                                XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 
     -------------------------------------------------  --------------------------- 
     Total                                     11.871ns (2.819ns logic, 9.052ns route) 
                                                (23.7% logic, 76.3% route) 

Best Answer

1) Routing is usually the dominant factor limiting timing in an FPGA. That is why a carry-lookahead adder is not really faster in an FPGA: the extra logic requires more routing, whose delay partly cancels the advantage. Your path has six levels of logic, which is OK; it would be very hard to get every path below six. However, some nets have a high fanout, which leads to longer routing delays. You can try duplicating some registers to cut the fanout, or try the Xilinx register-duplication options.
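Register duplication can also be done by hand. A minimal sketch, assuming a high-fanout opcode register (the signal names here are illustrative, not taken from the actual design); the KEEP attribute stops XST from merging the two copies back into one:

```verilog
// Two copies of the same register, each driving roughly half the loads.
// (* KEEP = "true" *) prevents XST from optimising the duplicate away.
(* KEEP = "true" *) reg [5:0] opcode_out_a;
(* KEEP = "true" *) reg [5:0] opcode_out_b;

always @(negedge clk) begin
    opcode_out_a <= opcode;  // feeds, e.g., the ALU-control logic
    opcode_out_b <= opcode;  // feeds, e.g., the finished/stall logic
end
```

Alternatively, the XST option `-register_duplication yes` lets the tools attempt this automatically.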

2) By changing the logic equations... The delay through a slice is affected only by the path the signal takes, not by the logic operation itself. Of course, the path it takes is dictated by the logic operation. You can see in the timing report that each slice costs around 0.250 to 0.300 ns (plus the routing delays...).
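Summing the logic-delay entries from the report above confirms this:

```
Tcko  (clock-to-out of source FF)        0.408
Tilo  (LUT delays: 0.313 + 5 × 0.259)    1.608
Todck (setup of destination OLOGIC FF)   0.803
                                        ------
total logic                              2.819 ns  (as reported)
```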

3) Source and destination are exactly what they say: there is a path between opcode and finished. At the falling edge of clk, opcode becomes valid and its new value propagates through your circuit. The path ends at the finished register, and the propagation has to settle before the next rising edge of clk to meet timing. Note that because the path launches on a falling edge and is captured on the following rising edge, the requirement is only half a clock period (5 ns here, not 10 ns).
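Putting numbers to the slack formula from the report header:

```
slack = requirement - (data path - clock path skew + uncertainty)
      = 5.000 - (11.871 - 0.000 + 0.035)
      = -6.906 ns
```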

4) It has no influence if you use them appropriately. It should be a coding choice made to keep the code easy to read and understand; the two can describe the same circuit. Problems arise when people use them without understanding their impact.
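The usual convention (a general Verilog guideline, not specific to this design) is non-blocking assignments in clocked always blocks and blocking assignments in combinational ones. The difference matters for pipelines:

```verilog
// Non-blocking in sequential logic: stage2 samples the OLD stage1,
// so this really is a two-stage pipeline.
always @(posedge clk) begin
    stage1 <= data_in;
    stage2 <= stage1;
end

// Blocking in purely combinational logic: evaluated in order,
// no registers implied.
always @* begin
    sum = a + b;
end
```

Had the clocked block used blocking assignments (`stage1 = data_in; stage2 = stage1;`), stage2 would see the new value of stage1 and the two registers would collapse into one: same-looking code, different circuit. Neither style is inherently faster.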

5) What's your FPGA? If it's a 7-series part, 200 MHz will be hard but possible; otherwise, no way. Also, don't tighten the clock constraint drastically in one step. When the constraints are too aggressive, the Xilinx tools give up in unpredictable ways and the results are untrustworthy. A design that meets a 10 ns constraint with 1 ns of slack may fail with -2 ns of slack when constrained to 9 ns: there is a breaking point where the synthesizer tries too hard to meet timing, and the resulting, bigger design then fails drastically in place-and-route.

I would also suggest you remove the dual-edge clocking. There is no reason to have DDR-style logic in a processor; use a clock that is twice as fast instead. Clocking on both edges adds unnecessary constraints on which slices can contain which logic, and probably inflates your routing delays. With a single clock edge, placement will (hopefully) use the optimal slice for every register.
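A sketch of that change, with illustrative names (your actual stage registers will differ): replace the mixed-edge clocking with a single edge, doubling the clock frequency if the half-period spacing between stages still has to be preserved:

```verilog
// Before: decode on the falling edge, execute on the rising edge.
always @(negedge clk) opcode_out <= opcode;      // decode stage
always @(posedge clk) finished   <= finish_next; // execute stage

// After: every stage on the same rising edge of one clock
// (run that clock at twice the frequency if the old half-period
// stage spacing is still required).
always @(posedge clk) begin
    opcode_out <= opcode;
    finished   <= finish_next;
end
```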