Electronic – 32-way Mux Produces Horrible Timing Problems

timing, verilog

I'm coding a 32-way mux in Verilog.

The input is a counter which counts from 0 to 31, incrementing each clock cycle. Each counter value selects a different slice of a vector as an output.

In my state machine process, the counter is generated as follows:

    // Complicated stuff!
    if (counter < C_NUM_CYCLES-1) begin
       counter       <= counter + 1;
    end

Using the count, I select a 32-bit slice from a 1024-bit vector. A count of 0 selects the least-significant 32 bits of the vector, 1 selects the next 32, and so on. This slice is generated in a separate clocked process:

    S1_input_val <= 1;
    S1_input     <= input_vector_i[(counter_z[0]+1)*(C_S1_INPUT_LENGTH)-1 -: C_S1_INPUT_LENGTH];

This 32-bit signal feeds another entity, which I want to use to process the entire 1024-bit number one 32-bit slice at a time.

It works perfectly fine in simulation. Implementation produces a horrible timing report:

                                    Delay   Cumulative
CARRY4 (Prop_carry4_CI_CO[3])   (r) 0.117   6.879   Site: SLICE_X54Y3   counter_reg[12]_i_1/CO[3]
CARRY4 (Prop_carry4_CI_CO[3])   (r) 0.117   6.996   Site: SLICE_X54Y4   counter_reg[16]_i_1/CO[3]
...
CARRY4 (Prop_carry4_CI_CO[3])   (r) 0.114   39.719  Site: SLICE_X55Y148 counter_reg[1016]_i_1/CO[3]
CARRY4 (Prop_carry4_CI_O[1])    (r) 0.334   40.053  Site: SLICE_X55Y149 counter_reg[1020]_i_1/CO[3]
Arrival Time                                40.053  <- ns!

The issue remains even after adding several register stages, and everything runs in clocked processes. The mux select logic generated by the tools seems to be the cause, and I don't know how to get in there to add register stages.

Is this just an inherent limitation of the hardware, or is there an issue with my mux and the way I am implementing it? Can I do it better? How do I improve the timing?

One of the things that baffles me most is what the signal called counter_reg in the timing report actually is. Its index goes up to 1024, whereas the actual counter only goes up to 32. I've spent some time digging through the FPGA editor but haven't had much success.

Tool is Vivado 2014.1.

Edit:

I thought that maybe the signals going into the next entity were not being registered immediately. That's not the case; this is the code on the receiving end of the inputs:

// Register in inputs
always@(posedge clk) begin        
    if(rst) begin
      vector_i     <= 0;
      vector_val_i <= 0;
    end 
    else begin
      if (input_vector_val == 1) begin
          vector_i     <= input_vector;
          vector_val_i <= input_vector_val;
      end
      else begin
          vector_i     <= 0;
          vector_val_i <= 0;
      end
    end
end 

All relevant code:

parameter C_S1_INPUT_LENGTH      = 32,
parameter C_NUM_CYCLES_BITS      = 5

...

reg [C_NUM_CYCLES_BITS-1:0]        counter;
reg [C_NUM_CYCLES_BITS-1:0]        counter_z [3:0];  
reg [C_NUM_STATES-1:0]             current_state;
reg [C_S1_INPUT_LENGTH-1:0]        S1_input;
reg                                S1_input_val;   

...
  // Latch in inputs
  always@(posedge clk) begin        
      if(rst) begin
        current_state      <= S_IDLE;
        counter            <= 0;
        input_vector_i     <= 0;
        S2_input_val       <= 0;
      end 
      else begin
        case (current_state)
          S_IDLE:
            begin
                counter       <= 0;
                S2_input_val  <= 0;
                if (input_vector_val == 1) 
                  begin
                    current_state <= S_STAGE_ONE;
                    input_vector_i <= input_vector;
                  end
                else
                  current_state <= S_IDLE;
            end
          S_STAGE_ONE: 
            begin
                S2_input_val <= 0;
                if (counter < C_NUM_CYCLES-1) begin
                  counter       <= counter + 1;
                end

                if (S1_valid == ones) 
                  begin
                    current_state <= S_STAGE_TWO;
                    S2_input_val  <= 1;
                  end
                else 
                  current_state <= S_STAGE_ONE;
            end
          S_STAGE_TWO:
            // not relevant
          default:
            $display("wrong state!");
        endcase
      end
  end // always@ (posedge clk)

  always@(posedge clk) begin        
      if(rst) begin
        S1_input_val <= 0;
        S1_input     <= 0;
      end 
      else begin
        if (current_state == S_STAGE_ONE) begin
            S1_input_val <= 1;
            S1_input     <= input_vector_i[(counter_z[0]+1)*(C_S1_INPUT_LENGTH)-1 -: C_S1_INPUT_LENGTH];
        end
        else begin
            S1_input     <= 0;
            S1_input_val <= 0;
        end
      end
  end 

 always@(posedge clk) begin
     counter_z[0] <= counter;
     counter_z[1] <= counter_z[0];
     counter_z[2] <= counter_z[1];
     counter_z[3] <= counter_z[2];
 end 

Best Answer

It takes a lot of LUTs to build a large mux. For example, if you have 6-input LUTs, you can do a 4:1 mux in one LUT, but it takes 11 LUTs to do a 1-bit 32:1 mux.
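
To see where the 11 comes from, the 32:1 mux can be written out as a tree: eight 4:1 muxes on the low two select bits, two 4:1 muxes on the next two bits, and one 2:1 mux on the top bit (8 + 2 + 1 = 11). The sketch below is only an illustration. The module and signal names are made up, and a real synthesis tool is free to map it differently, for instance onto the dedicated mux resources inside a slice.

```verilog
// Illustrative 1-bit 32:1 mux written as an explicit three-level tree.
// Module and signal names are hypothetical, not from the question's code.
module mux32_tree (
    input  wire [31:0] d,    // 32 one-bit data inputs
    input  wire [4:0]  sel,  // select
    output wire        y
);
    // Level 1: eight 4:1 muxes (4 data + 2 select = 6 inputs, one LUT6 each)
    wire [7:0] l1;
    genvar g;
    generate
        for (g = 0; g < 8; g = g + 1) begin : g_l1
            assign l1[g] = d[4*g + sel[1:0]];
        end
    endgenerate

    // Level 2: two 4:1 muxes, each picking among four level-1 outputs
    wire [1:0] l2;
    assign l2[0] = l1[{1'b0, sel[3:2]}];
    assign l2[1] = l1[{1'b1, sel[3:2]}];

    // Level 3: one 2:1 mux on the top select bit
    assign y = l2[sel[4]];
endmodule
```

If it were mapped this way, the 32-bit slice in the question would need this structure once per output bit, so on the order of 32 × 11 = 352 LUTs, all driven by the same five select bits.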

Your counter is getting replicated (as counter_reg) so that the fanout on any given bit is not excessive. (Although I'm not really sure where the 1024 comes from.)

Since you don't really need "random" access to the sub-fields of input_vector_i — just sequential access — have you considered using a shift register instead?
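
A minimal sketch of the shift-register idea, assuming the slices are always consumed in order starting from the least-significant one. The module name, the load and shift ports, and the C_NUM_CYCLES default of 32 are assumptions made for illustration; they are not taken from the question's code.

```verilog
// Sketch only: module name, ports and the load/shift handshake are assumed.
module slice_shifter #(
    parameter C_S1_INPUT_LENGTH = 32,  // bits per slice
    parameter C_NUM_CYCLES      = 32   // number of slices (assumed value)
) (
    input  wire                                      clk,
    input  wire                                      rst,
    input  wire                                      load,   // capture a new vector
    input  wire                                      shift,  // advance to the next slice
    input  wire [C_NUM_CYCLES*C_S1_INPUT_LENGTH-1:0] input_vector,
    output wire [C_S1_INPUT_LENGTH-1:0]              slice
);
    reg [C_NUM_CYCLES*C_S1_INPUT_LENGTH-1:0] shreg;

    always @(posedge clk) begin
        if (rst)
            shreg <= 0;
        else if (load)
            shreg <= input_vector;
        else if (shift)
            // Drop the slice just consumed; the next one moves into the LS bits.
            shreg <= shreg >> C_S1_INPUT_LENGTH;
    end

    // The current slice is always the fixed LS bits, so no wide mux is needed.
    assign slice = shreg[C_S1_INPUT_LENGTH-1:0];
endmodule
```

Each flip-flop now only chooses between holding, loading, and taking the bit one slice width further up, so the counter no longer sits in the data path. It is only needed to know when all slices have been sent.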
