I'm coding a 32-way mux in Verilog.
The input is a counter which counts from 0 to 31, incrementing each clock cycle. Each counter value selects a different slice of a vector as an output.
In my state machine process, the counter is generated as follows:
// Complicated stuff!
if (counter < C_NUM_CYCLES-1) begin
counter <= counter + 1;
end
Using the count, I select a slice of 32 bits from a 1024 bit vector. Input 0 selects the LS 32 bits of the vector, 1 selects the next 32 and so on. This slice is generated in a separate clocked process:
S1_input_val <= 1;
S1_input <= input_vector_i[(counter_z[0]+1)*(C_S1_INPUT_LENGTH)-1 -: C_S1_INPUT_LENGTH];
This 32-bit signal feeds another entity, which I want to use to process the entire 1024-bit number one 32-bit slice at a time.
It works perfectly fine in simulation. Implementation produces a horrible timing report:
Delay Cumulative
CARRY4 (Prop_carry4_CI_CO[3]) (r) 0.117 6.879 Site: SLICE_X54Y3 counter_reg[12]_i_1/CO[3]
CARRY4 (Prop_carry4_CI_CO[3]) (r) 0.117 6.996 Site: SLICE_X54Y4 counter_reg[16]_i_1/CO[3]
...
CARRY4 (Prop_carry4_CI_CO[3]) (r) 0.114 39.719 Site: SLICE_X55Y148 counter_reg[1016]_i_1/CO[3]
CARRY4 (Prop_carry4_CI_O[1]) (r) 0.334 40.053 Site: SLICE_X55Y149 counter_reg[1020]_i_1/CO[3]
Arrival Time 40.053 <- ns!
This issue remains even after adding a number of stages of registering, and everything is running in clocked processes. It seems like just the mux select mechanism being generated by the tools is causing this. I don't know how I can get in there to add in registering stages.
Is this just an inherent limitation of the hardware, or is there an issue with my mux and the way I am implementing it? Can I do it better? How do I improve the timing?
One of the things that baffles me most is what the signal referred to as counter_reg in the timing report actually is. It goes up to 1024, whereas the actual counter only goes up to 32. I've spent some time digging through the FPGA editor but haven't had much success.
Tool is Vivado 2014.1.
Edit:
Just thought that maybe the signals going into the next entity were not being registered ASAP. That's not the case, this is the code on the receiving end of the inputs:
// Register in inputs
always@(posedge clk) begin
if(rst) begin
vector_i <= 0;
vector_val_i <= 0;
end
else begin
if (input_vector_val == 1) begin
vector_i <= input_vector;
vector_val_i <= input_vector_val;
end
else begin
vector_i <= 0;
vector_val_i <= 0;
end
end
end
All relevant code:
parameter C_S1_INPUT_LENGTH = 32,
parameter C_NUM_CYCLES_BITS = 5
...
reg [C_NUM_CYCLES_BITS-1:0] counter;
reg [C_NUM_CYCLES_BITS-1:0] counter_z [3:0];
reg [C_NUM_STATES-1:0] current_state;
reg [C_S1_INPUT_LENGTH-1:0] S1_input;
reg S1_input_val;
...
// Latch in inputs
always@(posedge clk) begin
if(rst) begin
current_state <= S_IDLE;
counter <= 0;
input_vector_i <= 0;
S2_input_val <= 0;
end
else begin
case (current_state)
S_IDLE:
begin
counter <= 0;
S2_input_val <= 0;
if (input_vector_val == 1)
begin
current_state <= S_STAGE_ONE;
input_vector_i <= input_vector;
end
else
current_state <= S_IDLE;
end
S_STAGE_ONE:
begin
S2_input_val <= 0;
if (counter < C_NUM_CYCLES-1) begin
counter <= counter + 1;
end
if (S1_valid == ones)
begin
current_state <= S_STAGE_TWO;
S2_input_val <= 1;
end
else
current_state <= S_STAGE_ONE;
end
S_STAGE_TWO:
// not relevant
default:
$display("wrong state!");
endcase
end
end // always@ (posedge clk)
always@(posedge clk) begin
if(rst) begin
S1_input_val <= 0;
S1_input <= 0;
end
else begin
if (current_state == S_STAGE_ONE) begin
S1_input_val <= 1;
S1_input <= input_vector_i[(counter_z[0]+1)*(C_S1_INPUT_LENGTH)-1 -: C_S1_INPUT_LENGTH];
end
else begin
S1_input <= 0;
S1_input_val <= 0;
end
end
end
always@(posedge clk) begin
counter_z[0] <= counter;
counter_z[1] <= counter_z[0];
counter_z[2] <= counter_z[1];
counter_z[3] <= counter_z[2];
end
Best Answer
It takes a lot of LUTs to build a large mux. For example, if you have 6-input LUTs, you can do a 4:1 mux in one LUT, but it takes 11 LUTs to do a 1-bit 32:1 mux.
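To make that count concrete: 32 inputs need eight 4:1 muxes, their 8 outputs need two more 4:1 muxes, and the last 2 outputs need one 2:1 mux, so 8 + 2 + 1 = 11 LUTs per output bit across three LUT levels. If you do want to keep a mux, one way to get registers "inside" it is to write those three levels out yourself. The following is only a rough sketch: the module name, port names and the three-cycle latency are my assumptions, not anything from your design, and I can't tell from the report excerpt whether the mux is really what produces the carry-chain path.

```verilog
// Hypothetical module: names and ports are illustrative only.
module mux32_pipelined #(
    parameter W = 32                     // width of each slice
) (
    input  wire            clk,
    input  wire [32*W-1:0] vec,          // 32 slices, slice 0 in the LS bits
    input  wire [4:0]      sel,          // slice select
    output reg  [W-1:0]    q             // slice 'sel', three clocks later
);
    reg [W-1:0] s1 [0:7];                // stage 1 results (eight 4:1 muxes)
    reg [W-1:0] s2 [0:1];                // stage 2 results (two 4:1 muxes)
    reg [2:0]   sel_d1;                  // sel[4:2] delayed to line up with s1
    reg         sel_d2;                  // sel[4] delayed to line up with s2
    integer     i;

    always @(posedge clk) begin
        // Stage 1: sel[1:0] picks one slice out of each group of four.
        for (i = 0; i < 8; i = i + 1)
            s1[i] <= vec[(4*i + sel[1:0])*W +: W];
        sel_d1 <= sel[4:2];

        // Stage 2: the delayed sel[3:2] picks one of four stage-1 results.
        for (i = 0; i < 2; i = i + 1)
            s2[i] <= s1[4*i + sel_d1[1:0]];
        sel_d2 <= sel_d1[2];

        // Stage 3: the delayed sel[4] picks the final result.
        q <= s2[sel_d2];
    end
endmodule
```

Each stage is at most one LUT deep per bit, at the cost of three cycles of latency between sel and q, so any valid flag travelling with the data would need the same delay.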
Your counter is getting replicated (as counter_reg) so that the fanout on any given bit is not excessive. (Although I'm not really sure where the 1024 comes from.)
Since you don't really need "random" access to the sub-fields of input_vector_i, just sequential access, have you considered using a shift register instead?
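A minimal sketch of what I mean, assuming the slices are wanted LS-first, one per clock. The module, its ports (load, slice, slice_val) and the CNT_W parameter are names I made up for illustration, so adapt them to your state machine.

```verilog
// Hypothetical module: names and ports are illustrative only.
module slice_shifter #(
    parameter W     = 32,                // slice width (C_S1_INPUT_LENGTH)
    parameter N     = 32,                // number of slices in the vector
    parameter CNT_W = 6                  // enough bits to hold the value N
) (
    input  wire           clk,
    input  wire           rst,
    input  wire           load,          // pulse high to capture a new vector
    input  wire [N*W-1:0] vec_in,
    output wire [W-1:0]   slice,         // current slice, LS slice first
    output wire           slice_val      // high while 'slice' is meaningful
);
    reg [N*W-1:0]   shreg;               // holds the slices not yet consumed
    reg [CNT_W-1:0] remaining;           // number of slices still to present

    always @(posedge clk) begin
        if (rst) begin
            shreg     <= 0;
            remaining <= 0;
        end
        else if (load) begin
            shreg     <= vec_in;         // slice 0 appears on the next clock
            remaining <= N;
        end
        else if (remaining != 0) begin
            shreg     <= shreg >> W;     // next slice moves into the LS bits
            remaining <= remaining - 1;
        end
    end

    // The output is taken straight from register bits, so there is no
    // mux (and no select fanout) between the vector and the consumer.
    assign slice     = shreg[W-1:0];
    assign slice_val = (remaining != 0);
endmodule
```

After a load, slice presents slice 0, then slice 1, and so on for N clocks, after which slice_val drops. The only logic in front of each flip-flop is a choice between loading, shifting and holding, which should be far easier on timing than a 32:1 select.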