Electronic – the purpose of this Verilog code for implementing 3-port Block RAM

Tags: fpga, lattice, lm32, synthesis, verilog

LatticeMico32 (LM32) is a royalty-free CPU that I use to study how a pipelined in-order CPU may be implemented.

One point I have trouble with is how the register file is implemented. On a pipelined CPU, you will normally have at least three accesses to the register file on a given clock cycle:

  • 2 reads, one for each operand of the execution units.
  • 1 write from the writeback stage.

LM32 provides three ways to implement the register file:

  • Block RAM inference where reads/writes have extra logic to avoid parallel reads/writes.
  • Block RAM inference with out-of-phase clocks which don't require extra logic.
  • Distributed RAM inference.
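As a rough illustration of the first option, here is a minimal sketch of a 2-read/1-write register file built from two mirrored memories with synchronous reads. The module and port names are hypothetical and are not taken from the LM32 sources; whether a given tool maps this to block RAM depends on the tool and the target device.

```verilog
// Hypothetical sketch: 2 read ports + 1 write port from two mirrored RAMs.
// Names are illustrative only and do not come from the LM32 sources.
module regfile_2r1w_sketch (
    input  wire        clk,
    input  wire        we,       // write enable from the writeback stage
    input  wire [4:0]  waddr,
    input  wire [31:0] wdata,
    input  wire [4:0]  raddr0,
    input  wire [4:0]  raddr1,
    output reg  [31:0] rdata0,
    output reg  [31:0] rdata1
);
    // Two copies of the same 32 x 32 storage. Each copy needs only one
    // write port and one read port.
    reg [31:0] mem0 [0:31];
    reg [31:0] mem1 [0:31];

    always @(posedge clk) begin
        if (we) begin
            mem0[waddr] <= wdata;   // every write goes to both copies
            mem1[waddr] <= wdata;
        end
        // Registered reads: data appears one cycle after the address.
        // In simulation, a same-edge write to the same address returns
        // the OLD contents here (read-first behaviour).
        rdata0 <= mem0[raddr0];
        rdata1 <= mem1[raddr1];
    end
endmodule
```

Because a same-cycle write is not visible at the read ports in this form, some bypass logic is needed on top of it if the pipeline expects to see the written value.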

In practice, even with distributed RAM inference, I have seen both Xilinx ISE and yosys infer a block RAM with in-phase read and write clocks. In addition, I've seen both synthesizers infer at least part of the extra logic that the lm32 explicitly includes for a positive-edge Block RAM register file.

The inferred extra logic enables transparent reads. I have pasted the code here for lm32's explicit implementation, but I know from experimentation that yosys generates effectively the same code to place the register file in block RAM on iCE40:

// Register file
`ifdef CFG_EBR_POSEDGE_REGISTER_FILE
   /*----------------------------------------------------------------------
    Register File is implemented using EBRs. There can be three accesses to
    the register file in each cycle: two reads and one write. On-chip block
    RAM has two read/write ports. To accommodate three accesses, two on-chip
    block RAMs are used (each register file "write" is made to both block
    RAMs).
    One limitation of the on-chip block RAMs is that one cannot perform a
    read and write to same location in a cycle (if this is done, then the
    data read out is indeterminate).
    ----------------------------------------------------------------------*/
   wire [31:0] regfile_data_0, regfile_data_1;
   reg [31:0]  w_result_d;
   reg         regfile_raw_0, regfile_raw_0_nxt;
   reg         regfile_raw_1, regfile_raw_1_nxt;

   /*----------------------------------------------------------------------
    Check if read and write is being performed to same register in current
    cycle? This is done by comparing the read and write IDXs.
    ----------------------------------------------------------------------*/
   always @(reg_write_enable_q_w or write_idx_w or instruction_f)
     begin
        if (reg_write_enable_q_w
            && (write_idx_w == instruction_f[25:21]))
          regfile_raw_0_nxt = 1'b1;
        else
          regfile_raw_0_nxt = 1'b0;

        if (reg_write_enable_q_w
            && (write_idx_w == instruction_f[20:16]))
          regfile_raw_1_nxt = 1'b1;
        else
          regfile_raw_1_nxt = 1'b0;
     end

   /*----------------------------------------------------------------------
    Select latched (delayed) write value or data from register file. If
    read in previous cycle was performed to register written to in same
    cycle, then latched (delayed) write value is selected.
    ----------------------------------------------------------------------*/
   always @(regfile_raw_0 or w_result_d or regfile_data_0)
     if (regfile_raw_0)
       reg_data_live_0 = w_result_d;
     else
       reg_data_live_0 = regfile_data_0;

   /*----------------------------------------------------------------------
    Select latched (delayed) write value or data from register file. If
    read in previous cycle was performed to register written to in same
    cycle, then latched (delayed) write value is selected.
    ----------------------------------------------------------------------*/
   always @(regfile_raw_1 or w_result_d or regfile_data_1)
     if (regfile_raw_1)
       reg_data_live_1 = w_result_d;
     else
       reg_data_live_1 = regfile_data_1;

   /*----------------------------------------------------------------------
    Latch value written to register file
    ----------------------------------------------------------------------*/
   always @(posedge clk_i `CFG_RESET_SENSITIVITY)
     if (rst_i == `TRUE)
       begin
          regfile_raw_0 <= 1'b0;
          regfile_raw_1 <= 1'b0;
          w_result_d <= 32'b0;
       end
     else
       begin
          regfile_raw_0 <= regfile_raw_0_nxt;
          regfile_raw_1 <= regfile_raw_1_nxt;
          w_result_d <= w_result;
       end

// Two Block RAM instantiations follow to get 2 read/1 write port.

Transparent reads ensure that when one port writes an address that another port is reading on the same clock edge, the read port returns the newly written value (assuming the read and write clocks are synchronous). The lm32 pipeline relies on the read ports immediately reflecting the written-back register value.
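Stripped of the lm32-specific signal names, the transparent-read idea can be sketched for a single read port as follows. This is a minimal sketch under the assumption of one common clock for both ports; all names are hypothetical.

```verilog
// Hypothetical sketch of a transparent (write-first) read around a
// synchronous RAM. One clock for both ports is assumed.
module ram_1r1w_transparent_sketch (
    input  wire        clk,
    input  wire        we,
    input  wire [4:0]  waddr,
    input  wire [31:0] wdata,
    input  wire [4:0]  raddr,
    output wire [31:0] rdata
);
    reg [31:0] mem [0:31];
    reg [31:0] mem_q;      // registered RAM output (old data on a collision)
    reg [31:0] wdata_q;    // write data delayed by one cycle
    reg        collide_q;  // read and write hit the same address last edge

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        mem_q     <= mem[raddr];
        wdata_q   <= wdata;
        collide_q <= we && (waddr == raddr);
    end

    // On a collision, return the value that was written instead of
    // whatever the RAM read port produced.
    assign rdata = collide_q ? wdata_q : mem_q;
endmodule
```

Here `collide_q` and `wdata_q` play the roles of `regfile_raw_0` and `w_result_d` in the lm32 code above.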

However, there is extra glue logic for dealing with a stall of the pipeline and I'm not certain what this code accomplishes, even after studying the CPU implementation in detail. I have commented the code below for convenience:

 `ifdef CFG_EBR_POSEDGE_REGISTER_FILE
 // Buffer data read from register file, in case a stall occurs, and watch for
 // any writes to the modified registers
 always @(posedge clk_i `CFG_RESET_SENSITIVITY)
 begin
    if (rst_i == `TRUE)
    begin
        use_buf <= `FALSE;
        reg_data_buf_0 <= {`LM32_WORD_WIDTH{1'b0}};
        reg_data_buf_1 <= {`LM32_WORD_WIDTH{1'b0}};
    end
    else
    begin
        if (stall_d == `FALSE)
            use_buf <= `FALSE;
        else if (use_buf == `FALSE)
        begin
            // If we stall in the decode stage, unconditionally
            // buffer the register file values from the read ports.
            // They will be used instead when the stall ends.
            reg_data_buf_0 <= reg_data_live_0;
            reg_data_buf_1 <= reg_data_live_1;
            use_buf <= `TRUE;
        end
        if (reg_write_enable_q_w == `TRUE)
        // If either register's address matches the register
        // to be written back, replace the buffered read values.
        begin
            if (write_idx_w == read_idx_0_d)
                reg_data_buf_0 <= w_result;
            if (write_idx_w == read_idx_1_d)
                reg_data_buf_1 <= w_result;
        end
    end
end
`endif

Why is this logic required, and only for in-phase read/write clocks at that? Is this code similar to any other common idiom for reading the correct data from block RAM as implemented on FPGAs (i.e. similar to how synthesizers will infer transparent read/write code)?

I would have figured that during a stall of the decode stage of a RISC CPU, logic that ensures transparent reads would be enough to make sure the read ports have the correct data output when the stall ends. By the time a full clock cycle has passed after a simultaneous read/write has occurred to the same address on different ports, shouldn't the read ports' data output(s) have settled to the new value, so we only need to buffer the most immediate data written to the write port?

I've synthesized this CPU many times using the distributed RAM inference alone (inferred as block RAM), so either this logic is not required, or ISE and yosys are capable of inferring the extra glue logic required.

Best Answer

This has been unanswered for a day and I think I know why. Once Verilog code becomes bigger and more complex, it is very difficult to see all the temporal relations. Even if the author puts in lots of comments (you said you added the comments yourself, so I assume that was not the case here), you find that you have to run a simulation to see how it all hangs together.
To find out why that code is needed, remove it and see where things go wrong.

Having said that, I can think of a possible scenario.

  • If the register file is a synchronous memory the data-out is lagging by one cycle.
  • The addresses to the register file are not stopped immediately in a decoder stall.
  • The data coming out is lost during the stall so must be captured.

This is not easy to describe in words, so here is a timing diagram of that possible scenario:

[Image: timing diagram of the stall scenario]

In cycle 2 the need for a stall is detected. For some reason the addresses cannot be stopped.
Cycle 3 is our extra stall cycle. Now the stall has reached the address logic, so it will stop.
In cycle 4 we want to continue, but the data 'M1' is lost unless we store it during the stall. We then use the stored copy in cycle 4, and in cycle 5 all is OK again.
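Reduced to its core, and leaving out the writeback snooping that the lm32 code also performs, the capture described in this scenario might look like the sketch below. The names are hypothetical, and a synchronous reset is assumed for brevity.

```verilog
// Hypothetical sketch: hold the output of a synchronous RAM across a stall.
module stall_hold_sketch (
    input  wire        clk,
    input  wire        rst,
    input  wire        stall,    // consumer cannot advance this cycle
    input  wire [31:0] ram_q,    // registered output of the RAM
    output wire [31:0] data      // value the consumer should use
);
    reg        use_hold;
    reg [31:0] hold_q;

    always @(posedge clk) begin
        if (rst) begin
            use_hold <= 1'b0;
            hold_q   <= 32'b0;
        end else if (!stall) begin
            use_hold <= 1'b0;     // back to live data after this cycle
        end else if (!use_hold) begin
            hold_q   <= ram_q;    // first stall cycle: save the live value
            use_hold <= 1'b1;
        end
    end

    // use_hold is still set during the first cycle after the stall ends,
    // so that cycle sees the saved value rather than the RAM output.
    assign data = use_hold ? hold_q : ram_q;
endmodule
```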

Note that with an asynchronous register file the problem does not occur.
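For comparison, a register file with asynchronous (combinational) reads could be sketched as below. The read data follows the read address within the same cycle, so there is no one-cycle lag to compensate for. The names are hypothetical, and how a synthesizer maps this (LUT RAM, flip-flops, or something else) depends on the tool and the device.

```verilog
// Hypothetical sketch: synchronous write, asynchronous read.
module regfile_async_read_sketch (
    input  wire        clk,
    input  wire        we,
    input  wire [4:0]  waddr,
    input  wire [31:0] wdata,
    input  wire [4:0]  raddr0,
    input  wire [4:0]  raddr1,
    output wire [31:0] rdata0,
    output wire [31:0] rdata1
);
    reg [31:0] mem [0:31];

    always @(posedge clk)
        if (we)
            mem[waddr] <= wdata;

    // Combinational reads: the outputs track the current addresses.
    assign rdata0 = mem[raddr0];
    assign rdata1 = mem[raddr1];
endmodule
```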


As a side note: I don't agree with your comment "unconditionally buffer the register file values". It is not 'unconditional', because the code that follows, "if (reg_write_enable_q_w ...", takes precedence. That means there is an implicit "if there is no write happening" condition.
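The precedence comes from a general Verilog rule: when several nonblocking assignments to the same register execute in one pass through an always block, the last one executed determines the value. A minimal sketch with hypothetical names:

```verilog
// Hypothetical sketch: the later nonblocking assignment wins.
module nba_priority_sketch (
    input  wire       clk,
    input  wire       capture,    // stands in for the "stall" branch
    input  wire       snoop_hit,  // stands in for the write-index match
    input  wire [7:0] live,
    input  wire [7:0] wb,
    output reg  [7:0] q
);
    always @(posedge clk) begin
        if (capture)
            q <= live;
        if (snoop_hit)
            q <= wb;   // executed later, so this wins when both fire
    end
endmodule
```

If `capture` and `snoop_hit` are both high on the same edge, `q` takes `wb`, which is the behaviour the side note describes for `reg_data_buf_0` and `reg_data_buf_1`.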