Electronic – How to improve timing on this design using so much BlockRAM


I'm building a small RISC CPU on an Artix-7 FPGA (Arty A7-100 board, xc7a100tcsg324-1). It runs fine, and most instructions take three clocks (non-pipelined), but because of very large net delays it can only run at 100/50 MHz: 100 MHz for the RAM, which governs the speed of the other signals. Most of the lowest-slack paths are those connecting to the RAM. I used the Clock Generator Wizard to create a dual-output clocking IP.

The problem is that I'm using almost all the BlockRAM on the device, so my signal fanout looks like this (e.g. this is the signal fanout for ram_dinb to the BlockRAM, other connections are similar):

At 100 MHz, the timing requirement is of course 10.0 ns. My net delay here is 8.907 ns, which seems absurdly high, but when I highlight and zoom in on these paths, they meander all over the place, so it's no wonder the net delay is so high.

So I studied a bit and even tried using the many horizontal BUFHCEs available, but this immediately failed because I used the Block Memory Generator to create this 256K-word RAM, which looks monolithic even though it spans the entire die. For example, if I wanted to clock a single BlockRAM primitive in region X1Y2, I might use something like:

wire clk_ram_12;    // region X1Y2
BUFHCE #( .CE_TYPE("SYNC"), .INIT_OUT(0) )
BUFHCE_ram_12 (
  .O(clk_ram_12),
  .CE(1'b1),
  .I(clk_ram) // from the MMCM 
);

Except that the Block Memory Generator interface only allows for two clock connections at the module level, not to mention all the other signals like data, address, etc.

RAM256 ram (
    .clka(clk_ram),
    .wea(2'b00), // Never write to port A because it's connected to the program counter.
    .addra(pc),
    .dina(18'b0), // Never write to port A because it's connected to the program counter.
    .douta(ram_douta),
    .clkb(clk_ram),
    .web(ram_web),
    .addrb(ram_addrb),
    .dinb(ram_dinb),
    .doutb(ram_doutb)
);

I feel like I'm missing something fundamental here. Is there any way I can speed this up? Or am I hopelessly painted into a corner with this BlockRAM? Even if I create a clock tree using BUFHCE and BUFG primitives somehow, it still doesn't solve my problems for other signals, like data, address, write-enable, etc. I've been reading Xilinx user guides for two days now with no promising ideas.

Best Answer

Your specification says a "256K-word RAM", and from the code you appear to be using 18 bits per word. That makes it a very large memory. An Artix-7 BRAM is 36 Kbit, which holds 2K words of 18 bits. A 256K-word memory therefore requires 128 BRAMs connected together.

The problem, then, is not so much the number of BRAMs as the additional multiplexer logic required to combine their data outputs. You are essentially requesting 18 instances (one per bit) of a 128:1 multiplexer to select which BRAM output drives your data bus. In fact you need two copies of this structure, because it's a dual-port RAM (one for each read port). This is going to be physically large.

If we assume a 6-input LUT (as on the Artix-7), which is enough to make a 4:1 multiplexer (4 data inputs, 2 select inputs), then for each bit you need a tree of 32 LUTs (128:32), followed by 8 LUTs (32:8), followed by 2 LUTs (8:2), and then a final LUT (2:1). That is 43 LUTs per bit, and you'll have 36 of these trees (two 18-bit ports), or roughly 1,500 LUTs in total.

Chaining four levels of lookup tables without pipelining will hurt your Fmax. It is hurt further because BRAMs have a not insignificant delay of their own, due to the array decoding logic within them. The path is also likely to be longer than this, because you need the associated address decoding logic, so the path from the address signal to the multiplexer output will be longer still.

What you are going to need to achieve your desired Fmax is some pipeline stages. I am not familiar with the Xilinx block RAM design tools, but they may have an option to specify additional levels of pipelining, which, if the core is well designed, should be placed sensibly within the external decoder logic.
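As a rough illustration of what such pipelining looks like when written by hand, here is a minimal sketch of a 16:1 output multiplexer split into two 4:1 levels with a register after each level. The module name `pipelined_mux16`, its ports, and the 16-bank size are all hypothetical and chosen only to keep the example short; it is not what the Block Memory Generator produces. It also assumes `sel` has already been delayed to line up with the BRAM read latency.

```verilog
// Hypothetical module (not generated by any Xilinx tool):
// two-stage pipelined 16:1 mux with a register after each 4:1 level.
module pipelined_mux16 #(
    parameter WIDTH = 18
) (
    input  wire                clk,
    input  wire [16*WIDTH-1:0] din,   // 16 bank outputs, bank 0 in the low bits
    input  wire [3:0]          sel,   // bank select, assumed aligned with din
    output reg  [WIDTH-1:0]    dout   // valid two clocks after din/sel
);
    reg [4*WIDTH-1:0] stage1;    // four partial results from level 1
    reg [1:0]         sel_hi_q;  // sel[3:2] delayed to line up with stage1
    integer i;

    always @(posedge clk) begin
        // Level 1: four 4:1 muxes, all steered by sel[1:0]
        for (i = 0; i < 4; i = i + 1)
            stage1[i*WIDTH +: WIDTH] <= din[(4*i + sel[1:0])*WIDTH +: WIDTH];
        sel_hi_q <= sel[3:2];

        // Level 2: one 4:1 mux steered by the delayed sel[3:2]
        dout <= stage1[sel_hi_q*WIDTH +: WIDTH];
    end
endmodule
```

Each register-to-register path now contains a single 4:1 multiplexer level, at the cost of two extra clocks of read latency that the CPU would have to absorb.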

Failing this, you could design a RAM with, say, 72-bit-wide ports, register its output, and then add your own external 4:1 multiplexer after that pipeline stage to bring the data down to 18 bits. I've used this technique with Altera tools before to improve the Fmax of very large memories.
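A minimal sketch of the read side of that idea follows, assuming a 72-bit-wide RAM whose output is available as `ram_dout72` and whose two lowest address bits arrive as `word_sel` in the same cycle. The module and signal names are hypothetical, and the write side (which would need a write enable per 18-bit word) is not shown.

```verilog
// Hypothetical wrapper: narrows a 72-bit read word to 18 bits
// after a pipeline register.
module wide_read_narrow (
    input  wire        clk,
    input  wire [71:0] ram_dout72,  // assumed output of a 72-bit-wide RAM
    input  wire [1:0]  word_sel,    // low two address bits, aligned with ram_dout72
    output reg  [17:0] dout18       // valid two clocks after ram_dout72
);
    reg [71:0] dout72_q;
    reg [1:0]  word_sel_q;

    always @(posedge clk) begin
        // Pipeline stage: breaks the path between the RAM and the mux
        dout72_q   <= ram_dout72;
        word_sel_q <= word_sel;

        // External 4:1 mux operating on registered data
        dout18 <= dout72_q[word_sel_q*18 +: 18];
    end
endmodule
```

Because the RAM is four times wider, it is four times shallower, so the depth multiplexer inside the RAM shrinks and the final 4:1 selection happens on registered data.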

Related Solutions

Synchronizing input and output

I think I'd have answered this interview question the same way you did. I believe the interviewer's requirement "to be done without a FIFO" was because a FIFO buffer is a valid, practical way to solve the problem of multiple clock domains -- but it can be done without the head/tail logic of a complete FIFO in many cases. And in the context of a job interview, simply instantiating a standard module doesn't demonstrate that you understand how to approach FPGA / HDL design. (I've interviewed candidates who couldn't even manage that small task.)

Passing data between different clock domains is usually done with three stages of flip-flops. The first stage is in the source clock domain (clkA), and the second and third stages are in the receiving clock domain (clkB). The setup time of the second-stage flip-flop is sometimes violated because the clocks are asynchronous, so the third-stage flip-flop is used to clean up the timing. Since there is a delay, the Data_valid signal is passed in parallel with the data.

module SyncExample (
    input   wire            clkA,
    input   wire    [7:0]   Data_in,        // in clkA clock domain
    input   wire            Data_valid,     // in clkA clock domain
    input   wire            clkB,
    output  reg     [7:0]   Data_out,       // in clkB clock domain
    output  reg             Data_out_valid  // in clkB clock domain
    );

// First stage pipeline registers the clkA clock domain signals.
// pipeline_1_valid is set by Data_valid and remains set 
// until cleared by pipeline_1_valid_clear acknowledge from clkB domain.
reg [7:0] pipeline_1_data;
reg       pipeline_1_valid;
wire      pipeline_1_valid_clear;
initial begin
    pipeline_1_data <= 0;
    pipeline_1_valid <= 0;
end
always @(posedge clkA) begin
    if (Data_valid) begin
        // capture pipeline_1_data only when Data_in is valid
        pipeline_1_data <= Data_in;
    end
    // keep pipeline_1_valid set after Data_valid, until pipeline_1_valid_clear.
    pipeline_1_valid <= (Data_valid | (pipeline_1_valid & ~pipeline_1_valid_clear));
end

// Second stage pipeline registers the clkB clock domain signals.
// Because clkA and clkB are asynchronous clock domains, 
// setup time cannot be guaranteed for this stage.
// The previous pipeline_1 stage holds its data valid for
// more than one clkA cycle, to help achieve clkB setup requirement.
reg [7:0] pipeline_2_data;
reg       pipeline_2_valid;
initial begin
    pipeline_2_data <= 0;
    pipeline_2_valid <= 0;
end
always @(posedge clkB) begin
    pipeline_2_data <= pipeline_1_data;
    pipeline_2_valid <= pipeline_1_valid;
end

// Third stage pipeline registers the clkB clock domain signals.
initial begin
    Data_out <= 0;
    Data_out_valid <= 0;
end
always @(posedge clkB) begin
    Data_out <= pipeline_2_data;
    Data_out_valid <= pipeline_2_valid;
end

// pipeline_1_valid_clear is a feedback signal indicating that the data-valid
// flag has propagated through all stages.
// For this simple example, we assume Data_out is captured as soon as it is valid.
// A practical application should instead drive this with a read_data_out command.
// Note: this signal is generated in the clkB domain and used in the clkA domain,
// so a practical design should also synchronize it into clkA.
assign pipeline_1_valid_clear = Data_out_valid;

endmodule

You can also find similar example code in Xilinx ISE Language Templates under Verilog | Synthesis Constructs | Coding Examples | Misc | Asynchronous Input Synchronization.

edit: Added pipeline_1_valid_clear signal and set/clear behavior to meet the slower clock domain's minimum pulse width requirement. Capture pipeline_1_data only when Data_in is valid.

Electronic – Why won’t the Xilinx block RAM in a Spartan-3E consistently return data in a single clock cycle

Data written at a rising edge is available directly after that edge only at the same port. The data input is internally forwarded to the data output of the same RAM port. This is called WRITE_FIRST mode.

But it is never forwarded to the output of the other RAM port, regardless of the specified WRITE_MODE. It becomes available for reading (at another rising edge, of course) after the internal write to the memory has completed. In your example that is simply the next rising clock edge, because the internal write time is always shorter than the minimum allowed clock period.

This behavior is described in XAPP 463, Using Block RAM in Spartan-3 Generation FPGAs, in the section Dual-Port RAM Conflicts and Resolution. The example given there uses different clocks, but it also applies when the same clock is used for both ports.

This behavior is still the same in current FPGAs from Xilinx and Altera.

Forwarding to the other RAM port has to be done by your own surrounding logic.
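A minimal sketch of such surrounding logic is shown below, under these assumptions: both ports share one clock, port A writes, port B reads with one cycle of latency, and no optional output register is enabled. The module name `bram_forward` and its ports are hypothetical; `ram_doutb` stands for the raw port B output of the block RAM.

```verilog
// Hypothetical wrapper: forwards a port A write to a port B read of the
// same address, for a RAM with one-cycle synchronous read latency.
module bram_forward #(
    parameter AW = 10,
    parameter DW = 18
) (
    input  wire          clk,
    input  wire          wea,        // port A write enable
    input  wire [AW-1:0] addra,      // port A write address
    input  wire [DW-1:0] dina,       // port A write data
    input  wire [AW-1:0] addrb,      // port B read address
    input  wire [DW-1:0] ram_doutb,  // raw port B output from the block RAM
    output wire [DW-1:0] doutb       // port B output with forwarding applied
);
    reg          hit_q;   // previous cycle's read collided with a write
    reg [DW-1:0] dina_q;  // data written in the previous cycle

    always @(posedge clk) begin
        hit_q  <= wea && (addra == addrb);
        dina_q <= dina;
    end

    // On a collision, return the written data instead of the RAM output
    assign doutb = hit_q ? dina_q : ram_doutb;
endmodule
```

The address comparator and the output multiplexer add logic to the read path, so in a timing-critical design the result may need a further register.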
