Electronic – How to improve timing on this design using so much BlockRAM

artix-series-fpga fpga verilog xilinx

I'm building a small RISC CPU on an Artix-7 FPGA (Arty A7-100 board, xc7a100tcsg324-1). It runs fine, and most instructions take three clocks (non-pipelined), but because of very large net delays it can only run at 100/50 MHz, with the 100 MHz clock driving the RAM, which governs the speed of the other signals. Most of the lowest-slack paths are those connecting to the RAM. I used the Clocking Wizard to create a dual-output clocking IP.

The problem is that I'm using almost all the BlockRAM on the device, so my signal fanout looks like this (e.g. this is the signal fanout for ram_dinb to the BlockRAM, other connections are similar):

[Figure: fanout for ram_dinb signals]

At 100 MHz, the timing requirement is of course 10.0 ns. My net delay here is 8.907 ns, which seems absurdly high. But when I highlight and zoom in on these paths, they meander all over the place, so it's no wonder the net delay is so high.

So I studied a bit and even tried using the many horizontal BUFHCEs available, but this immediately failed because I used the Block Memory Generator to create this 256K-word RAM; it looks monolithic, even though it spans the entire die. For example, if I wanted to use a BUFHCE on a single BlockRAM primitive in region X1Y2, I might write something like:

wire clk_ram_12;    // region X1Y2
BUFHCE #( .CE_TYPE("SYNC"), .INIT_OUT(0) )
BUFHCE_ram_12 (
  .O(clk_ram_12),
  .CE(1'b1),
  .I(clk_ram) // from the MMCM 
);

Except that the Block Memory Generator interface only allows for two clock connections at the module level, not to mention all the other signals like data, address, etc.

RAM256 ram (
    .clka(clk_ram),
    .wea(2'b00), // Never write to port A because it's connected to the program counter.
    .addra(pc),
    .dina(18'b0), // Never write to port A because it's connected to the program counter.
    .douta(ram_douta),
    .clkb(clk_ram),
    .web(ram_web),
    .addrb(ram_addrb),
    .dinb(ram_dinb),
    .doutb(ram_doutb)
);

I feel like I'm missing something fundamental here. Is there any way I can speed this up? Or am I hopelessly painted into a corner with this BlockRAM? Even if I create a clock tree using BUFHCE and BUFG primitives somehow, it still doesn't solve my problems for other signals, like data, address, write-enable, etc. I've been reading Xilinx user guides for two days now with no promising ideas.

Best Answer

Your specification says a "256K-word RAM", and from the code you appear to be using 18 bits per word. That makes it a very large memory. An Artix-7 BRAM is 36 Kbit, so it holds 2K words of 18 bits. A 256K-word memory therefore needs 128 BRAMs connected together.

The problem is not so much then the number of BRAMs, but the additional multiplexer logic required to connect the data outputs. You are essentially requesting 18 (one per bit) instances of a 128:1 multiplexer to select which BRAM output should drive your data bus. You in fact then need two copies of this structure as it's a dual-port RAM (one for each read port). This is going to be physically large.

If we assume a 6-input LUT (fine for Artix-7), which is enough to make a 4:1 multiplexer (4 data, 2 address), then for each bit you will need a tree of 32 LUTs (128:32), followed by 8 LUTs (32:8), followed by two LUTs (8:2), and then a final LUT (2:1). You'll have 36 of these trees (two 18-bit ports).
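Totalling the stages above gives a rough LUT budget for the read multiplexers alone. This is only the arithmetic implied by the 4:1-per-LUT assumption; it ignores address decoding and any dedicated multiplexer resources the tools might use instead, so real synthesis results will differ:

```latex
\begin{align*}
\text{LUTs per data bit} &= 32 + 8 + 2 + 1 = 43 \\
\text{LUTs for both 18-bit ports} &= 43 \times 18 \times 2 = 1548
\end{align*}
```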

Chaining four levels deep of lookup tables without pipelining will hurt your FMax. In fact it will be hurt even more as BRAMs typically have a not insignificant delay themselves due to the array decoding logic within them. Furthermore the chain is likely to be longer than this as you'll also need the associated address decoding logic, which means the path of the address signal to the output of the multiplexer will be even longer.

What you are going to need to achieve your desired Fmax is some pipelining stages. I am not familiar with the Xilinx block RAM design tools, but they may have an option to specify additional levels of pipelining, which, if the core is well designed, should be placed nicely within the external decoder logic.
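As a rough illustration of what such pipelining could look like if done by hand, here is a sketch of a two-stage registered read multiplexer for one 18-bit port. The module name, the port names, and the 16:1-then-8:1 split are all assumptions made for illustration, not something the Block Memory Generator is known to produce. It also assumes the 128 bank outputs are packed into one flat vector with bank 0 in the low bits.

```verilog
// Hypothetical two-stage pipelined 128:1 read mux for one 18-bit port.
// Names and structure are illustrative assumptions, not vendor IP.
module pipelined_read_mux #(
    parameter WIDTH = 18
) (
    input  wire                 clk,
    input  wire [6:0]           bank_sel,   // upper address bits, already aligned with bank_dout
    input  wire [128*WIDTH-1:0] bank_dout,  // outputs of 128 banks, bank 0 in the low bits
    output reg  [WIDTH-1:0]     dout        // valid two clocks after bank_sel/bank_dout
);
    reg [WIDTH-1:0] stage1 [0:7];  // one partial result per group of 16 banks
    reg [2:0]       sel_hi_q;      // group select, delayed to match stage1
    integer g;

    always @(posedge clk) begin
        // Stage 1: eight parallel 16:1 muxes, each picking one bank in its group.
        for (g = 0; g < 8; g = g + 1)
            stage1[g] <= bank_dout[(g*16 + bank_sel[3:0])*WIDTH +: WIDTH];
        sel_hi_q <= bank_sel[6:4];

        // Stage 2: 8:1 mux over the registered partial results.
        dout <= stage1[sel_hi_q];
    end
endmodule
```

Each stage is roughly two LUT levels deep instead of four, at the cost of two extra clocks of read latency that the CPU's state machine would have to absorb. The bank-select bits also need to be delayed to match the BRAM read latency before they reach this module.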

Failing this, you could design a RAM with, say, 72-bit-wide ports, register its output, and then add your own external 4:1 multiplexer after that pipeline stage to bring the data back down to 18 bits. I've used this technique with Altera tools before to improve the Fmax of very large memories.
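A minimal sketch of the read side of that idea follows. The module and signal names are hypothetical, and it assumes the wide memory has a one-clock read latency, that the two lowest word-address bits choose the 18-bit lane, and that lane 0 sits in the low bits of the 72-bit word. The write path, which would need per-lane write enables, is not shown.

```verilog
// Hypothetical read-side wrapper around a 72-bit-wide memory:
// register the wide word, then pick one 18-bit lane with a 4:1 mux.
module wide_read_lane_select (
    input  wire        clk,
    input  wire [1:0]  lane_sel,   // addr[1:0], presented together with the read address
    input  wire [71:0] wide_dout,  // memory output, assumed valid one clock after the address
    output wire [17:0] dout        // valid two clocks after lane_sel
);
    reg [71:0] wide_q;   // pipeline register on the wide data
    reg [1:0]  lane_q;   // lane select aligned with wide_dout
    reg [1:0]  lane_qq;  // lane select aligned with wide_q

    always @(posedge clk) begin
        lane_q  <= lane_sel;
        lane_qq <= lane_q;
        wide_q  <= wide_dout;
    end

    // External 4:1 multiplexer, placed after the pipeline stage.
    assign dout = wide_q[lane_qq*18 +: 18];
endmodule
```

The memory itself would then be addressed with the remaining upper address bits, so it is four times shallower and its internal output multiplexing is correspondingly smaller.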