Electronic – Is this BRAM being fully utilized if I use a different data width

bram fpga memory ram xilinx

Background

I am using a Xilinx FPGA from the Kintex-7 family. The documentation for the memory resources can be found here.

Here are some important excerpts from the document (referencing pages 11 and 12):

The block RAM in Xilinx® 7 series FPGAs stores up to 36 Kbits of data and can be configured as either two independent 18 Kb RAMs, or one 36 Kb RAM. Each 36 Kb block RAM can be configured as a 64K x 1 (when cascaded with an adjacent 36 Kb block RAM), 32K x 1, 16K x 2, 8K x 4, 4K x 9, 2K x 18, 1K x 36, or 512 x 72 in simple dual-port mode. Each 18 Kb block RAM can be configured as a 16K x 1, 8K x 2, 4K x 4, 2K x 9, 1K x 18 or 512 x 36 in simple dual-port mode.

and

Each 36Kb block RAM can be set to simple dual-port (SDP) mode, doubling data width of the block RAM to 72 bits. The 18Kb block RAM can also be set to simple dual-port mode, doubling data width to 36 bits. Simple dual-port mode is defined as having one read-only port and one write-only port with independent clocks.

From this, I interpret that the data width of the BRAM is in multiples of 18. So it is 36 with dual port, or it could be 72 if you double that.

Question

In my implementation I only need a data width of 64 bits. What happens to the extra 8 bits? Does it mean that I will not be utilizing the entire BRAM block?

Here is a graphic to help understand the way that I'm thinking:

Best Answer

Correct, the remaining bits are unused.

This is something you just have to accept in FPGAs: you are never going to use all of the resources. It's the price you pay for configurability.

On the plus side, if at a later date you decide to add something like parity information or just make the data bus a little wider, you can do that essentially for free as there is space to add 8 more bits of width using the same memory.
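To make the width arithmetic concrete, here is a minimal sketch of a 512 x 64 simple dual-port memory. The entity name sdp_ram_512x64 and all port names are hypothetical, and a single clock is used for brevity even though SDP mode allows independent clocks. A memory of this shape holds 512 x 64 = 32 Kbits, so if a tool places it in one 36 Kb block RAM in the 512 x 72 configuration, 64 of the 72 bits per word (8/9) are used and 8 bits per word stay idle. Whether a given synthesis tool actually maps it this way is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 512 x 64 simple dual-port RAM: one write port, one read port.
entity sdp_ram_512x64 is
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  unsigned(8 downto 0);            -- 2**9 = 512 words
    din   : in  std_logic_vector(63 downto 0);   -- 64 of the 72 available bits
    raddr : in  unsigned(8 downto 0);
    dout  : out std_logic_vector(63 downto 0)
  );
end entity;

architecture rtl of sdp_ram_512x64 is
  type mem_t is array (0 to 511) of std_logic_vector(63 downto 0);
  signal mem : mem_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- write port
      if we = '1' then
        mem(to_integer(waddr)) <= din;
      end if;
      -- read port with a registered (synchronous) output
      dout <= mem(to_integer(raddr));
    end if;
  end process;
end architecture;
```

Widening din and dout to 72 bits later (for example to carry parity) would keep the same depth, so it should still fit the same 512 x 72 configuration.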

Problem with Design #1

I have noticed that you must describe the two ports in two separate processes for XST to infer dual-port RAM; if you don't, you won't get the two ports. Using separate processes is also how Xilinx suggests inferring dual-port RAM in the XST User Guide. Hence your Design #1 will only infer single-port RAM.

You can see my general VHDL for inferring dual-port RAM with XST at the bottom of this post. (Details: http://www.fpga-dev.com/infering-dual-port-blockram-with-xst/)

Problem with Design #2

In your Design #2, you register the address twice, probably unintentionally. Signal assignments made with <= do not take effect immediately; the new value only becomes visible after the process suspends. This code is equivalent to yours, only with simpler signal names:

-- sequential context (A, B, C are signals):
if rising_edge(clk) then
  B <= A;
  C <= B;
end if;

Here C <= B; does not assign to C the value that was assigned to B on the previous line, since that assignment only takes effect once the process suspends; C gets the old value of B instead. If the signals are bits and the stimulus is a pulse on A, this would be the result of the above code:

clk _|"|_|"|_|"|_|"|_|"|_|"|
A   ______|"""|_____________
B   __________|"""|_________
C   ______________|"""|_____

Declaring B as a variable instead and assigning with := will assign immediately:

-- sequential context (A, C are signals; B is variable):
if rising_edge(clk) then
  B := A;
  C <= B;
end if;

yielding

clk _|"|_|"|_|"|_|"|_|"|_|"|
A   ______|"""|_____________
B   __________|"""|_________
C   __________|"""|_________
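The two waveforms can be checked in simulation. The following is a minimal self-checking testbench sketch that runs both versions side by side; the entity name sig_vs_var_tb and the signal names b_sig, c_sig, c_var and done are made up for this illustration, and the 10 ns clock period is arbitrary.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity sig_vs_var_tb is
end entity;

architecture sim of sig_vs_var_tb is
  signal clk   : std_logic := '0';
  signal done  : boolean   := false;
  signal a     : std_logic := '0';
  signal b_sig : std_logic := '0';
  signal c_sig : std_logic := '0';  -- driven from signal B
  signal c_var : std_logic := '0';  -- driven from variable B
begin
  -- 10 ns clock, rising edges at 5, 15, 25 ns; stops once done is true
  clk <= not clk after 5 ns when not done else '0';

  -- Version 1: B is a signal, so C sees the old value of B (two registers)
  process (clk)
  begin
    if rising_edge(clk) then
      b_sig <= a;
      c_sig <= b_sig;
    end if;
  end process;

  -- Version 2: B is a variable, so C sees the new value of B (one register)
  process (clk)
    variable b_var : std_logic := '0';
  begin
    if rising_edge(clk) then
      b_var := a;
      c_var <= b_var;
    end if;
  end process;

  stimulus : process
  begin
    wait until falling_edge(clk);
    a <= '1';                       -- start of the pulse on A
    wait until falling_edge(clk);   -- one rising edge has sampled A = '1'
    assert b_sig = '1' and c_var = '1' and c_sig = '0'
      report "after 1 edge: expected B = 1, C(var) = 1, C(sig) = 0"
      severity failure;
    a <= '0';                       -- end of the pulse on A
    wait until falling_edge(clk);   -- second rising edge has passed
    assert b_sig = '0' and c_var = '0' and c_sig = '1'
      report "after 2 edges: expected B = 0, C(var) = 0, C(sig) = 1"
      severity failure;
    report "signal version delays C by one extra cycle" severity note;
    done <= true;
    wait;
  end process;
end architecture;
```

In the signal version the pulse reaches C one clock later than in the variable version, which is the extra address register described above.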

Inferring dual-port block RAM with XST

(More details on this at http://www.fpga-dev.com/infering-dual-port-blockram-with-xst/.)

Below is my parameterized module for generic dual-port RAM. It will successfully infer dual-port RAM, as desired, with XST.

(Remove the write-enable signals and the write logic to get ROM instead of RAM.)

Specify the width and depth with the width and highAddr generics (highAddr is one less than the desired depth).

library IEEE;
use IEEE.STD_LOGIC_1164.all;

entity genRAM is
  generic(
    width     : integer;
    highAddr  : integer -- highest address (= size-1)
  );
  port(
    -- Two sets of ports (A and B), each set having ports Address, Data in,
    -- Data out and Write enable:
    Aaddr     : in  integer range 0 to highAddr        := 0;
    ADI       : in  std_logic_vector(width-1 downto 0) := (others => '0');
    ADO       : out std_logic_vector(width-1 downto 0) := (others => '0');
    AWE       : in  std_logic                          := '0';
    Baddr     : in  integer range 0 to highAddr        := 0;
    BDI       : in  std_logic_vector(width-1 downto 0) := (others => '0');
    BDO       : out std_logic_vector(width-1 downto 0) := (others => '0');
    BWE       : in  std_logic                          := '0';
    clk       : in  std_logic
  );
end genRAM;

architecture arch of genRAM is
  subtype TmemWord is bit_vector(width-1 downto 0);
  type    Tmem     is array(0 to highAddr) of TmemWord;
  shared variable memory: Tmem;
begin

  process(clk) is
  begin
    if (rising_edge(clk)) then
      ADO <= To_StdLogicVector(memory(Aaddr));
      if (AWE = '1') then
        memory(Aaddr) := To_bitvector(std_logic_vector(ADI));
      end if;
    end if;
  end process;

  process(clk) is
  begin
    if (rising_edge(clk)) then    
      BDO <= To_StdLogicVector(memory(Baddr));
      if (BWE = '1') then
        memory(Baddr) := To_bitvector(std_logic_vector(BDI));
      end if;
    end if;
  end process;
end arch;

The code above implements read-first behavior. That means that if address 0x00 contains 0xcafe and you write 0xbabe to 0x00, the cycle after the write will display 0xcafe on the data-out port ("data is read to output port before being written to memory").

If you desire write-first behaviour, swap the order of the read and the write in both processes. Below is how it would look for port A:

-- excerpt for write-first behaviour:
if (AWE = '1') then
  memory(Aaddr) := To_bitvector(std_logic_vector(ADI));
end if;
ADO <= To_StdLogicVector(memory(Aaddr));

In the above case, data-out would display 0xbabe one cycle after the write ("data is written to memory before reading memory contents to output port").
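The read-first example with 0xcafe and 0xbabe can be reproduced with a small self-checking testbench. This is only a sketch: it assumes the genRAM entity listed above is compiled into the work library, that the simulator runs in VHDL-93 mode (later revisions of the standard require shared variables to be of a protected type), and the testbench name genRAM_tb, the 16 x 16 size and the 10 ns clock are arbitrary choices for illustration. Port B is left unconnected, relying on the default values of its inputs.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity genRAM_tb is
end entity;

architecture sim of genRAM_tb is
  constant OLD_WORD : std_logic_vector(15 downto 0) := x"cafe";
  constant NEW_WORD : std_logic_vector(15 downto 0) := x"babe";
  signal clk   : std_logic := '0';
  signal done  : boolean   := false;
  signal Aaddr : integer range 0 to 15 := 0;
  signal ADI   : std_logic_vector(15 downto 0) := (others => '0');
  signal ADO   : std_logic_vector(15 downto 0);
  signal AWE   : std_logic := '0';
begin
  -- 10 ns clock that stops once the stimulus process is finished
  clk <= not clk after 5 ns when not done else '0';

  dut : entity work.genRAM
    generic map (width => 16, highAddr => 15)
    port map (Aaddr => Aaddr, ADI => ADI, ADO => ADO, AWE => AWE, clk => clk);

  stimulus : process
  begin
    wait until falling_edge(clk);
    Aaddr <= 0;
    ADI   <= OLD_WORD;             -- first write: 0xcafe to address 0
    AWE   <= '1';
    wait until falling_edge(clk);
    ADI   <= NEW_WORD;             -- second write: 0xbabe to address 0
    wait until falling_edge(clk);
    -- read-first: the cycle after the 0xbabe write still shows 0xcafe
    assert ADO = OLD_WORD
      report "expected old data on ADO (read-first)" severity failure;
    AWE   <= '0';
    wait until falling_edge(clk);
    -- one plain read later, the new contents appear
    assert ADO = NEW_WORD
      report "expected new data on ADO" severity failure;
    report "read-first behaviour confirmed" severity note;
    done <= true;
    wait;
  end process;
end architecture;
```

With the write-first ordering, the first assertion would be expected to fail, because ADO would already show 0xbabe in the cycle after the write.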

Electronic – Inferring BRAM with unused addresses efficiently

Actually, using 8 BRAMs in an 8K×1 configuration, rather than 5 BRAMs in a 1K×8 configuration, is more efficient in several important ways.

With the 8 BRAMs, you can simply connect all of the address and control lines to all of the BRAMs, and one bit from the data input and data output buses to each of the BRAMs. No other logic is required at all.

On the other hand, with the 5-BRAM configuration, you'll need extra logic to decode the upper 3 address bits to enable one BRAM at a time, and you'll also need a 5:1 multiplexer on the data output bus to select the data from the correct BRAM when reading. This uses extra resources within the FPGA, and it also adversely affects the timing, reducing the maximum clock frequency you can use.

If you really need to use the BRAM capacity as efficiently as possible, and you don't care about the timing and resource issues, then you'll have to explicitly code your memory as a module that uses five 1K×8 memories internally.
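As a rough illustration of such a module, the sketch below builds a 5K x 8 memory from five 1K x 8 banks, with the address decode and the 5:1 output multiplexer written out explicitly. The entity name ram_5k_x8, the port names and the 13-bit address are assumptions made for this example, and whether each bank ends up in exactly one block RAM depends on the synthesis tool and the target device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 5K x 8 single-port RAM built from five 1K x 8 banks.
entity ram_5k_x8 is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(12 downto 0);  -- only 0 .. 5119 are backed by memory
    din  : in  std_logic_vector(7 downto 0);
    dout : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of ram_5k_x8 is
  constant BANKS : natural := 5;
  type bank_t      is array (0 to 1023) of std_logic_vector(7 downto 0);
  type bank_dout_t is array (0 to BANKS-1) of std_logic_vector(7 downto 0);
  signal bank_dout  : bank_dout_t;
  signal bank_sel   : unsigned(2 downto 0);
  signal bank_sel_q : unsigned(2 downto 0) := (others => '0');
begin
  -- The upper 3 address bits select the bank, the lower 10 bits the word
  bank_sel <= addr(12 downto 10);

  gen_banks : for i in 0 to BANKS-1 generate
    signal mem : bank_t := (others => (others => '0'));
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        -- address decode: only the selected bank is written
        if we = '1' and to_integer(bank_sel) = i then
          mem(to_integer(addr(9 downto 0))) <= din;
        end if;
        -- every bank reads; the multiplexer below picks one result
        bank_dout(i) <= mem(to_integer(addr(9 downto 0)));
      end if;
    end process;
  end generate;

  -- Delay the bank select by one cycle to line up with the registered reads
  process (clk)
  begin
    if rising_edge(clk) then
      bank_sel_q <= bank_sel;
    end if;
  end process;

  -- 5:1 output multiplexer; unmapped addresses (banks 5 to 7) read as zero
  dout <= bank_dout(to_integer(bank_sel_q)) when bank_sel_q < BANKS
          else (others => '0');
end architecture;
```

The decode comparison and the output multiplexer are exactly the extra logic, and the extra delay on the read path, that the 8-BRAM arrangement avoids.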