Problem with Design #1
I have noticed that XST only infers dual-port RAM when the two ports are described in two separate processes; if they are not, you won't get the two ports. Separate processes are also what Xilinx suggests for inferring dual-port RAM in the XST User Guide. Hence your Design #1 will only infer single-port RAM.
You can see my general VHDL for inferring dual-port RAM with XST at the bottom of this post. (Details: http://www.fpga-dev.com/infering-dual-port-blockram-with-xst/)
Problem with Design #2
In your Design #2, you register the address twice, probably unintentionally. Signal assignments made with <= take effect at the end of the process, not immediately. This code is equivalent to yours, only with simpler signal names:
-- sequential context (A, B, C are signals):
if rising_edge(clk) then
B <= A;
C <= B;
end if;
Here C <= B; will not assign to C what was assigned to B on the previous line, since that assignment only takes effect at the end of the process. C therefore gets the old value of B, and two registers are inferred. If the signals are bits and the stimulus is a pulse on A, this would be the result of the above code:
clk _|"|_|"|_|"|_|"|_|"|_|"|
A ______|"""|_____________
B __________|"""|_________
C ______________|"""|_____
Declaring B as a variable instead and assigning with := will assign immediately:
-- sequential context (A, C are signals; B is variable):
if rising_edge(clk) then
B := A;
C <= B;
end if;
yielding
clk _|"|_|"|_|"|_|"|_|"|_|"|
A ______|"""|_____________
B __________|"""|_________
C __________|"""|_________
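To try the two variants side by side in a simulator, something like the following minimal sketch could be used. The entity name assign_demo and the names B_sig, B_var, C_sig and C_var are made up for illustration and are not from your design.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

-- Hypothetical demo entity: compares signal and variable assignment.
entity assign_demo is
  port(
    clk   : in  std_logic;
    A     : in  std_logic;
    C_sig : out std_logic;  -- A delayed by two clock edges
    C_var : out std_logic   -- A delayed by one clock edge
  );
end assign_demo;

architecture rtl of assign_demo is
  signal B_sig : std_logic := '0';
begin
  -- B is a signal: C_sig reads the old value of B_sig,
  -- so two registers in series are described.
  with_signal : process(clk) is
  begin
    if rising_edge(clk) then
      B_sig <= A;
      C_sig <= B_sig;
    end if;
  end process;

  -- B is a variable: C_var sees the value assigned on the line before,
  -- so only one register is described.
  with_variable : process(clk) is
    variable B_var : std_logic;
  begin
    if rising_edge(clk) then
      B_var := A;
      C_var <= B_var;
    end if;
  end process;
end rtl;
```

Driving A with a one-cycle pulse should show C_var following one clock edge later and C_sig two clock edges later, matching the two diagrams above.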
Inferring dual-port BlockRam with XST
(More details on this at http://www.fpga-dev.com/infering-dual-port-blockram-with-xst/.)
Below is my parameterized module for generic dual-port RAM. It will successfully infer dual-port RAM, as desired, with XST.
(Remove the write enable-signals and write logic to get ROM instead of RAM.)
Specify width and depth with the width and highAddr (one less than desired depth) generics.
library IEEE;
use IEEE.STD_LOGIC_1164.all;
entity genRAM is
generic(
width : integer;
highAddr : integer -- highest address (= size-1)
);
port(
-- Two sets of ports (A and B), each set having ports Address, Data in,
-- Data out and Write enable:
Aaddr : in integer range 0 to highAddr := 0;
ADI : in std_logic_vector(width-1 downto 0) := (others => '0');
ADO : out std_logic_vector(width-1 downto 0) := (others => '0');
AWE : in std_logic := '0';
Baddr : in integer range 0 to highAddr := 0;
BDI : in std_logic_vector(width-1 downto 0) := (others => '0');
BDO : out std_logic_vector(width-1 downto 0) := (others => '0');
BWE : in std_logic := '0';
clk : in std_logic
);
end genRAM;
architecture arch of genRAM is
subtype TmemWord is bit_vector(width-1 downto 0);
type Tmem is array(0 to highAddr) of TmemWord;
shared variable memory: Tmem;
begin
process(clk) is
begin
if (rising_edge(clk)) then
ADO <= To_StdLogicVector(memory(Aaddr));
if (AWE = '1') then
memory(Aaddr) := To_bitvector(std_logic_vector(ADI));
end if;
end if;
end process;
process(clk) is
begin
if (rising_edge(clk)) then
BDO <= To_StdLogicVector(memory(Baddr));
if (BWE = '1') then
memory(Baddr) := To_bitvector(std_logic_vector(BDI));
end if;
end if;
end process;
end arch;
The code above implements read-first behavior. That means that if address 0x00 contains 0xcafe and you write 0xbabe to 0x00, the cycle after the write will display 0xcafe on the data-out port ("data is read to output port before being written to memory").
If you desire write-first behaviour, swap the order of the read and the write in both processes. Below is how it would look for port A:
-- excerpt for write-first behaviour:
if (AWE = '1') then
memory(Aaddr) := To_bitvector(std_logic_vector(ADI));
end if;
ADO <= To_StdLogicVector(memory(Aaddr));
In the above case, data-out would display 0xbabe one cycle after the write ("data is written to memory before reading memory contents to output port").
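For completeness, here is a sketch of how the genRAM entity could be instantiated, assuming it has been compiled into the work library. The wrapper entity ram_user, its port names and the 16 x 1024 size are hypothetical. Port A is used for writing only and port B for reading only; the unused inputs rely on the default values declared in genRAM.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

-- Hypothetical wrapper: a 1024-word, 16-bit RAM with one write port
-- (genRAM port A) and one read port (genRAM port B).
entity ram_user is
  port(
    clk     : in  std_logic;
    wr_addr : in  integer range 0 to 1023;
    wr_data : in  std_logic_vector(15 downto 0);
    wr_en   : in  std_logic;
    rd_addr : in  integer range 0 to 1023;
    rd_data : out std_logic_vector(15 downto 0)
  );
end ram_user;

architecture rtl of ram_user is
begin
  ram : entity work.genRAM
    generic map(
      width    => 16,
      highAddr => 1023   -- depth 1024, so highest address is 1023
    )
    port map(
      clk   => clk,
      -- Port A: write side; its data output is left unconnected.
      Aaddr => wr_addr,
      ADI   => wr_data,
      ADO   => open,
      AWE   => wr_en,
      -- Port B: read side; BDI is left unassociated and takes its
      -- default value, and the write enable is tied low.
      Baddr => rd_addr,
      BDO   => rd_data,
      BWE   => '0'
    );
end rtl;
```

Whether XST maps this particular size onto BlockRAM or distributed RAM depends on the target device and synthesis settings, so check the synthesis report.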
Best Answer
Assuming you need a read cycle on each port on each clock cycle, each BRAM will give you two read ports. Beyond that, you have to replicate the contents of the memory.
Is the bandwidth required at each port less than the raw bandwidth of the BRAM? In that case, you might consider multiplexing the ports. Use a counter that runs at the full speed of the BRAM to drive a multiplexer that scans the address bus for each port, feed these addresses to the BRAM, and then deliver the data (typically 2 clocks later) to the corresponding data bus for each port.
The downside of this approach is that the access latency for each port is now N clocks longer than the non-multiplexed case. There are various ways to deal with this latency, including adding additional pipeline stages to the other data paths.
Note that with a 2-port BRAM inside the module, you can scan two of the external ports at a time.
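The scanning scheme described above could be sketched roughly as follows. This is only an illustration under several assumptions: the package mux_types, the entity muxed_read and all signal names are invented here, the memory is assumed to be a dual-port RAM with registered read output such as a genRAM entity compiled into work, four external read ports share one RAM port, and the second RAM port is used for writes. The external addresses are assumed to stay stable long enough to be sampled.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

-- Hypothetical package: sizes and array types for the sketch below.
package mux_types is
  constant nPorts    : integer := 4;
  constant dataWidth : integer := 8;
  constant lastAddr  : integer := 255;
  type TaddrArray is array(0 to nPorts-1) of integer range 0 to lastAddr;
  type TdataArray is array(0 to nPorts-1) of
    std_logic_vector(dataWidth-1 downto 0);
end mux_types;

library IEEE;
use IEEE.STD_LOGIC_1164.all;
use work.mux_types.all;

entity muxed_read is
  port(
    clk    : in  std_logic;   -- runs at the full BRAM rate
    rdAddr : in  TaddrArray;  -- one address per external read port
    rdData : out TdataArray;  -- one data bus per external read port
    wrAddr : in  integer range 0 to lastAddr;
    wrData : in  std_logic_vector(dataWidth-1 downto 0);
    wrEn   : in  std_logic
  );
end muxed_read;

architecture rtl of muxed_read is
  -- sel is the port being scanned; sel_d1 and sel_d2 are delayed copies
  -- that track the request through the address and data registers.
  signal sel, sel_d1, sel_d2 : integer range 0 to nPorts-1 := 0;
  signal ramAddr : integer range 0 to lastAddr := 0;
  signal ramData : std_logic_vector(dataWidth-1 downto 0);
begin
  scan : process(clk) is
  begin
    if rising_edge(clk) then
      -- Round-robin counter over the external ports.
      if sel = nPorts-1 then
        sel <= 0;
      else
        sel <= sel + 1;
      end if;
      -- Multiplex the selected address onto the RAM address register.
      ramAddr <= rdAddr(sel);
      sel_d1  <= sel;
      sel_d2  <= sel_d1;
      -- ramData now belongs to the port selected two edges earlier.
      rdData(sel_d2) <= ramData;
    end if;
  end process;

  -- Assumed memory: genRAM with port A scanned and port B used for writes.
  ram : entity work.genRAM
    generic map(width => dataWidth, highAddr => lastAddr)
    port map(
      clk   => clk,
      Aaddr => ramAddr,
      ADO   => ramData,
      Baddr => wrAddr,
      BDI   => wrData,
      BWE   => wrEn
    );
end rtl;
```

In this sketch each external port is served once every nPorts clocks, which is where the extra access latency comes from. If no write port is needed, the second RAM port could scan a second group of external ports in the same way.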