I have written a simulation module for a direct-mapped cache (consisting of data, tag, and valid RAMs and a cache controller) in Verilog. I now want to implement a processor core/driver (also in Verilog) that issues read, write, and other instructions to the cache module. I'd be grateful for any help implementing this processor module, as I have no idea where or how to begin.
Implementing Processor Core for Cache Module in Verilog
cache, computer-architecture, memory, tag, verilog
Related Solutions
Your code simulates two multiplexers. These are actually asynchronous (purely combinational) components. The fact that Verilog requires data1_temp and data2_temp to be declared as reg is a quirk of Verilog syntax and your choice of coding style, and doesn't mean these signals would be the outputs of storage elements in a physical implementation.
If you want to capture these values in actual registers, you need to add those explicitly:
    reg [7:0] data1, data2;
    always @(posedge someclock) begin
        data1 <= data1_tmp;
        data2 <= data2_tmp;
    end
But I would like to know what this mini register file would be made of in hardware. Particularly, the 4x8 bit array consisting of k0,k1,k2,k3.
You haven't shown how these variables are assigned, so it's not possible to say how they are implemented. As your code showed, just declaring them as reg doesn't guarantee they are implemented with actual storage elements. If you assign them inside a block that begins always @(posedge clk) then very likely they are flip-flops, but there are ways you could code them that would make them synthesize differently.
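To make the distinction concrete, here is a minimal sketch showing two signals that are both declared reg but would typically synthesize differently. The module and port names (reg_styles, mux_out, mux_reg) are made up for illustration and are not taken from the asker's code.

```verilog
// Hypothetical example: names are illustrative only.
module reg_styles (
    input  wire       clk,
    input  wire       sel,
    input  wire [7:0] a,
    input  wire [7:0] b,
    output reg  [7:0] mux_out,   // declared reg, but purely combinational
    output reg  [7:0] mux_reg    // declared reg, and actually clocked storage
);
    // Combinational block: every path assigns mux_out, so a synthesis
    // tool would normally build a multiplexer with no storage element.
    always @(*) begin
        if (sel)
            mux_out = b;
        else
            mux_out = a;
    end

    // Clocked block: mux_reg changes only on the rising edge of clk,
    // so a synthesis tool would normally infer eight flip-flops.
    always @(posedge clk) begin
        mux_reg <= mux_out;
    end
endmodule
```

Note that if the combinational block left mux_out unassigned on some path, most tools would infer a latch instead, which is one of the ways the same declaration can end up as different hardware.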
I thought when it came to registers and arrays like this, you need a clock to read out data, like RAM?
You need a clock to update a (physical) register. You can read it out at any time. For example:
wire [8:0] sum;
assign sum = k0 + k1;
is perfectly valid code. sum will change whenever any of its inputs changes. If k0 and k1 are the outputs of flip-flops, their values will only change when there is a clock edge.
For another example, you could equally well describe your multiplexers with code like this:
reg [7:0] k0, k1, k2, k3;
wire [7:0] data1_tmp;
reg [1:0] reg1;
// k<n> and reg1 are assigned elsewhere.
assign data1_tmp = (reg1 == 0) ? k0 :
                   (reg1 == 1) ? k1 :
                   (reg1 == 2) ? k2 : k3;
how do I read from this tag_array and do the comparison all within the same clock cycle?
Let me repeat a key point for emphasis: You need to use a clock to assign a new value to a register (an actual hardware register or group of flip-flops). Its output is available at any time.
RAMs are different and how you access the contents of a RAM will depend on details of the type of RAM you use.
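As a rough sketch of the register case, the following assumes the tag array is built from flip-flops (a small reg array) rather than from a synchronous-read RAM block. The module name, port names, and parameter values are hypothetical. Because the array is read with a continuous assignment, the lookup and the comparison both settle within the same clock cycle in which index and tag_in are presented.

```verilog
// Hypothetical tag store for a direct-mapped cache, assuming a
// flip-flop based array with asynchronous (combinational) read.
module tag_check #(
    parameter INDEX_BITS = 4,
    parameter TAG_BITS   = 8
) (
    input  wire                  clk,
    input  wire                  we,       // update the tag at 'index'
    input  wire [INDEX_BITS-1:0] index,
    input  wire [TAG_BITS-1:0]   tag_in,
    output wire                  hit
);
    reg [TAG_BITS-1:0]        tag_array [0:(1<<INDEX_BITS)-1];
    reg [(1<<INDEX_BITS)-1:0] valid;

    // Simulation-only initialisation; real hardware would use a reset.
    initial valid = 0;

    // Writing needs a clock edge.
    always @(posedge clk) begin
        if (we) begin
            tag_array[index] <= tag_in;
            valid[index]     <= 1'b1;
        end
    end

    // Reading and comparing need no clock: hit follows index and tag_in
    // after the combinational delay of the read mux and comparator.
    assign hit = valid[index] && (tag_array[index] == tag_in);
endmodule
```

If the tags instead live in a RAM with a registered (synchronous) read port, the stored tag only appears the cycle after the address is applied, so the comparison would have to happen one cycle later or the RAM would have to be addressed a cycle early.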
I got confused because frankly I don't know enough about memory hardware and how that's possible.
Another key strategy: When you are learning digital logic, I recommend you learn about the physical hardware first, and then work out or study how to simulate it in HDL second. So first, learn what a physical flip-flop is, then learn the standard Verilog methods of describing a flip-flop. Especially if you are trying to write HDL for synthesis, trying to write good code before you learn the capabilities of the underlying hardware will lead you down a lot of dead-end paths.
(I do not know any HDL, but I hope the following will be helpful anyway.)
One can use a 32-bit wide interface and implement atomic 64-bit loads/stores. For loads one can "cheat" by reading from the invalidated cache entry (only checking the tags on the first 32-bit load), since one knows that the two 32-bit accesses will be back-to-back and within the same cache block that is known to be a hit.
For stores, since the cache block must be in modified (or exclusive if silent updates are allowed) state to accept a store, an invalidate request (really read-for-ownership) generates a data response. Since a data response is provided and the total time of the write would typically only be two processor cycles, the data response could be delayed until the store has completed.
LDSTUB (load-and-store-unsigned-byte) and SWAP could be handled somewhat similarly to a 64-bit store by delaying the load until the cache block is in exclusive/modified state; the store part of the operation is known to be immediately after the read portion and a data response is required anyway, so the data response can be delayed slightly.
An alternative implementation of LDSTUB and SWAP could treat an invalidation between the load and the store as a miss for the load, effectively reissuing the load. However, this presents a danger of livelock. While livelock issues can be managed (e.g., various back-off techniques), the earlier mentioned implementation is probably much simpler.
Best Answer
Visit and join over at OpenCores. There is a section there of various types of CPU cores that can be plugged into an FPGA design.
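If a full CPU core is more than you need for simulation, a testbench-style driver built from tasks may be a simpler starting point. The sketch below is only an illustration: the interface (req, we, addr, wdata, rdata, ready) is an assumed handshake, not the asker's actual cache ports, and cache_stub is a stand-in memory included only so the example runs on its own.

```verilog
`timescale 1ns/1ps

// Stand-in for the real cache (assumed interface): a simple memory
// that answers every request one cycle later by raising 'ready'.
module cache_stub (
    input  wire        clk,
    input  wire        req,
    input  wire        we,
    input  wire [7:0]  addr,
    input  wire [31:0] wdata,
    output reg  [31:0] rdata,
    output reg         ready
);
    reg [31:0] mem [0:255];
    always @(posedge clk) begin
        ready <= req;
        if (req && we)  mem[addr] <= wdata;
        if (req && !we) rdata     <= mem[addr];
    end
endmodule

// Hypothetical "processor" driver: issues reads and writes via tasks.
module cpu_driver_tb;
    reg         clk   = 0;
    reg         req   = 0;
    reg         we    = 0;
    reg  [7:0]  addr  = 0;
    reg  [31:0] wdata = 0;
    wire [31:0] rdata;
    wire        ready;

    cache_stub dut (.clk(clk), .req(req), .we(we), .addr(addr),
                    .wdata(wdata), .rdata(rdata), .ready(ready));

    always #5 clk = ~clk;

    // Drive a write request on a falling edge, then wait for 'ready'.
    task cache_write(input [7:0] a, input [31:0] d);
        begin
            @(negedge clk);
            req = 1; we = 1; addr = a; wdata = d;
            @(negedge clk);
            while (!ready) @(negedge clk);
            req = 0; we = 0;
        end
    endtask

    // Drive a read request, wait for 'ready', then capture the data.
    task cache_read(input [7:0] a, output [31:0] d);
        begin
            @(negedge clk);
            req = 1; we = 0; addr = a;
            @(negedge clk);
            while (!ready) @(negedge clk);
            d = rdata;
            req = 0;
        end
    endtask

    reg [31:0] value;
    initial begin
        cache_write(8'h10, 32'hDEADBEEF);
        cache_read (8'h10, value);
        if (value !== 32'hDEADBEEF) $display("MISMATCH: %h", value);
        else                        $display("read back %h", value);
        $finish;
    end
endmodule
```

To use this with a real cache, replace cache_stub with the actual module and adapt the two tasks to its handshake. Once the cache behaves correctly under scripted accesses, the same tasks can later be replaced by a small state machine or by a core such as those found on OpenCores.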