Electronic – Processor – L1 Data cache interface

cache, computer-architecture, hdl, interface, processor

Sorry if the following looks like a very specialized (or programming) question, but I'm hoping there are people on this forum who have done VHDL/Verilog modeling, and might be able to answer:

I'm writing a simulation model of a multi-processor cache system. My processor model is a 32-bit SPARC V8 processor. I'm trying to understand what the processor–L1 data cache interface looks like. I have the following doubts:

  1. How wide is the processor-L1 interface? If it is 32 bits wide, then how are doubleword accesses handled atomically? Example: if the DoubleWord instruction is split into two word-accesses, can the block in the cache get invalidated between the first and the second word access? Doesn't it mean the instruction is not atomic? Is the load/store doubleword instruction required to be atomic?

  2. How are atomic load/store or swap instructions implemented on this interface? Is there a signal going from the processor to cache that says "stall all other operations until I say so", and then execute a load followed by store?

I'd be thankful for any links pointing in this direction.

Best Answer

(I do not know any HDL, but I hope the following will be helpful anyway.)

One can use a 32-bit-wide interface and still implement atomic 64-bit loads/stores. For loads, one can "cheat" by reading from the (possibly just-invalidated) cache entry, checking the tags only on the first 32-bit access: the two 32-bit accesses are back-to-back and within the same cache block, which is already known to be a hit.
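A minimal Python sketch of that "cheat" (all class and method names here are illustrative, not from any real design): the first 32-bit access does the full tag/valid check, while the second skips it, so an invalidation landing between the two halves does not tear the 64-bit load.

```python
class CacheLine:
    def __init__(self, data):
        self.valid = True
        self.data = data            # list of 32-bit words in the block

class L1Cache:
    """Toy cache keyed by tag; names are illustrative."""
    def __init__(self):
        self.lines = {}

    def fill(self, tag, data):
        self.lines[tag] = CacheLine(data)

    def invalidate(self, tag):
        if tag in self.lines:
            self.lines[tag].valid = False

    def load64(self, tag, offset, between=None):
        # First 32-bit access: full tag/valid check, as for any load.
        line = self.lines.get(tag)
        if line is None or not line.valid:
            raise LookupError("miss")   # would trigger a refill in hardware
        lo = line.data[offset]
        if between:
            between()   # simulate a snoop invalidation landing mid-access
        # Second 32-bit access: skip the valid check ("cheat"), since it is
        # back-to-back with the first and hits the same, already-checked block.
        hi = line.data[offset + 1]
        return (hi << 32) | lo
```

Note the data stays readable after invalidation here; only the valid bit is cleared, which mirrors why the trick is safe for the immediately following half-access but not for a later, independent load.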

For stores, the cache block must be in modified (or exclusive, if silent upgrades are allowed) state to accept a store, so an incoming invalidate request (really a read-for-ownership) must generate a data response. Since a data response has to be produced anyway, and the total time of the two-half write would typically be only two processor cycles, that response can simply be delayed until the store has completed, making the 64-bit store appear atomic to other processors.
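The same idea as a hedged Python sketch (class and method names are invented for illustration): a read-for-ownership snoop arriving between the two 32-bit halves is queued, and its data response is only issued after both halves have been written.

```python
class StorePort:
    """Sketch of a 64-bit store over a 32-bit port; illustrative names."""
    def __init__(self, data, state="M"):
        self.data = data          # the cache block, as 32-bit words
        self.state = state        # MESI-style state; stores require M (or E)
        self.pending_rfo = []     # snoop responses deferred mid-store

    def snoop_rfo(self, respond):
        # A read-for-ownership requires a data response; queue it for now.
        self.pending_rfo.append(respond)

    def store64(self, offset, value, snoop_between=None):
        assert self.state in ("M", "E"), "must own the block to store"
        self.data[offset] = value & 0xFFFFFFFF       # first half
        if snoop_between:
            snoop_between()      # an RFO arrives between the two halves
        self.data[offset + 1] = value >> 32          # second half
        # Only now answer any deferred RFO, handing over the whole block.
        if self.pending_rfo:
            for respond in self.pending_rfo:
                respond(list(self.data))
            self.pending_rfo.clear()
            self.state = "I"     # ownership passes to the requester
```

The requester only ever sees the block with both halves written, which is exactly the delayed-response argument in prose above.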

LDSTUB (load-and-store-unsigned-byte) and SWAP could be handled somewhat similarly to a 64-bit store by delaying the load until the cache block is in exclusive/modified state; the store part of the operation is known to be immediately after the read portion and a data response is required anyway, so the data response can be delayed slightly.
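A sketch of that scheme in Python (again with invented names; `_upgrade` stands in for the coherence transaction): the load part of LDSTUB/SWAP waits for writable state just like the store part would, so the read and write then execute back-to-back with ownership guaranteed.

```python
class AtomicPort:
    """Sketch of LDSTUB/SWAP handled like a delayed 64-bit store: the load
    is held back until the block is writable. Names are illustrative."""
    def __init__(self, data, state="S"):
        self.data = data
        self.state = state

    def _upgrade(self):
        # The load part waits here too, not just the store part.
        if self.state not in ("M", "E"):
            self.state = "M"   # stand-in for a read-for-ownership transaction

    def ldstub(self, offset):
        self._upgrade()
        old = self.data[offset]        # load the byte...
        self.data[offset] = 0xFF       # ...and store all-ones, indivisibly
        self.state = "M"
        return old

    def swap(self, offset, value):
        self._upgrade()
        old = self.data[offset]
        self.data[offset] = value & 0xFFFFFFFF
        self.state = "M"
        return old
```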

An alternative implementation of LDSTUB and SWAP could treat an invalidation between the load and the store as a miss for the load, effectively reissuing the load. However, this presents a danger of livelock: two processors can keep stealing the block from each other, so that neither ever completes its atomic operation. While livelock can be managed (e.g., with various back-off techniques), the earlier-mentioned implementation is probably much simpler.
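The retry-with-back-off flavor can be sketched as follows. `load_locked` and `store_conditional` are hypothetical callbacks, not SPARC operations: the store fails if the block was invalidated since the load, in which case the load is reissued after a randomized delay. This bounds, but does not eliminate, the livelock risk.

```python
import random
import time

def swap_with_retry(load_locked, store_conditional, offset, value,
                    max_tries=16):
    """Sketch of the reissue-on-invalidation alternative; the callbacks
    are hypothetical, and a real design would need a stronger
    forward-progress guarantee than random back-off."""
    for attempt in range(max_tries):
        old = load_locked(offset)
        if store_conditional(offset, value):
            return old         # load+store completed without interference
        # Lost the block between load and store: back off, then retry.
        time.sleep(random.random() * (1 << attempt) * 1e-6)
    raise RuntimeError("giving up after max_tries: possible livelock")
```

This is essentially the load-locked/store-conditional pattern found in other ISAs, which is why the fixed delayed-response scheme above is simpler for SPARC's read-modify-write instructions.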