Now I'm trying to simulate the performance of an Intel Core 2 Duo processor (though I'd be very pleased with information about any other multi-core Intel processor) and its interaction with memory. As I understand it, the main problem with memory is maintaining cache coherence. But how does the MESI protocol work across the different cache levels? For example, how does it apply to L2 and L3? I would also be very glad to learn anything about the implementation of a non-exclusive write policy, and whether the replacement algorithm of L2 is connected to the block being replaced from L1. Does anyone know anything about this?
MESI protocol for multilevel cache in Intel processors
Tags: cache, computer-architecture, processor
Related Solutions
(I do not know any HDL, but I hope the following will be helpful anyway.)
One can use a 32-bit wide interface and implement atomic 64-bit loads/stores. For loads one can "cheat" by reading from the invalidated cache entry (only checking the tags on the first 32-bit load), since one knows that the two 32-bit accesses will be back-to-back and within the same cache block that is known to be a hit.
For stores, since the cache block must be in modified (or exclusive if silent updates are allowed) state to accept a store, an invalidate request (really read-for-ownership) generates a data response. Since a data response is provided and the total time of the write would typically only be two processor cycles, the data response could be delayed until the store has completed.
LDSTUB (load-and-store-unsigned-byte) and SWAP could be handled somewhat similarly to a 64-bit store by delaying the load until the cache block is in exclusive/modified state; the store part of the operation is known to be immediately after the read portion and a data response is required anyway, so the data response can be delayed slightly.
An alternative implementation of LDSTUB and SWAP could treat an invalidation between the load and the store as a miss for the load, effectively reissuing the load. However, this presents a danger of livelock. While livelock issues can be managed (e.g., various back-off techniques), the earlier mentioned implementation is probably much simpler.
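The atomic semantics being discussed can be made concrete with a toy software model. Below is a minimal Python sketch of what LDSTUB and SWAP guarantee to the programmer (the hardware, as described above, provides the atomicity by holding the block in exclusive/modified state across both halves); all names and the dict-as-memory structure are illustrative, not an actual implementation.

```python
# Toy model of SPARC-style atomic primitives against a byte-addressable
# memory (a plain dict stands in for memory here; the real atomicity
# comes from the cache holding the block exclusively, as described above).

def ldstub(memory, addr):
    """Load-and-store-unsigned-byte: atomically return the old byte
    and write 0xFF to the same location (classic test-and-set lock)."""
    old = memory.get(addr, 0)
    memory[addr] = 0xFF          # the store half of the atomic pair
    return old

def swap(memory, addr, new_value):
    """Atomically exchange a register value with the one in memory."""
    old = memory.get(addr, 0)
    memory[addr] = new_value
    return old

mem = {0x100: 0x00}
print(ldstub(mem, 0x100))   # 0   -> the lock byte was free; we now hold it
print(mem[0x100])           # 255 -> location is now 0xFF
print(swap(mem, 0x100, 7))  # 255 -> old value handed back atomically
```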
In a write-through cache every store operation from the processor simultaneously writes the new data into the cache-line and into the backing store (the next larger cache or the main memory).
In a write-back cache a store operation from the processor modifies only the cache-line, so the cache-line contains the most recent data while the data in the backing store is stale. The write to the backing store happens only when the cache line in question gets replaced because it is needed for some other line at a different address.
The pictures on the Wikipedia page about caching are okay.
In a write-through cache every line is in only one of two states: valid or invalid. Thus, when you need to fetch a line that is not in the cache, you can simply overwrite any line to make space for it, because the backing store already holds an up-to-date copy of everything.
In a write-back cache every line can be in one of three states: valid, invalid, or dirty. When a read-miss occurs and you need to evict a line to make space for the new one, the victim line may be dirty. A dirty victim must first be written to the backing store before the new line can be brought in, so servicing a single read-miss may require two operations on the backing store instead of one.
If the same cache lines get written many times then write-back caches can dramatically reduce the number of times you need to send writes to the backing store. You just keep making modifications to the dirty cache line until the line needs to be replaced and then write back only the last values written to each location in the line.
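The dirty-bit mechanics above can be sketched in a few lines of Python. This is a deliberately minimal single-line model (all class and variable names are made up for illustration); it counts backing-store writes to show how coalescing repeated stores to one dirty line reduces write traffic.

```python
# Minimal single-line write-back cache model. The backing store is a
# plain dict; backing_writes counts how often we actually write to it.

class WriteBackLine:
    def __init__(self):
        self.tag = None       # address of the cached line (None = invalid)
        self.data = 0
        self.dirty = False
        self.backing = {}     # stands in for the next cache level / memory
        self.backing_writes = 0

    def store(self, addr, value):
        if self.tag != addr:              # miss: replace the line
            if self.dirty:                # dirty victim must be written back
                self.backing[self.tag] = self.data
                self.backing_writes += 1
            self.tag = addr
            self.data = self.backing.get(addr, 0)
        self.data = value                 # update only the cache line ...
        self.dirty = True                 # ... and mark it dirty

cache = WriteBackLine()
for v in range(100):
    cache.store(0x40, v)   # 100 stores to the same line: all absorbed
cache.store(0x80, 1)       # line evicted once: a single backing write
print(cache.backing_writes)  # 1
```

A write-through cache would have performed 101 backing-store writes for the same sequence; the write-back policy performs one.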
Best Answer
I guess your question is about how a coherence protocol extends to multi-level caches. This book is a good reference. Here's my understanding:
I'll take the example of the Core i7 (as I'm not very familiar with the Core 2 Duo architecture).
In the Core i7 every core has private L1 and L2 caches, and all cores share a single large on-chip L3 cache. One can in turn join multiple such processors using point-to-point links to form a NUMA system. So there are four levels in the memory hierarchy: L1, L2, L3, and main memory.
There is one coherence protocol between the multiple L2s on a chip and the L3, and a separate protocol between the L3s on different chips. The two are independent of each other: one may use snooping while the other uses a directory-based implementation. I believe that in the Core i7 both are directory-based MESIF protocols (F is an additional Forward state).
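For reference, the state transitions of the classic four-state MESI protocol (the Forward state of MESIF is omitted here) can be written down as a small table. This is a textbook simplification, not Intel's actual implementation; the event names are my own.

```python
# Toy MESI transition table: (current_state, event) -> next_state.
# Events: 'read_hit' / 'write_hit' are local accesses; 'bus_read' is
# another cache reading the line; 'bus_write' is another cache's
# write / read-for-ownership, which invalidates our copy.

MESI = {
    ('M', 'read_hit'):  'M',
    ('M', 'write_hit'): 'M',
    ('M', 'bus_read'):  'S',   # supply the dirty data, drop to Shared
    ('M', 'bus_write'): 'I',
    ('E', 'read_hit'):  'E',
    ('E', 'write_hit'): 'M',   # silent upgrade: no bus traffic needed
    ('E', 'bus_read'):  'S',
    ('E', 'bus_write'): 'I',
    ('S', 'read_hit'):  'S',
    ('S', 'write_hit'): 'M',   # must first invalidate the other copies
    ('S', 'bus_read'):  'S',
    ('S', 'bus_write'): 'I',
}

state = 'E'
state = MESI[(state, 'write_hit')]  # E -> M without touching the bus
print(state)  # M
```

The Exclusive state exists precisely for that last transition: a core that read a line nobody else holds can later write it without any bus transaction.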
All caches in the Core i7 are inclusive, which simplifies the protocol somewhat. Because L2 is inclusive of L1, a block that is evicted from L2 has to be evicted from L1 too. Similarly, a block evicted from L3 has to be evicted from all L2s.

L3 maintains core_valid bits for each block: a core's bit is set if that core's L2 cache has a copy of the block. This way, when a block is evicted from L3, invalidations need to be sent only to the L2s that actually hold a copy, so the core_valid bits also act like a small directory. More generally, with inclusion only the coherence messages for blocks that are actually present in the outer, inclusive cache need to be forwarded inward, so that cache acts as a snoop filter for the cores behind it.
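The core_valid filtering described above can be sketched as follows. This is an illustrative Python model under my own naming (InclusiveL3, l2_fill, l3_evict are invented), not Intel's design: on an L3 eviction, back-invalidations go only to the cores whose core_valid bit is set for that block.

```python
# Sketch of an inclusive L3 with per-block core_valid bits. Evicting a
# block from L3 back-invalidates it from exactly the L2s that hold it.

class InclusiveL3:
    def __init__(self, num_cores):
        self.core_valid = {}            # block addr -> set of core ids
        self.l2_contents = [set() for _ in range(num_cores)]

    def l2_fill(self, core, addr):
        """A core's L2 pulls in a block; L3 records who has a copy."""
        self.l2_contents[core].add(addr)
        self.core_valid.setdefault(addr, set()).add(core)

    def l3_evict(self, addr):
        """Inclusion: back-invalidate the block from every L2 whose
        core_valid bit is set -- and only those L2s."""
        invalidated = []
        for core in self.core_valid.pop(addr, set()):
            self.l2_contents[core].discard(addr)
            invalidated.append(core)
        return sorted(invalidated)

l3 = InclusiveL3(num_cores=4)
l3.l2_fill(0, 0x1000)
l3.l2_fill(2, 0x1000)
print(l3.l3_evict(0x1000))  # [0, 2] -- cores 1 and 3 see no message
```

The same lookup serves as the snoop filter: a coherence request for a block with no core_valid bits set need not be forwarded to any core at all.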
I'm not sure I understand your question about the non-exclusive policy. Maybe this link will help.