Virtual Memory, Cache, and TLBs

cache · computer-architecture · design · memory

I recently asked this question over at stackoverflow, but I was told by a user that it was off-topic, so I am posting it here since it's more of a hardware question.

I'm trying to study for an exam and I'm trying to answer the question below.

I'll be honest, I've asked this question on here before, but now that I have a better understanding of it, I felt I should make a new question to fine-tune my understanding, since my question is somewhat different now.

Consider a virtual memory system with the following properties:

  • 35-bit virtual address
  • 16 KB pages
  • 32-bit physical address

Assume that this virtual memory system is implemented with an eight-way set associative TLB. The TLB has a total of 256 entries, with each TLB entry representing one virtual-to-physical page number translation.

The 64 KB data cache is two-way set associative. The data cache’s block size is 128 Bytes.

Show the virtual to physical mapping with a figure drawn in a way similar to the figure below (but with all necessary changes required for the TLB and the data cache specified in this question).

Specify the width of all fields and signals coming in and out of (as well as the number of comparisons done by) the TLB and the data cache for each memory address.

Figure

So far, I've come up with the following:

Schematic

In case you're curious about the font, it's called Waltograph, a TTF I downloaded from the web. It was set as the default in Paint, so I decided to enliven my studying with some Disney magic.

Anyway, since the problem states that we have 16 KB pages (\$2^4 \cdot 2^{10} = 2^{\textbf{14}}\$ byte pages), we need a page offset of 14 bits, as indicated in my schematic.

Then, since I have a total of 256 TLB entries and an 8-way set associative TLB, we need an index 5 bits wide (\$256/8 = 32 = 2^{\textbf{5}}\$ sets). Whatever is left over is used for the tag, as indicated.
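As a sanity check, those field widths can be computed directly from the numbers in the problem statement (a small sketch, nothing beyond the parameters given above):

```python
# Field widths for the 8-way set-associative TLB in this question.
VA_BITS = 35           # virtual address width
PA_BITS = 32           # physical address width
PAGE_SIZE = 16 * 1024  # 16 KB pages
TLB_ENTRIES = 256
TLB_WAYS = 8

page_offset_bits = PAGE_SIZE.bit_length() - 1   # log2(16 KB) = 14
tlb_sets = TLB_ENTRIES // TLB_WAYS              # 256 / 8 = 32 sets
tlb_index_bits = tlb_sets.bit_length() - 1      # log2(32) = 5
tlb_tag_bits = VA_BITS - page_offset_bits - tlb_index_bits  # 35 - 14 - 5 = 16
ppn_bits = PA_BITS - page_offset_bits           # physical page number: 32 - 14 = 18

print(page_offset_bits, tlb_index_bits, tlb_tag_bits, ppn_bits)  # 14 5 16 18
```

So each TLB entry holds a 16-bit tag and an 18-bit physical page number, plus the valid bit.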

Then I have 8 comparators, as shown, which compare each tag at the indexed set to the specified one. Each result is ANDed with the entry's valid bit to determine whether we have a hit (I then OR the outputs of the AND gates together to collapse them into one hit signal).

Once we know we have a hit, we need to extract the translation (the physical page number) from the TLB. I used a one-hot mux to select the physical page number of the matching TLB entry. Its output is then concatenated with the page offset from the virtual address to form the physical address.
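The lookup described above can be modeled in a few lines of software (a hypothetical data layout for intuition only; real hardware does the 8 comparisons in parallel):

```python
# Minimal software model of the 8-way set-associative TLB lookup.
OFFSET_BITS, INDEX_BITS = 14, 5   # field widths from this question

def tlb_lookup(tlb_sets, vaddr):
    """tlb_sets: 32 sets, each a list of 8 (valid, tag, ppn) tuples."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    index = (vaddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = vaddr >> (OFFSET_BITS + INDEX_BITS)
    for valid, entry_tag, ppn in tlb_sets[index]:  # 8 tag comparisons, one per way
        if valid and entry_tag == tag:             # tag match ANDed with valid bit
            return (ppn << OFFSET_BITS) | offset   # concatenate PPN with page offset
    return None                                    # TLB miss
```

The final `(ppn << 14) | offset` is exactly the concatenation step in the schematic.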

Now what I'm confused about is the cache part. I understand that the TLB is essentially a cache of the most recently used address translations. However, I don't understand what's going on in the book's diagram. It suddenly splits up the physical address and uses it to index the cache, I guess. But why is it showing the cache and data separately? And why is the byte offset just left floating? I'm pretty sure the cache is supposed to store data as well; I don't think its sole purpose is to determine whether there's a hit or miss.

I apologize for my ignorance in advance, but the book barely covers TLBs (it's a little more than a page) and it doesn't do a very good job of explaining the relationship between a TLB and a cache.

I'd appreciate it if somebody could verify what I've done so far and explain what the cache has to do with this. Thanks.

Best Answer

I agree that illustration is confusing.

The top half of the page is intended to describe the TLB. It sounds like you understand TLB stuff pretty well.

The entire bottom half of the page is intended to describe the data cache. (The label "cache" on the left is intended to apply to the entire bottom half of the page. How could it be redrawn to make it more obvious that it applies not only to the cache metadata valid+tag bits, but also all the data all the way to the right edge of the page?).

It suddenly splits up the physical address and uses it to index the cache, I guess.

Yes. The bottom half of that page, as you just said, and like most large caches, is a physically-indexed, physically-tagged data cache.

But why is it showing the cache and data separately?

That part of the illustration is unnecessarily confusing.

While in principle each word of memory could have its own valid+tag bits, most data caches share the valid+tag bits for a much larger block of data copied from main memory -- a block called a cache line. Loading more data than the program specifically asked for in a single instruction is often helpful, because practically all programs have some spatial locality.

The resulting cache entry structure looks something like

v tag w w w w  w w w w  w w w w  w w w w
v tag w w w w  w w w w  w w w w  w w w w
v tag w w w w  w w w w  w w w w  w w w w
v tag w w w w  w w w w  w w w w  w w w w
v tag w w w w  w w w w  w w w w  w w w w
v tag w w w w  w w w w  w w w w  w w w w
v tag w w w w  w w w w  w w w w  w w w w
v tag w w w w  w w w w  w w w w  w w w w

where the 'v' indicates the valid bit, and each 'w' represents a word of data.

Inexplicably, the book's illustration only shows one of the many blocks of data in the cache:

v tag
v tag
v tag
v tag
v tag
v tag w w w w  w w w w  w w w w  w w w w  < -- hit on this cache line.
v tag
v tag

and then the book's illustration inexplicably rotates the words in that cache line to show all the words of that one cache line stacked on top of each other.

When the data cache detects a hit -- when the cache tag matches the tag part of the desired address, and the valid bit is set -- the "block offset" part of the address selects one particular word of that one particular cache line.
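That word selection is just bit-slicing the block offset (a sketch; `line` is a hypothetical list of the 32 words in the cache line that hit):

```python
# Selecting one 32-bit word from a 128-byte cache line on a hit.
def select_word(line, paddr):
    """line: list of 32 words (128 bytes / 4 bytes per word)."""
    block_offset = paddr & 0x7F     # low 7 bits: byte within the 128 B line
    word_index = block_offset >> 2  # drop the 2-bit byte offset -> word 0..31
    return line[word_index]
```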

Perhaps the illustrator ran out of room drawing an extremely wide cache line, and arbitrarily decided to rotate that line to make it fit on the page without considering how confusing that would be?

The data cache’s block size is 128 Bytes.

So for any physical byte address, the bottom 7 bits indicate some particular byte within a cache line, and all the upper bits of that address are used to select some particular cache line.
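The data-cache field widths follow the same arithmetic as the TLB's, using only the numbers given in this question:

```python
# Field widths for the 64 KB, 2-way set-associative, 128-byte-block data cache.
PA_BITS = 32
CACHE_SIZE = 64 * 1024
WAYS = 2
BLOCK_SIZE = 128

block_offset_bits = BLOCK_SIZE.bit_length() - 1  # log2(128) = 7
num_sets = CACHE_SIZE // (WAYS * BLOCK_SIZE)     # 64 KB / (2 * 128 B) = 256 sets
cache_index_bits = num_sets.bit_length() - 1     # log2(256) = 8
cache_tag_bits = PA_BITS - cache_index_bits - block_offset_bits  # 32 - 8 - 7 = 17

print(block_offset_bits, cache_index_bits, cache_tag_bits)  # 7 8 17
```

So the 32-bit physical address splits into a 17-bit tag, an 8-bit set index, and a 7-bit block offset, and each set requires 2 tag comparisons (one per way).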

why is the byte offset just left floating?

The byte offset is left floating in this illustration because it is not used by the TLB or by the data cache: a typical data cache, like the one illustrated, only deals with aligned 32-bit words. The 2 bits of the address that select one of the 4 bytes within a 32-bit word are handled elsewhere.

Some simple CPUs only have hardware for aligned whole-word access. (I call them "Neither Endian" in "DAV's Endian FAQ"). Compiler writers for such CPUs must add padding to ensure that every instruction is aligned and every data value is aligned. (The two-bit byte offset should always be zeros on these machines).

Many CPUs have a LOAD instruction that can load unaligned 32-bit values into a 32-bit register. Such CPUs have special hardware elsewhere (not part of the cache) that sometimes does 2 reads from the data cache for a single LOAD instruction -- the unaligned 32-bit value can overlap 2 different cache lines, and either or both reads may cause a cache miss. The 2 bits of the address that select one of the 4 bytes within an (aligned) 32-bit word are used internally by the CPU to select the relevant bytes that the cache returns for those reads and reassemble them into the (unaligned) 32-bit value that the programmer expects.

Even though such instructions give correct results no matter how things are aligned or misaligned in memory, assembly language programmers, compiler writers, and other programmers obsessed with optimization sometimes add padding anyway to get (some) instructions aligned or (some) data aligned or both. ("How and when to align to cache line size?"; "Aligning to cache line and knowing the cache line size"; etc.) They try to justify this padding by claiming it "optimizes" the program to "run faster". Recent tests seem to indicate that data alignment for speed is a myth.
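That split-and-reassemble step can be sketched in software (a hypothetical little-endian model; `read_aligned_word` stands in for one read from the data cache):

```python
# Sketch of how an unaligned 32-bit load decomposes into aligned word reads.
# Real hardware does this with muxes and shifters, not Python integers.
def load_unaligned_u32(read_aligned_word, addr):
    """read_aligned_word(a) returns the 32-bit word at 4-byte-aligned address a."""
    byte_offset = addr & 0x3      # the 2 bits handled "elsewhere"
    base = addr & ~0x3
    lo = read_aligned_word(base)
    if byte_offset == 0:
        return lo                 # aligned: a single cache read suffices
    hi = read_aligned_word(base + 4)  # second read; may touch another cache line
    # little-endian reassembly of the 4 bytes spanning the two aligned words
    combined = (hi << 32) | lo
    return (combined >> (8 * byte_offset)) & 0xFFFFFFFF
```

Note that when `base` and `base + 4` fall in different cache lines, either read can miss independently, which is exactly why such loads are slower.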

the relationship between a TLB and cache

Conceptually the only connection between the TLB and a (physically-indexed, physically-tagged) data cache is the bundle of wires carrying the physical-address output of the TLB to the physical-address input of the data cache.

One person can design a data cache for a simple CPU without virtual memory that caches physical addresses. Another person can design a TLB for a simple CPU that has no data cache (A CPU with a TLB but no data cache was once a common arrangement for mainframe computers).

In principle, a third person can splice that TLB and that data cache together, wiring the physical-address output of the TLB to the physical-address input of the data cache. The TLB neither knows nor cares that it is now connected to the data cache rather than the main memory address bus. The data cache neither knows nor cares that it is now connected to the TLB rather than directly to the CPU address register(s).
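In software terms, that splice is just function composition (both halves below are hypothetical toy stand-ins, used only to show that neither half knows about the other):

```python
# The "splice": the TLB's physical-address output feeds the cache's input.
PAGE_TABLE = {0x5: 0x9}            # toy mapping: virtual page 5 -> physical page 9
MEMORY = {(0x9 << 14) | 0x10: 42}  # one word of "main memory"

def translate(vaddr):              # the TLB half: knows nothing about caches
    vpn, offset = vaddr >> 14, vaddr & 0x3FFF
    return (PAGE_TABLE[vpn] << 14) | offset

def cache_read(paddr):             # the cache half: knows nothing about TLBs
    return MEMORY[paddr]           # (a real cache would check tags first)

def load(vaddr):                   # the splice: one bundle of wires between them
    return cache_read(translate(vaddr))

print(load((0x5 << 14) | 0x10))   # 42
```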