I agree that illustration is confusing.
The top half of the page is intended to describe the TLB.
It sounds like you understand TLB stuff pretty well.
The entire bottom half of the page is intended to describe the data cache.
(The label "cache" on the left is intended to apply to the entire bottom half of the page. How could it be redrawn to make it more obvious that it applies not only to the cache metadata valid+tag bits, but also all the data all the way to the right edge of the page?).
It suddenly splits up the physical address and uses it to index the
cache, I guess.
Yes. The bottom half of that page, as you just said, and like most large caches, is a physically-indexed, physically-tagged data cache.
But why is it showing the cache and data separately?
That part of the illustration is unnecessarily confusing.
While in principle each word of memory could have its own valid+tag bits, most data caches share the valid+tag bits for a much larger block of data copied from main memory -- a block called a cache line.
Loading more data than the program specifically asked for in a single instruction is often helpful, because practically all programs have some spatial locality.
The resulting cache entry structure looks something like
v tag w w w w w w w w w w w w w w w w
v tag w w w w w w w w w w w w w w w w
v tag w w w w w w w w w w w w w w w w
v tag w w w w w w w w w w w w w w w w
v tag w w w w w w w w w w w w w w w w
v tag w w w w w w w w w w w w w w w w
v tag w w w w w w w w w w w w w w w w
v tag w w w w w w w w w w w w w w w w
where the 'v' indicates the valid bit, and each 'w' represents a word of data.
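To make the sharing concrete, here is a toy sketch in Python of that cache-entry structure. The names, the 16-words-per-line size, and the 8-line cache are all made up to match the diagram, not any real CPU:

```python
# Sketch of the cache-entry structure shown above: each entry shares one
# valid bit and one tag across a whole cache line of data words.
# (Illustrative only; 16 words per line matches the diagram above.)

WORDS_PER_LINE = 16

class CacheLine:
    def __init__(self):
        self.valid = False                  # the 'v' bit
        self.tag = None                     # the 'tag' field
        self.words = [0] * WORDS_PER_LINE   # the 'w' data words

    def fill(self, tag, words):
        """Copy a whole block from main memory into this line."""
        assert len(words) == WORDS_PER_LINE
        self.valid = True
        self.tag = tag
        self.words = list(words)

# A small direct-mapped cache is just an array of such lines:
cache = [CacheLine() for _ in range(8)]
cache[5].fill(tag=0x1A, words=range(16))
print(cache[5].valid, hex(cache[5].tag), cache[5].words[3])  # True 0x1a 3
```

The point of the shared tag is visible in `fill`: one tag update covers all 16 words, which is exactly why the illustration should draw the tag and the data side by side.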
Inexplicably, the book's illustration only shows one of the many blocks of data in the cache:
v tag
v tag
v tag
v tag
v tag
v tag w w w w w w w w w w w w w w w w <-- hit on this cache line.
v tag
v tag
and then the book's illustration inexplicably rotates the words in that cache line to show all the words of that one cache line stacked on top of each other.
When the data cache detects a hit --
when the cache tag matches the tag part of the desired address, and the valid bit is set --
then the "block offset" part of the address indicates one particular word of that one particular cache line.
Perhaps the illustrator ran out of room drawing an extremely wide cache line, and arbitrarily decided to rotate that line to make it fit on the page without considering how confusing that would be?
The data cache’s block size is 128 bytes.
So for any physical byte address, the bottom 7 bits indicate some particular byte within a cache line, and all the upper bits of that address are used to select some particular cache line.
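That bit-slicing can be sketched in Python. Only the 7-bit block offset comes from the text above; the 3-bit index (a toy 8-line direct-mapped cache) is an assumption to keep the numbers small:

```python
# Sketch: carving up a physical address for a cache with 128-byte lines.
# Bottom 7 bits locate a byte within the line; the upper bits pick the line
# and form the tag. (The 8-line cache size is a made-up example.)

BLOCK_BITS = 7   # 2**7 = 128-byte cache lines
INDEX_BITS = 3   # 2**3 = 8 lines in this toy cache

def split_address(paddr):
    byte_in_line = paddr & ((1 << BLOCK_BITS) - 1)            # bits [6:0]
    index = (paddr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)   # bits [9:7]
    tag = paddr >> (BLOCK_BITS + INDEX_BITS)                  # everything above
    word_in_line = byte_in_line >> 2   # bits [6:2]: which 32-bit word
    byte_in_word = byte_in_line & 0b11 # bits [1:0]: handled outside the cache
    return tag, index, word_in_line, byte_in_word

print(split_address(0x12345678))  # (298261, 4, 30, 0)
```

Note that `byte_in_word` is computed here only to show where those 2 bits sit; as the next answer explains, the cache itself never looks at them.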
why is the byte offset just left floating?
The byte offset is left floating in this illustration because it is not used by the TLB or by the data cache. A typical TLB and data cache, like the ones illustrated, deal only with aligned 32-bit words.
The 2 bits of the address that select one of the 4 bytes within a 32-bit word are handled elsewhere.
Some simple CPUs only have hardware for aligned whole-word access.
(I call them "Neither Endian" in "DAV's Endian FAQ").
Compiler writers for such CPUs must add padding to ensure that every instruction is aligned and every data value is aligned.
(The two-bit byte offset should always be zeros on these machines).
Many CPUs have a LOAD instruction that can load unaligned 32-bit values into a 32-bit register.
Such CPUs have special hardware elsewhere (not part of the cache) that, for each such LOAD instruction, sometimes does 2 reads from the data cache -- the unaligned 32-bit value can overlap 2 different cache lines, and either or both reads may cause a cache miss.
The 2 bits of the address that select one of the 4 bytes within an (aligned) 32-bit word are used internally by the CPU to select the relevant bytes that the cache returns for those reads and re-assemble those bytes into the (unaligned) 32-bit value that the programmer expects.
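Here is a toy Python sketch of that reassembly, assuming a little-endian CPU and modeling the cached data as one flat bytes object (real hardware would do 1 or 2 separate line reads instead of a slice):

```python
# Sketch of what the "special hardware elsewhere" does for an unaligned
# 32-bit load: fetch up to two aligned words (possibly in two different
# cache lines), then merge the bytes. Little-endian byte order assumed.

LINE_SIZE = 128  # bytes per cache line, as above

def unaligned_load32(memory, addr):
    """memory: a bytes-like object standing in for the cached data."""
    # Which cache lines does [addr, addr+4) touch?
    first_line = addr // LINE_SIZE
    last_line = (addr + 3) // LINE_SIZE
    # Either (or both) line fetches could miss in a real cache.
    raw = memory[addr:addr + 4]     # hardware does this with 1 or 2 reads
    value = int.from_bytes(raw, "little")
    return value, (last_line != first_line)  # (result, crossed a line?)

mem = bytes(range(256))
val, crossed = unaligned_load32(mem, 126)   # bytes 126,127 | 128,129
print(hex(val), crossed)                    # 0x81807f7e True
```

An address like 126 is the interesting case: the 4 bytes straddle the boundary between line 0 and line 1, so the real hardware must issue two cache reads and splice the results.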
Even though such instructions give the correct results no matter how things are aligned or mis-aligned in memory, assembly language programmers and compiler writers and other programmers obsessed with optimization sometimes add padding anyway to get (some) instructions aligned or (some) data aligned or both.
("How and when to align to cache line size?";
"Aligning to cache line and knowing the cache line size";
etc.)
They try to justify this padding by claiming it "optimizes" the program to "run faster".
Recent tests seem to indicate that data alignment for speed is a myth.
the relationship between a TLB and cache
Conceptually the only connection between the TLB and a (physically-indexed, physically-tagged) data cache is the bundle of wires carrying the physical-address output of the TLB to the physical-address input of the data cache.
One person can design a data cache for a simple CPU without virtual memory that caches physical addresses.
Another person can design a TLB for a simple CPU that has no data cache (A CPU with a TLB but no data cache was once a common arrangement for mainframe computers).
In principle,
a third person can splice that TLB and that data cache together, wiring the physical-address output of the TLB to the physical-address input of the data cache.
The TLB neither knows nor cares that it is now connected to the data cache rather than the main memory address bus.
The data cache neither knows nor cares that it is now connected to the TLB rather than directly to the CPU address register(s).
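The splice can be sketched in Python, with both sides reduced to toy dictionaries so that the only thing passing between them is the physical address (the page size and all the specific numbers are made-up examples):

```python
# Sketch of the "splice": the only interface between the TLB and a
# physically-indexed, physically-tagged cache is the physical address.
# Both sides are toy stand-ins (plain dicts), not real hardware models.

PAGE_SIZE = 4096

def tlb_translate(tlb, vaddr):
    """tlb maps virtual page number -> physical page number."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    ppn = tlb[vpn]              # a KeyError here would be a TLB miss
    return ppn * PAGE_SIZE + offset

def cache_read(cache, paddr):
    """cache maps physical address -> data; it knows nothing of the TLB."""
    return cache[paddr]         # a KeyError here would be a cache miss

tlb = {0x12: 0x7F}                       # virtual page 0x12 -> phys page 0x7F
cache = {0x7F * PAGE_SIZE + 0x10: 42}    # one cached word at that phys address

vaddr = 0x12 * PAGE_SIZE + 0x10
print(cache_read(cache, tlb_translate(tlb, vaddr)))  # 42
```

Note that `cache_read` would work identically if its input came straight from a CPU address register, which is the point: the two components compose through nothing but the physical address.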
In general a page table must be able to cover the entire virtual address space. However, this does not mean that every page needs to have a page table entry.
For 32-bit systems the most popular means of achieving this is using a hierarchical (also called multilevel) page table. With 4-byte PTEs and 4 KiB pages, a hierarchical page table would use a page table base pointer that points to a page of Page Directory Entries. The most significant 10 bits of the address are used to index a 4-byte PDE. If the valid bit of the PDE is set, then the address field of the PDE points to a page of PTEs which is indexed by the remaining 10 translated address bits (the least significant 12 bits are not used in translation but are the offset within the page).
(Linear page tables use a similar arrangement, where the page table is mapped in the virtual address space, typically starting at virtual address 0. The virtual address of the desired PTE is translated by the TLB. If this generates a TLB miss, then the virtual address of the PTE [equivalent to a PDE] that translates the page table's virtual address is translated by the TLB. If that misses too, then the virtual address of the PTE that translates that page-directory page is used; this is equivalent to the page table base pointer in a hierarchical page table and is often stored in a special-purpose register. Linear page tables have the disadvantages of excluding a portion of the virtual address space from ordinary use, potentially 'wasting' TLB entries, and having more variable TLB miss overhead. They have the advantages of exploiting the TLB to cache PDE-equivalents and of potentially more natural support for variable-sized translation pages, which can reduce the overhead of caching PDE-equivalents and, in larger address spaces, potentially remove a level of look-up. The variability of miss overhead can be managed by locking TLB entries for critical parts of the page table.)
With such tables, as long as the valid portion of the virtual address space is not sparse, many PDEs can be marked as invalid; this effectively compresses a 4 KiB page table holding invalid PTEs into the 4 bytes of the PDE.
If a PDE that has its valid bit cleared (and therefore the page of PTEs is not present) is referenced, then this would be treated as a form of page fault, and the subtable would be populated as necessary (e.g., for a large [4 MiB] chunk of allocated memory that has never been referenced, the subtable might be filled with translations pointing to a read-only page of zeros). Since the subtable would ideally only be swapped out (which may not be supported by the OS anyway) if all of its PTEs are invalid (unallocated or swapped out), having a virtual address with its contents in memory but not having in memory the subtable that provides the translation is very unlikely. Invalid PDEs are typically used for chunks of the virtual address space that are not allocated (i.e., an access would generate a permission-violation exception).
In a typical 32-bit system, if every 4 MiB chunk of the virtual address space had at least one valid page, then the entire page table would have to be present (barring swapping out of subtables). Common systems do not allocate memory sparsely, so many subtables would have no valid PTEs and can be represented by marking the PDE as invalid.
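The 10 + 10 + 12 walk described above, including invalid PDEs standing in for whole unmapped 4 MiB regions, can be sketched in Python (all the specific numbers are made-up examples):

```python
# Sketch of a two-level (10 + 10 + 12 bit) hierarchical page table walk.
# Tables are modeled as 1024-entry lists; None means "valid bit clear".

def walk(page_directory, vaddr):
    pde_index = (vaddr >> 22) & 0x3FF  # top 10 bits index the directory
    pte_index = (vaddr >> 12) & 0x3FF  # next 10 bits index the PTE page
    offset = vaddr & 0xFFF             # bottom 12 bits: offset within page

    pte_page = page_directory[pde_index]
    if pte_page is None:
        raise LookupError("PDE invalid: whole 4 MiB region unmapped")
    ppn = pte_page[pte_index]
    if ppn is None:
        raise LookupError("PTE invalid: page fault")
    return (ppn << 12) | offset

# One valid 4 MiB region; the other 1023 PDEs cost only 4 bytes each.
pte_page = [None] * 1024
pte_page[5] = 0x1ABC                   # page 5 of this region -> PPN 0x1ABC
page_directory = [None] * 1024
page_directory[3] = pte_page

vaddr = (3 << 22) | (5 << 12) | 0x123
print(hex(walk(page_directory, vaddr)))  # 0x1abc123
```

The compression the answer describes is visible in the setup: 1023 of the 1024 directory slots hold a single `None` (the cleared valid bit) instead of a whole 4 KiB page of invalid PTEs.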
Another, somewhat less popular page table organization is the inverted (or hashed) page table. This uses a hash table to store the PTEs. This requires each entry (which may include multiple PTEs, though no ISA defines such use) to be tagged with enough of the virtual address to uniquely identify the entry. This has the disadvantage that hash conflicts must be managed. (Providing more than one entry per index/hash bucket can reduce the frequency of conflicts.) For non-sparse use of the address space, inverted page tables generally have more storage and bandwidth overhead than hierarchical page tables and add complexity for handling hash conflicts. For (unusual) access patterns which do not have spatial locality--especially for large address spaces--or compared to hierarchical page tables that do not cache PDEs, inverted page tables allow faster TLB fill since only one look-up is required (assuming no conflict).
Inverted page tables are sometimes called, and may be used as, software TLBs. They act like TLBs but are stored in memory and are managed and sized by software. Like a TLB, there must usually be a mechanism to handle misses. Such mechanisms are OS-specific and can include using a linked list (so in addition to the tag overhead, there would be overhead for a next pointer) or using a secondary (victim) hash table.
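A toy Python sketch of such a hashed table, with small buckets and explicit tag checks (the sizes and the bucket-full policy are assumptions for illustration, not anything a particular OS does):

```python
# Sketch of an inverted (hashed) page table: one hash table for the whole
# machine, each entry tagged with the virtual page number (VPN) so hash
# conflicts can be detected. Multi-entry buckets reduce conflict frequency.

NUM_BUCKETS = 1024
BUCKET_WAYS = 2  # more than one entry per bucket reduces conflicts

table = [[] for _ in range(NUM_BUCKETS)]  # each bucket: list of (vpn, ppn)

def insert(vpn, ppn):
    bucket = table[hash(vpn) % NUM_BUCKETS]
    if len(bucket) >= BUCKET_WAYS:
        raise LookupError("bucket full: OS must resolve the conflict")
    bucket.append((vpn, ppn))

def lookup(vpn):
    for tag, ppn in table[hash(vpn) % NUM_BUCKETS]:
        if tag == vpn:   # the tag check distinguishes colliding VPNs
            return ppn
    raise LookupError("miss: fall back to the OS's miss handler")

insert(0x12345, 0x77)
print(hex(lookup(0x12345)))  # 0x77
```

The `LookupError` paths are where a real OS would plug in its linked-list chaining or victim hash table, as described above.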