Instruction cache’s purpose

Tags: cache, embedded, microprocessor

A processor is running at 100 MHz. The program memory interfaced to it can work at only 25 MHz maximum. Is there any way we can fetch an instruction from it in one clock cycle of the processor? I read that an instruction cache can be used for this purpose. How does an instruction cache actually make instruction fetches faster, when we cannot increase the speed of the slow memory?

Best Answer

Processors have a fetch/execute pipeline that fetches, decodes and executes instructions. Different processors have different numbers of pipeline stages. A three-stage pipeline is shown below. Each instruction takes three clock cycles to complete, but because the stages can be overlapped, the net execution rate is 1 clock/instruction for the processor.

[Figure: Fetch/Execute Pipeline]
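To make the overlap concrete, here is a minimal sketch in C, not modelling any real processor; the six-instruction program and the stage behaviour are assumptions for illustration. It steps a three-stage pipeline one clock at a time:

    /* A minimal sketch, not any specific processor: step a 3-stage
     * fetch/decode/execute pipeline one clock at a time and watch the
     * overlap. The 6-instruction program is assumed for illustration. */
    #include <stdio.h>

    #define STAGES    3
    #define NUM_INSNS 6

    int main(void) {
        int stage[STAGES] = {-1, -1, -1};   /* fetch, decode, execute; -1 = empty */
        int next = 0, done = 0, cycle = 0;

        printf("cycle  fetch  decode  execute\n");
        while (done < NUM_INSNS) {
            cycle++;
            /* Advance the pipe: everything shifts one stage down,
             * and a new instruction is fetched if any remain. */
            for (int s = STAGES - 1; s > 0; s--)
                stage[s] = stage[s - 1];
            stage[0] = (next < NUM_INSNS) ? next++ : -1;

            printf("%5d ", cycle);
            for (int s = 0; s < STAGES; s++) {
                if (stage[s] >= 0) printf("   I%-4d", stage[s]);
                else               printf("   --   ");
            }
            printf("\n");
            if (stage[STAGES - 1] >= 0)
                done++;                     /* one instruction completes */
        }
        return 0;
    }

Six instructions finish in 6 + 3 - 1 = 8 clocks: once the pipe is full, one instruction completes every clock.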

Ideally, we’d like all of our memory to run at processor speed so programs execute as fast as possible, but this is not cost-effective. A typical memory hierarchy is shown below.

[Figure: Memory Hierarchy]

The fastest memory is the registers. If you have many general-purpose registers, the program can run from the registers.

Locality means most instruction fetches will be sequential. The only time they are not is when a branch is encountered. The present contents of the 3-stage pipeline must be trashed so the new instructions can be loaded. If the branch is small, the target contents may be in registers. We lose 3 clock cycles while the new instruction is fetched from the registers: the processor stalls until there is a new instruction to execute, and additional clock cycles are inserted.
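A back-of-the-envelope sketch of what those stalls cost, assuming (a made-up figure for illustration) that 1 in 10 instructions is a taken branch:

    /* Back-of-the-envelope: effective clocks-per-instruction when some
     * fraction of instructions are branches that trash a 3-stage pipe.
     * The 10% branch rate is an assumed figure, not from the text. */
    #include <stdio.h>

    int main(void) {
        double base_cpi      = 1.0;  /* 1 clock/instruction with a full pipe   */
        double branch_rate   = 0.10; /* assumed: 1 in 10 instructions branches */
        double flush_penalty = 3.0;  /* 3 lost clocks to refill a 3-stage pipe */

        printf("effective CPI = %.2f clocks/instruction\n",
               base_cpi + branch_rate * flush_penalty);  /* 1.30 */
        return 0;
    }

Even a modest branch rate pulls the average noticeably above 1 clock/instruction, which is why the refill path matters.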

If it is not in the registers, the address is sent to the cache. On-chip cache is a small amount of high-speed memory. If it is a hit, the on-chip cache needs 1 or 2 clock cycles to supply the next instruction. Still 1-ish clock cycles/instruction.

We skip multiple levels of cache here; an L2 cache is just another level of buffering, a small amount of fast SRAM, and there is no real point to it at 100 MHz.

If it is a miss, the cache must request the instruction from main memory. Main memory is a large amount of cheap memory. At 25 MHz, each access takes 4 of the processor's clock cycles, so the instruction costs 4 clocks to fetch (it is read into the cache and the processor together) and 3 to execute.
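As long as misses are rare, the average fetch time stays close to one clock. A minimal sketch using the standard average-memory-access-time formula (the 5% miss rate is an assumption for illustration):

    /* Sketch: average instruction-fetch time for a 100 MHz core in front
     * of 25 MHz main memory, using AMAT = hit_time + miss_rate * miss_penalty.
     * The 5% miss rate is an assumed figure for illustration. */
    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;  /* clocks: on-chip cache hit          */
        double miss_penalty = 4.0;  /* clocks: one 25 MHz access = 100/25 */
        double miss_rate    = 0.05; /* assumed: 1 fetch in 20 misses      */

        printf("average fetch time = %.2f clocks\n",
               hit_time + miss_rate * miss_penalty);  /* 1.20 clocks */
        return 0;
    }

At a 5% miss rate the processor still averages 1.2 clocks per fetch, far closer to the 100 MHz ideal than to the 25 MHz memory.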

Locality means that the processor will also need the next instruction. The memory controller is set up to read a full block of adjacent locations from DRAM into the cache. The program counter will then request the next location and miss in the cache again, but the memory controller is already fetching that second location from main memory, so the overall delay is lower.
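A sketch of what that block fill costs per instruction, assuming a 4-word line and the burst timings described in the edit below (first word 4 clocks, later words 2; both the line size and the timings are illustrative):

    /* Sketch: filling a 4-word cache line from burst DRAM. Word 0 pays the
     * full access time; words 1..3 arrive on the burst and cost less. Line
     * size and burst timing are assumptions (see the edit below). */
    #include <stdio.h>

    int main(void) {
        const int line_words   = 4;
        const int first_access = 4;  /* clocks: full DRAM access, 100 MHz / 25 MHz */
        const int burst_access = 2;  /* clocks: assumed per word within the burst  */

        int total = first_access + (line_words - 1) * burst_access;
        printf("line fill: %d clocks, %.2f clocks/word\n",
               total, (double)total / line_words);  /* 10 clocks, 2.50/word */
        return 0;
    }

Note this is the cold-start cost; once the line is in the cache, each of those words is a 1-clock hit.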

Finally, if the program is not in main memory, it must be loaded into DRAM from storage. That is a big performance hit, as storage is sloooooow...

Ultimately, whether you reach 1 clock cycle/instruction depends on the program and on the compiler/programmer.


Edit...

The cache has to be loaded. Typically, DRAM controllers have burst capability, where the first read takes 4 cycles, but because the initial DRAM address has already been supplied, subsequent reads take less, say 2 cycles. A three-stage-pipeline processor has to wait 7 clock cycles to get the first instruction (worst-case scenario: branch, trashed pipeline, cache miss, fetch from DRAM), but only 2 or 3 for the next instruction, which is already on its way to the cache.
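The worst-case arithmetic, spelled out with the same numbers as above:

    /* The worst case from the paragraph above: a branch trashes the pipeline
     * AND misses in the cache, so the first instruction waits for the first
     * DRAM access plus the pipeline refill. Same numbers as in the text. */
    #include <stdio.h>

    int main(void) {
        int dram_first  = 4;  /* clocks: first burst read from DRAM        */
        int pipe_refill = 3;  /* clocks: refill the trashed 3-stage pipe   */
        int dram_burst  = 2;  /* clocks: each subsequent word of the burst */

        printf("first instruction after branch + miss: %d clocks\n",
               dram_first + pipe_refill);  /* 7 clocks */
        printf("next sequential instruction: %d clocks\n",
               dram_burst);                /* already on its way to the cache */
        return 0;
    }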

A lot of programs are not purely sequential but contain loops or reused subroutines, where a cache (and further levels of cache) can improve performance. If a program were strictly sequential, the cache would have no purpose: the processor would run at the speed of DRAM read access.
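A sketch of why loops love caches: the first pass pays the DRAM cost, every later pass hits at full speed (the loop size, iteration count and timings are assumptions for illustration):

    /* Sketch: why loops love caches. The first pass through a small loop
     * misses and pays the DRAM cost; every later pass hits at full speed.
     * Loop size, iteration count and timings are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        int    loop_insns = 16;   /* assumed: loop body fits easily in cache */
        int    iterations = 1000;
        double miss_cost  = 4.0;  /* clocks/fetch on the first, cold pass    */
        double hit_cost   = 1.0;  /* clocks/fetch once the loop is cached    */

        double total = loop_insns * miss_cost
                     + (double)loop_insns * (iterations - 1) * hit_cost;
        printf("average fetch cost: %.3f clocks/instruction\n",
               total / (loop_insns * iterations));  /* ~1.003: essentially 1 */
        return 0;
    }

Amortized over a thousand iterations, the cold-start misses all but disappear.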

With no pipelining, sequential code and no burst capability, your processor runs at 25 MHz, with 3 wait states inserted into each instruction. With pipelining, loopy code (that fits in the 100 MHz registers or 100 MHz cache) and burst DRAM read capability, the processor runs at 100 MHz and 1 clock cycle/instruction.
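The two extremes, side by side:

    /* The two extremes above, expressed as instruction rates. */
    #include <stdio.h>

    int main(void) {
        double f_cpu = 100e6;  /* 100 MHz processor clock */

        printf("no cache, no burst:    %.0f Minstr/s\n",
               f_cpu / 4.0 / 1e6);  /* 4 clocks/fetch: 1 + 3 wait states -> 25 */
        printf("cache hits, pipelined: %.0f Minstr/s\n",
               f_cpu / 1.0 / 1e6);  /* 1 clock/instruction -> 100 */
        return 0;
    }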