Electronic – MIPS pipelining and data hazards

cpu

I'm trying to understand pipelining on a 32 bit 5 cycle MIPS chip. If register writes were not performed in the first half of the cycle and reads in the second half, how many additional pipeline bubbles would be caused by data dependencies?

Best Answer

I'm assuming a few things because I'm mot sure that you gave enough info for an exact answer. I will describe a system that will allow me to give you an answer.

Suppose this MIPS chip has an ADD instruction that adds two registers, and stores the result in a third. In the pipeline, the data would be read from the register file on cycle 3, and the result would be written on cycle 5. The assembler code for adding three numbers would generate a hazard:

(Add registers $11, $12, and $13 and store the result in register $10)

add $10, $13, $12   # $13 + $12 -> $10 
add $9, $10, $11   # $11 + $10 -> $9

The code has a Read after Write hazard on register $10. The processor would detect the data hazard and stall the second addition one cycle. This assumes that each cycle begins with a write, and ends with a read. If the order was reversed, then the data wouldn't be in the register in time for the 5th clock reading time, and an additional stall cycle would be necessary to get the correct result. Below is an example time-line for a "Write, then Read" system.

Read first add instruction
Read second add instruction
Read operands for first add instruction
Pipeline stall for second add
Write result of first add, read operands for second add
...
Write result of second add

Wikipedia has a good article about pipeline hazards

Related Solutions

Electronic – Does CPU pipelining always imply a multicycle datapath

A pipelined CPU implies a multi-cycle datapath, precisely because it takes five clock cycles for an instruction to go from Fetch to Writeback.

Where I'm getting confused is here "unlike the multi cycle cpu, the pipelined datapath requires that every instruction use all five stages of execution."

You should finish reading the next paragraph you're quoting. That requirement is just to prevent two instructions from finishing at the same time.

suppose we use the latencies from our multi-cycle cpu, and we try to run a load instruction followed by an add instruction. the load instruction will require five cycles to execute, and the add instruction will require four cycles. so, if we start running the load instruction on cycle 1, it will finish execution on cycle 5. we are pipelining, so we can start running the add instruction on cycle 2, and it will finish on cycle 5. this is a problem: we have two instructions finishing on cycle 5: they will both try to write to the register file on cycle 5. this is a problem, because our register file only has one write port.

Electronic – How to a CPU deliver more than one instruction per cycle

First, as Keelan's comment and Turbo J's answer point out, the measurement was 113,093 Dhrystone MIPS not native MIPS.

The Ivy Bridge microarchitecture of the i7 3630QM can only commit 4 fused µops per cycle, though it can begin execution of 6 µops per cycle. (The number of fused µops in a trace of code is roughly equal to the number of instructions; some complex instructions are decoded into multiple µops that are not fused and some pairs of instructions can be fused into a single µop, e.g., a compare immediately followed by a conditional jump.)

Two of your speculations on how multiple instructions can be executed in a single cycle are quite valid and have been used in actual processors. Your first speculation, that a faster internal clock is used, was used in the original Pentium 4's "fireball" ALUs. These ALUs were clocked at twice the frequency of the rest of the core, which was already relatively high.

(This was accomplished by using a staggered ALU in which the lower half of an addition was done in one cycle, allowing a dependent operation to use the lower half of the result in the next cycle. For operations like add, xor, or left shift which only need the lower half of operands to produce the full lower half of the result, such staggering—also known as width-pipelining—allows single cycle result latency as well as single cycle throughput.)

A somewhat related technique, cascaded ALUs, was used by the HyperSPARC. The HyperSPARC fed the results from two ALUs into a third ALU. This allowed two independent and a third dependent operation to be executed in a single cycle.

Your speculation that "there are multiple concurrent pipelines per core" is the other technique that has been used. This type of design is called superscalar and is by far the most common means of increasing the number of operations executed in a single cycle.

There are also a few other odds and ends of instruction execution that might be worth noting. Some operations can be more efficiently performed outside of the ordinary execution units. The technique of move elimination exploits the use of register renaming in out-of-order processors to perform move operations during register renaming; the move simply copies the physical register number from one position in the renaming table (called a register alias table) to another. Not only does this effectively increase execution width but it also removes a dependency. This technique was used early with the stack-based x87, but is now broadly used in Intel's high performance x86 processors. (The use of destructive, two-operand instructions in x86 makes move elimination more helpful than it would be in a typical RISC.)

A technique similar to move elimination is the handling of register zeroing instructions during renaming. By providing a register name that provides the zero value, a register clearing instruction (like xor or subtract with the both operands being the same register) can simply insert that name into the renaming table (RAT).

Another technique used by some x86 processors reduces the cost of push and pop operations. Ordinarily an instruction using the stack pointer would have to wait a full cycle for a previous push or pop to update the value for the stack pointer. By recognizing that push and pop only add or subtract a small value to the stack pointer, one can compute the results of multiple additions/subtactions in parallel. The main delay for addition is carry propagation, but with small values the more significant bits of the base value—in this case the stack pointer—will only have at most one carry-in. This allows an optimization similar to that of a carry-select adder to be applied to multiple additions of small values. In addition, since the stack pointer is typically only updated by constants, these updates can be performed earlier in the pipeline separately from the main execution engine.

It is also possible to merge instructions into a single, more complex operation. While the reverse process of splitting instructions into multiple, simpler operations is an old technique, merging instructions (which Intel terms macro-op fusion) can allow the implementation to support operations more complex than those exposed in the instruction set.

On the theoretical side, other techniques have been proposed. Small constants other than zero could be supported in the RAT and some simple operations that use or reliably produce such small values could be handled early. ("Physical Register Inlining", Mikko H. Lipasti et al., 2004, suggested using the RAT as a means of reducing register count, but the idea could be extended to support loading small immediates and simple operations on small numbers.)

For trace caches (which store sequences of instructions under particular assumptions of control flow), there can be opportunities to merge operations separated by branches and remove operations that produce unused results in the trace. The caching of the optimizations in a trace cache can also encourage performing optimizations such as instruction merging which might not be worthwhile if they had to be done each time the instruction stream was fetched.

Value prediction can be used to increase the number of operations that can be executed in parallel by removing dependencies. A stride-based value predictor is similar to the pop/push optimization of a specialized stack engine mentioned earlier. It can compute multiple additions mostly in parallel, removing the serialization. The general idea of value prediction is that with a predicted value, dependent operations can proceed without delay. (Branch direction and target prediction is effectively just a very limited form of value prediction, allowing the fetching of following instructions which are dependent on the "value" of the branch—taken or not—and the next instruction address, another value.)

Best Answer

Related Solutions

Electronic – Does CPU pipelining always imply a multicycle datapath

Electronic – How to a CPU deliver more than one instruction per cycle

Related Topic