Electronic – Are depth and number of stages the same measure for a CPU pipeline

computer-architecture, computers, cpu, hardware

Is it true that the depth of a CPU pipeline and the number of stages of a CPU pipeline are different measures? There is not much information about this when I google or look in my books. I think that depth is a measure of the overlapping of instructions, while the number of stages is a hardware constant. When you increase the number of stages, you usually make the CPU faster, but with diminishing returns. I looked at Amdahl's law about this and at the book "Computer Organization and Design" by Patterson and Hennessy.
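
To illustrate what I mean by diminishing returns, here is a rough sketch (my own toy numbers, nothing from the book): the per-stage latch/skew overhead is fixed, so as the stages get shorter it eats a growing fraction of the clock period.

```python
# Toy model of pipeline speedup vs. number of stages.
# T_LOGIC and T_OVERHEAD are made-up illustrative values.

T_LOGIC = 20.0     # total logic delay of the unpipelined datapath (arbitrary units)
T_OVERHEAD = 1.0   # latch + clock-skew overhead added to every stage (assumed)

def speedup(n_stages: int) -> float:
    """Ideal throughput gain over the unpipelined design, ignoring stalls."""
    cycle_time = T_LOGIC / n_stages + T_OVERHEAD
    return T_LOGIC / cycle_time

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} stages -> {speedup(n):5.2f}x")
```

Going from 1 to 2 stages nearly doubles the speedup, but going from 16 to 32 adds much less, because the overhead term dominates.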

The more stages, the greater the depth, but it is stated that there can be an optimal number of stages, or an optimal depth:

According to M.S. Hrishikesh et al. (29th International Symposium on Computer Architecture):

What distinguishes pipeline depth from the number of pipeline stages is the logic depth per pipeline stage, which is optimally about 6 to 8 FO4 inverter delays. Decreasing the amount of logic per pipeline stage increases pipeline depth, which in turn reduces IPC due to increased branch misprediction penalties and functional unit latencies. In addition, reducing the amount of logic per pipeline stage reduces the amount of useful work per cycle while not affecting overheads associated with latches, clock skew and jitter. Therefore, shorter pipeline stages cause the overhead to become a greater fraction of the clock period, which reduces the effective frequency gains.
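
To check my understanding of the quoted argument, here is a toy model that extends the sketch above with the IPC loss from branch mispredictions (every constant is my own guess, not a number from the paper): frequency keeps rising with more stages, but the flush penalty grows with depth, so there is an optimum.

```python
# Toy model of the "optimal depth" trade-off described in the quote.
# All constants are illustrative assumptions, not numbers from the paper.

T_LOGIC = 20.0                # total logic delay (arbitrary units)
T_OVERHEAD = 1.0              # per-stage latch/skew/jitter overhead (assumed)
MISPREDICTS_PER_INSTR = 0.05  # assumed branch-misprediction rate

def performance(n_stages: int) -> float:
    cycle_time = T_LOGIC / n_stages + T_OVERHEAD      # shorter stages -> faster clock
    penalty = n_stages                                # flush cost grows with depth
    ipc = 1.0 / (1.0 + MISPREDICTS_PER_INSTR * penalty)
    return ipc / cycle_time                           # instructions per unit time

best = max(range(1, 41), key=performance)
print("optimum in this toy model:", best, "stages")
```

With these made-up numbers the optimum lands around 20 stages; the point is only that an optimum exists, not the specific value.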

Best Answer

I would argue that "depth" is a measure of instruction overlap in the sense that it indicates the amount of time (number of clock cycles) that must elapse before the result of one instruction can be used by a subsequent instruction.
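
As a rough illustration of depth as latency (assumed numbers, not tied to any particular CPU):

```python
# Minimal illustration of "depth" as result latency: a result becomes usable
# DEPTH cycles after its producing instruction issues, so a dependent
# instruction that would otherwise issue earlier has to stall.
DEPTH = 4  # assumed result latency in cycles

def earliest_issue(producer_issue_cycle: int, desired_issue_cycle: int) -> int:
    result_ready = producer_issue_cycle + DEPTH
    return max(desired_issue_cycle, result_ready)

# Back-to-back dependent instructions: the consumer wants cycle 1 but gets cycle 4.
print(earliest_issue(producer_issue_cycle=0, desired_issue_cycle=1))  # -> 4
```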

However, there might be additional hardware stages (instruction prefetch and decode, memory write, etc.) that do not contribute to this latency, so the "number of stages" could very well be larger than the "depth".

The "optimal depth" concept arises from the fact that doing a small amount of work in each clock cycle (having a large number of stages) allows a higher clock frequency, but also increases the depth (latency). Eventually, this becomes a liability, because it can get to the point where the compiler has no useful instructions that it can schedule to fill the gaps imposed by the latency.

This was not much of an issue for early supercomputers, because they were focused on processing large arrays or vectors of data that had no data dependencies among the individual operations, and so they could afford to have deep pipelines that ran at high frequencies. However, "random" calculations on scalar data generally have more data dependencies and much shorter latency requirements, which leads to the need to keep the pipeline depth relatively small. If not dealt with, this "scalar bottleneck" can severely limit the overall performance of an application, which is what Amdahl's Law is all about.
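
For reference, Amdahl's Law quantifies that bottleneck: if only a fraction p of the work benefits from the fast (e.g. vector) pipeline by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A quick example with assumed numbers:

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Assumed example: 90% of the work runs 10x faster in a deep vector pipeline,
# but the remaining 10% of scalar work caps the overall gain near 5x.
print(round(amdahl_speedup(p=0.9, s=10.0), 2))  # -> 5.26
```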