Electrical – Understanding branch delay slot and branch prediction prefetch in instruction pipelining

computer-architecture, microprocessor, mips, processor

Let me define:

  • Branch delay slot: typically the assembler reorders instructions, moving an instruction to the position immediately after the branch instruction, such that the moved instruction is always executed, regardless of whether the branch is taken or not, without leaving the machine in an inconsistent state.

  • Branch prediction prefetch: predicting the outcome of the branch condition and prefetching instructions from the predicted target, so that they can execute immediately after the branch instruction.
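To make the first definition concrete, here is a minimal sketch of how an assembler might fill a delay slot. The function name, the string-based instruction format, and the naive dependency check (substring match on the destination register) are all illustrative assumptions; a real assembler works on parsed machine instructions:

```python
# Hedged sketch: delay-slot filling in an assembler.
# Instruction format and dependency check are simplified for illustration.

def fill_delay_slot(instrs, branch_idx):
    """Move the nearest earlier instruction that the branch does not
    depend on into the slot right after the branch; otherwise insert
    a NOP so the slot still executes harmlessly."""
    branch = instrs[branch_idx]
    for i in range(branch_idx - 1, -1, -1):
        cand = instrs[i]
        # Naive independence test: the branch must not read the register
        # that the candidate writes (first operand = destination here).
        dest = cand.split()[1].rstrip(",")
        if dest not in branch:
            return (instrs[:i] + instrs[i + 1:branch_idx + 1]
                    + [cand] + instrs[branch_idx + 1:])
    # No safe instruction found: pad the slot with a NOP.
    return instrs[:branch_idx + 1] + ["nop"] + instrs[branch_idx + 1:]

prog = [
    "add $t0, $t1, $t2",   # independent of the branch -> can move
    "sub $t3, $t4, $t5",   # writes $t3, which the branch tests -> cannot move
    "beq $t3, $zero, done",
]
print(fill_delay_slot(prog, 2))
# -> ['sub $t3, $t4, $t5', 'beq $t3, $zero, done', 'add $t0, $t1, $t2']
```

The `add` lands after the `beq` and executes unconditionally, which is exactly the "always executed, taken or not" property of the slot.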

Now, let's consider the execution sequence below (F: instruction Fetch, D: instruction Decode, X: eXecute, M: Memory access, W: Write back):

BRANCH   F   D   X   M   W
INSTR1       F   D   X   M   W
INSTR2           F   D   X   M   W
INSTR3               F   D   X   M   W
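The timing in the diagram can be checked with a small sketch. The function below is my own illustration, assuming a classic 5-stage pipeline issuing one instruction per cycle with no stalls:

```python
# Sketch: which stage each instruction occupies in each cycle of a
# classic 5-stage pipeline (F D X M W), one instruction issued per cycle.

STAGES = ["F", "D", "X", "M", "W"]

def stage_of(instr_idx, cycle):
    """Stage occupied by instruction `instr_idx` in `cycle` (0-based),
    or None if the instruction is not in the pipeline that cycle."""
    pos = cycle - instr_idx
    return STAGES[pos] if 0 <= pos < len(STAGES) else None

# When BRANCH (instruction 0) reaches X in cycle 2, the next two
# instructions have already entered the pipeline:
print(stage_of(0, 2))  # X  <- branch condition evaluated here
print(stage_of(1, 2))  # D  <- INSTR1 already decoded
print(stage_of(2, 2))  # F  <- INSTR2 already fetched
```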

Usually the branch condition is evaluated in the X stage. By then, INSTR1 and INSTR2 have already entered the pipeline, and these are the instructions affected by our choice of whether to use branch delay slots, branch prediction prefetch, or both. I did not find any text that discusses this clearly, so I tried to guess as below:

  • When we use both, then instruction sequence would be:

    BRANCH: branch-instruction
    INSTR1: branch-delay-slot
    INSTR2: branch-prediction-prefetch
    
  • When we use only branch prediction, then instruction sequence would be:

    BRANCH: branch-instruction
    INSTR1: branch-prediction-prefetch-1
    INSTR2: branch-prediction-prefetch-2
    
  • When we use only branch delay slots, then instruction sequence would be:

    BRANCH: branch-instruction
    INSTR1: branch-delay-slot-1
    INSTR2: branch-delay-slot-2
    

Am I correct with this? Is this how it actually happens in the different cases? Or are there more details?

Best Answer

Yes, that could be what would happen, although I don't recall any architecture that combined prediction and delay slots: if you have prediction, it can run (as a lookup in a small memory) in parallel with the execution step, so no delay slots would be needed.
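To illustrate the "lookup in a small memory" point: the prediction is just a table lookup keyed by the fetch PC, so it is available in the F stage, cycles before the branch executes in X. The 2-bit saturating-counter scheme below is a standard textbook predictor; the class name and table size are my own choices:

```python
# Sketch: a 2-bit saturating-counter branch predictor.  The predict()
# lookup is cheap enough to run at fetch time, in parallel with the
# rest of the pipeline, which is why no delay slots are needed.

class TwoBitPredictor:
    def __init__(self, entries=16):
        self.entries = entries
        # Counter states: 0,1 = predict not-taken; 2,3 = predict taken.
        self.table = [1] * entries

    def predict(self, pc):
        """Table lookup usable in the F stage."""
        return self.table[pc % self.entries] >= 2

    def update(self, pc, taken):
        """Called when the branch actually resolves in the X stage."""
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bp = TwoBitPredictor()
pc = 0x40
for outcome in [True, True, True]:   # a loop branch, taken repeatedly
    bp.update(pc, outcome)
print(bp.predict(pc))  # True: fetch can follow the taken path immediately
```

The 2-bit hysteresis means a single mispredicted iteration (e.g. a loop exit) does not flip the prediction for the next run of the loop.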