Electronic – MIPS (PIC32): branch vs. branch likely

assemblymips

It's been a while since I've looked at the recent Microchip processors & I've been trying to learn a little bit about the PIC32 MIPS instruction set. I noticed there are two sets of branch instructions; the programming guide says this:

4.1.3.2 Branch Delays and the Branch Delay Slot

All branches have an architectural delay of one instruction. The instruction immediately following a branch is said to be
in the branch delay slot. If a branch or jump instruction is placed in the branch delay slot, the operation of both
instructions is undefined.

By convention, if an exception or interrupt prevents the completion of an instruction in the branch delay slot, the
instruction stream is continued by re-executing the branch instruction. To permit this, branches must be restartable;
procedure calls may not use the register in which the return link is stored (usually GPR 31) to determine the branch target
address.

4.1.3.3 Branch and Branch Likely

  • Branch instructions execute the instruction in the delay slot.

  • Branch likely instructions do not execute the instruction in the delay slot if the branch is not taken (they are said to
    nullify the instruction in the delay slot).

Although the Branch Likely instructions are included in this specification, software is strongly encouraged to
avoid the use of the Branch Likely instructions, as they will be removed from a future revision of the MIPS
Architecture.

Could someone clarify what they're saying here? Do they differ in the # of CPU cycles taken in each case?

I'm used to the branching architectures of the 8- and 16-bit PICs ("branch skip") where you either execute or jump over an instruction (this sounds like Branch Likely), and I'm used to the branching architecture of the TI C2000 DSPs (where all you get is a branch or branch conditional to an address). But I don't quite understand how to map the MIPS branch instructions onto my existing knowledge.

Best Answer

MIPS is one of several RISC (reduced instruction set computers) architectures that are designed to execute one instruction per clock cycle. In order to achieve this, the original MIPS processors had a five-stage pipeline:

enter image description here

The abbreviations are in the above figure are: IF (Instruction Fetch), RD (Read from register file), ALU (Execute instruction in Arithmetic Logic Unit), MEM (Read/write Memory access), WB (Write back to register file). The vertical axis is successive instructions; the horizontal axis is time.

Because the MEM stage occurs after the ALU stage, RISC machines like MIPS don't do arithmetic or logical operations on memory, but only on registers. For this reason they are also referred to as load/store architectures.

There are several hazard conditions where the pipeline can stall and cause a penalty in the over instructions per cycle (IPC) value. A data hazard occurs, for example, when an instruction attempts to use data in one of the registers before it has been loaded into the register. For example:

lw $3, 100($2)
add $1, $2, $3

The data is not loaded until the MEM stage of the first instruction, which is too late for it to be available for the EX stage of the second instruction.

Control hazards occur because on any branch taken, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch IPC penalty.

The solution for the MIPS architecture was the "Branch Delay Slot": always fetch the instruction after the branch, and always execute it, even if the branch is taken. This gets a little weird when writing MIPS assembly code, because when you are reading it, you have to take into account the instruction after the branch is always going to be executed. The trick in writing efficient code is to put in an instruction that will be useful as part of the loop that is being taken executed, but do no harm if the branch is not taken.

The MIPS designers were counting on compiler writers to write clever enough code generators to handle this efficiently. However many do not (including Microchips C32 compiler, based on GCC), and just put NOP's after every branch, wasting both code space and cycles.

So in the R4000 architecture, MIPS added Branch Likely instructions which still always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is taken (opposite of what one might expect). Compilers can then always fill the branch delay slot on such a branch.

A loop like:

loop:
    first instruction
    second instruction
    ...
    blez t0, loop
    nop

can be turned into:

loop:
    first instruction
loop2:   
    second instruction
    ...
    blez t0, loop2
    first instruction

The repeated "first instruction" after the branch is always executed if the branch is taken (and becomes part of the next go-around of the loop. This instruction is ignored if the branch is not taken (incurring a slight IPC penalty).

However as it turns out, trying to include this feature in high-performance designs has been a pain in the neck due to the complexity in getting rid of the result of the ignored instruction. Therefore it has been deprecated.