IU : Pipelined : Stall and Bypass

Integer registers are read at the beginning of the EXECUTE stage and are updated at the end of the WRITE stage.

Several instructions using the same registers as sources and/or as destinations may overlap. After checking which instructions depend on one another, we have two options:

  • Forward results as soon as they are available. This is the bypass method. PIPE5 does that for all the integer arithmetic/logic instructions.
  • Stop the pipeline until the instruction that updates the dependent register has passed the WRITE stage. This is the stall method. It is used when a load from memory is followed by an instruction that depends on the data read.
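
As a rough illustration of these two options, here is a minimal Python sketch (not the PIPE5 VHDL) that classifies the dependency between two back-to-back instructions. The `Instr` record, its `op` classes and the `resolve` function are hypothetical names invented for this example. Registers are plain 5-bit numbers and only the load-followed-by-use case is treated as a stall.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    op: str                      # "alu" or "load" (hypothetical instruction classes)
    rd: Optional[int]            # destination register, None if nothing is written
    rs1: Optional[int] = None    # first source register
    rs2: Optional[int] = None    # second source register


def resolve(producer: Instr, consumer: Instr) -> str:
    """Return "none", "bypass" or "stall" for two consecutive instructions."""
    # R0 is hard-wired to zero, so writing it never creates a dependency.
    if producer.rd is None or producer.rd == 0:
        return "none"
    if producer.rd not in (consumer.rs1, consumer.rs2):
        return "none"
    # Assumption: load data arrives too late to be forwarded to the next instruction.
    return "stall" if producer.op == "load" else "bypass"
```

For instance, `resolve(Instr("alu", 3, 1, 2), Instr("alu", 5, 3, 4))` models ADD R1,R2,R3 followed by ADD R3,R4,R5 and reports a bypass.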


Integer registers are arranged in a 136*32 memory (when we have 8 register windows). They use stall and bypass.
Our current floating point unit, FPU_SIMPLE, is not pipelined: there is no bypassing and there are no dependency checks.
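
The 136-entry figure for the integer register file follows from the SPARC window organisation: 8 globals, plus 16 registers stored per window, since the ins of one window are the outs of its neighbour. The sketch below shows one possible mapping from a window pointer and a 5-bit register number to a file index. The layout and the names `physical_index`, `GLOBALS` and `PER_WINDOW` are assumptions made for illustration; the actual PIPE5 mapping may differ.

```python
NWINDOWS = 8       # number of register windows, as in the text
GLOBALS = 8        # R0..R7 are shared by every window
PER_WINDOW = 16    # 8 outs + 8 locals stored per window; ins alias a neighbour's outs

TOTAL = GLOBALS + NWINDOWS * PER_WINDOW   # 8 + 8 * 16 = 136


def physical_index(cwp: int, r: int) -> int:
    """Map (current window pointer, 5-bit register number) to a register file index."""
    if not (0 <= cwp < NWINDOWS and 0 <= r < 32):
        raise ValueError("cwp or register number out of range")
    if r < GLOBALS:
        return r   # globals are not windowed
    # outs = R8..R15, locals = R16..R23, ins = R24..R31
    windowed = (cwp * PER_WINDOW + (r - GLOBALS)) % (NWINDOWS * PER_WINDOW)
    return GLOBALS + windowed
```

With this layout, the outs of window 3 and the ins of window 2 share the same entries, which is what lets SAVE (which decrements the window pointer) pass arguments without copying them. Every index is below 136 and therefore fits in 8 bits.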

The remaining registers of the IU: PSR, WIM, TBR and RY, are managed a bit differently:

  • PSR: Processor Status
  • WIM: Window Invalid Mask
  • TBR: Trap Base register
  • RY: Register Y. Holds bits [63:32] of integer multiplication results and of division dividends.

WIM and TBR are seldom updated and do not need fast read/write access. They are modified by the WRWIM and WRTBR instructions at the WRITE stage. The SPARCv8 standard imposes that the three instructions following WRWIM or WRTBR must not depend on the register value as it is ‘undefined’ during that time (page 131).
This software constraint was made to simplify hardware: as long as the pipeline is not too deep, neither dependency checks nor bypass hardware are needed. Of course, this is a stupid constraint (cough, NetBSD bug, cough).
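
To make the constraint concrete, here is a small Python sketch of a checker that flags code reading WIM or TBR within the three instructions following a write. The program representation (a list of mnemonic and register-read pairs) and the function name are hypothetical. Which instructions count as readers is left to the caller; in the usage example SAVE is treated as a reader of WIM, since window overflow is detected against it.

```python
DELAY = 3  # instructions after a write during which the register is undefined

WRITES = {"WRWIM": "WIM", "WRTBR": "TBR"}


def delayed_write_violations(program):
    """program: list of (mnemonic, state register read or None).

    Returns the indices of instructions that read a register too early.
    """
    last_write = {}   # register name -> index of its most recent write
    bad = []
    for i, (mnemonic, reads) in enumerate(program):
        if reads in last_write and i - last_write[reads] <= DELAY:
            bad.append(i)
        if mnemonic in WRITES:
            last_write[WRITES[mnemonic]] = i
    return bad
```

For example, in `[("WRWIM", None), ("NOP", None), ("SAVE", "WIM"), ("NOP", None), ("SAVE", "WIM")]` only the first SAVE is reported, as the second one is four instructions away from the write.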

RY is used for integer multiplication and division. The WRY instruction has constraints similar to those of WRWIM and WRTBR. This three-instruction delay is even less acceptable, as multiplications and divisions are used more often, and in user code.

PSR also has the same constraints with the WRPSR instruction, but it is the most complex to implement. This register gathers many different and unrelated flags:

  • Register window index
  • User/Supervisor mode
  • Interrupt masking
  • NZVC flags

For example, the NZVC flags are used for conditional branches: a CMP instruction is followed by a Branch<equal, higher, lower, carry, overflow…>. Imposing a three-instruction delay between a compare, which updates the flags, and a branch, which uses them, would give awful performance.
The PSR register is therefore updated early, during the EXECUTE stage, which maintains a working copy of this register. The values are then propagated through the pipeline to the WRITE stage, which keeps the ‘final’ value of the register for committed instructions. When a trap occurs, the PSR value is recovered from the “last good version” at the WRITE stage.
The CPU does the same for RY.
This method is used in all 32-bit SPARCs; you can see it in the pipeline diagrams of the MicroSparc, SuperSparc, SparcLite, LEON…
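
The working copy / committed copy scheme can be modelled in a few lines of Python. This is only a behavioural sketch under simplifying assumptions (every in-flight instruction updates the register and instructions retire in order); the class and method names are invented for the example and do not come from the PIPE5 sources.

```python
class ShadowedRegister:
    """Hypothetical model of a register with an early copy and a committed copy."""

    def __init__(self, value: int = 0) -> None:
        self.working = value      # updated in EXECUTE, seen by younger instructions
        self.committed = value    # updated in WRITE, for committed instructions only
        self.in_flight = []       # values travelling from EXECUTE to WRITE

    def execute(self, new_value: int) -> None:
        # An instruction computes a new value: it is visible immediately.
        self.working = new_value
        self.in_flight.append(new_value)

    def write(self) -> None:
        # The oldest in-flight instruction reaches WRITE and commits its value.
        self.committed = self.in_flight.pop(0)

    def trap(self) -> None:
        # Squash uncommitted instructions and recover the last good version.
        self.in_flight.clear()
        self.working = self.committed
```

A compare followed by a branch reads `working`, so no delay is needed, while a trap simply discards whatever has not yet reached WRITE.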

Bypass logic

Bypass is made of comparators and multiplexers.
The comparators detect instruction dependencies by matching the destination register RD with the source registers RS1 and RS2 of following instructions.
R0 always reads as zero, so a match on R0 must be ignored.
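
As a sketch of what the comparators compute for one source operand, the Python function below looks for the most recent older instruction whose destination matches, and returns which result the multiplexer should select. The function name and the "youngest first" list convention are assumptions for this example, and plain register numbers stand in for the full register file indices.

```python
from typing import Optional, Sequence


def bypass_select(rs: int, in_flight_rd: Sequence[Optional[int]]) -> Optional[int]:
    """Return the position of the result to forward for source rs, or None.

    in_flight_rd lists the destination registers of older instructions,
    youngest first; None marks an instruction that writes no register.
    """
    if rs == 0:
        return None       # R0 always reads as zero and is never bypassed
    for position, rd in enumerate(in_flight_rd):
        if rd is not None and rd == rs:
            return position   # first match is the most recent writer
    return None           # no match: use the value read from the register file
```

When two older instructions write the same register, the youngest one wins, which is why the comparison results feed a priority multiplexer rather than a simple OR.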

Because of the windowed register file, we have 136 integer registers, which requires 8-bit comparators. The SAVE and RESTORE instructions take one cycle to keep subroutine calls fast (which is the whole point of register windows). We could limit the comparators to 5 bits (31 registers) by draining the pipeline each time these instructions are used.

The comparators are in the DECODE stage. The multiplexers are in the EXECUTE stage. Or maybe not.
The EXECUTE stage is quite complex, as the register contents must pass through the operators and multiplexers of the ALU, multiplier/divider, shifter… The bypass multiplexers are placed before the ALU, which adds delay. It is possible to move a part of these multiplexers to the DECODE stage. When two consecutive instructions use the same register, the result must be immediately re-injected. When there is a gap, the bypassing can be done earlier, in the DECODE stage.

ADD R1,R2,R3
ADD R3,R4,R5



With the BYPASS_DEC constant, we can choose either implementation for PIPE5.

PIPE5 bypass options
On Spartan6 FPGAs, where 4- or 5-wide multiplexers can be implemented efficiently, there is no real difference: moving part of the bypass to the DECODE stage does not really change the propagation time of the EXECUTE stage. With longer pipelines and larger multiplexers, it could make a difference (see LEON3).


Stalls are managed at the DECODE stage. The following circumstances stop the pipeline:

  • First, the “natural” flow of the pipeline. Delays in memory accesses, multi-cycle instructions…
  • Integer register dependencies which cannot be solved through bypassing. Typically using a value after a load.
  • Floating point dependencies. A conditional branch on floating point condition cannot be executed while a floating point compare instruction is still being calculated (fcc[], fccv signals).
  • Floating point synchronisation. FPU load and store instructions depend on both the integer and floating point units.
  • BSD bug avoidance for WRPSR/WRWIM instructions.
  • JMPL and RETT. A JMPL at the execute stage must wait until the instruction in the DECODE stage is known. Because some things are crazy.


The purpose of the hardware described here is to hide instruction dependencies.
Externally, the CPU behaves as if each instruction were executed in sequence, independently, and this is exactly what the software expects.

Exposing internal pipeline details, like the three-instruction delay, the branch delay slot, or the lack of interlocking of first generation MIPS processors, is bad for general purpose CPUs. Not only is the burden on software too high, but different microarchitectures have different constraints and will still have to keep software compatibility.
