IU : Pipelined : PIPE5 presentation

This is the traditional RISC implementation, the one you find in hundreds of CPU cores, thousands of books, millions of computer architecture courses, gazillions of chips.
The complex parts are not in the pipeline itself. Problems arise when you have multi-cycle instructions, when it must cooperate with an FPU, when there are side effects between instructions, when you handle all kinds of exceptions and want to use an MMU, when you plug in a debugger, when you try to push the frequency higher, when it is a SPARC.

[Figure: PIPE5 pipeline diagram]

Ceci n’est pas une pipe

The pipeline is not a unidirectional flow where instructions enter on one side and are dumped out the other. Many signals travel backwards, making each stage dependent on the others. For example, in our case, the program counter is managed by the DECODE and EXECUTE stages and addresses are forwarded back to the FETCH stage. Bypassing also moves data backwards relative to the instruction flow.

  • First, the instruction must be extracted from memory.
  • Second, the meaning of that instruction must be determined to figure out what to do.
  • Third, the operations required to complete that instruction must be performed : additions, logic operations…
  • Fourth, access data memory if necessary.
  • Fifth, the instruction result may be written back if nothing abnormal has occurred.

Simpler pipelines exist (3 or 4 stages) but they usually sustain a lower frequency or need several cycles for data memory accesses. For example, as effective addresses are calculated during the third (EXECUTE) stage, memory accesses could be started from there.
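The five steps above can be sketched as a toy model. This is an illustrative Python simulation, not the VHDL source: each instruction enters FETCH and advances one stage per cycle, so once the pipeline is full one instruction retires per cycle.

```python
# Toy model of the five-stage flow (stage names follow the article).
STAGES = ["FETCH", "DECODE", "EXECUTE", "MEMORY", "WRITE"]

def run(program):
    """Return, per cycle, which instruction occupies each stage."""
    timeline = []
    n_cycles = len(program) + len(STAGES) - 1
    for cycle in range(n_cycles):
        # Instruction i enters FETCH at cycle i and reaches WRITE at cycle i+4.
        row = {s: (program[cycle - d] if 0 <= cycle - d < len(program) else None)
               for d, s in enumerate(STAGES)}
        timeline.append(row)
    return timeline

trace = run(["add", "ld", "st", "ba"])
# The first instruction reaches WRITE at cycle 4 (5-cycle latency);
# afterwards one instruction completes per cycle.
```

Stalls, bypasses and traps are of course absent from this picture; they are exactly the backward-travelling signals described above.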

Pipeline stages

FETCH

The fetch stage controls the instruction bus. It pushes new accesses as they are requested by the following stages of the pipeline.
This block is very similar to the fetch part of IU_SEQ (described previously). It can buffer up to two instructions, a natural fit for the two instructions automatically loaded after a trap to refill the pipeline.
In addition to the address, the IU indicates for each access its current mode: User or Supervisor. The Memory Management Unit checks all accesses to prevent unauthorised behaviour. (The meaning is “I want to access that memory while being in [user|supervisor] mode”, not “I want to read that [user|supervisor] memory area”.)

More advanced CPUs have an autonomous pre-fetching machine that generates new accesses and lets the following stages decide whether these accesses were actually needed. Our FETCH unit is driven by the decoder and all fetched instructions are actually processed… except when a trap occurs.

DECODE

The decode stage is quite complex; it is responsible for:

  • Instruction decoding and classification.
  • Checking stalls and instruction dependencies.
  • Steering floating point instructions to the floating point unit.
  • Setting the register bank read ports.
  • Sequencing the multicycle instructions (LDD, STD, SWAP…).
  • Generating addresses to the FETCH block for the next instructions.

Conditional branch instructions in SPARC don’t need any register access and can be completed during the DECODE stage. Indirect jumps (including return from subroutine) require register accesses and are processed in the EXECUTE stage.

Instruction decoding is done by the decode() procedure (iu_pack.vhd) which categorises instructions into the “cat” structure :

  • mode : load/store/double/jmp…
  • unit : ALU/MDU/FPU/LSU
  • size : memory access size
  • m_reg, m_ry, m_psr… : modified registers
  • r_reg, r_ry, r_psr… : read registers
  • <plus several other silly fields>
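To make the classification concrete, here is a hedged Python sketch of the “cat” record and a toy decoder for two mnemonics. Field names follow the article; the enumerations, the space-separated operand syntax and the decoding rules are illustrative assumptions, not the actual VHDL types from iu_pack.vhd.

```python
from dataclasses import dataclass, field

@dataclass
class Cat:
    """Illustrative stand-in for the "cat" structure filled by decode()."""
    mode: str = "alu"          # load / store / double / jmp / ...
    unit: str = "ALU"          # ALU / MDU / FPU / LSU
    size: int = 4              # memory access size in bytes
    m_reg: set = field(default_factory=set)   # modified registers
    r_reg: set = field(default_factory=set)   # read registers
    m_psr: bool = False        # modifies the PSR / condition codes
    r_psr: bool = False        # reads the PSR

def decode(insn):
    """Toy decoder, just enough to categorise two mnemonics."""
    op, *args = insn.split()
    if op == "ld":                     # ld rs rd : load [rs] into rd
        rs, rd = args
        return Cat(mode="load", unit="LSU", r_reg={rs}, m_reg={rd})
    if op == "add":                    # add rs1 rs2 rd
        rs1, rs2, rd = args
        return Cat(mode="alu", unit="ALU", r_reg={rs1, rs2}, m_reg={rd})
    raise ValueError(op)
```

The read/modified register sets are what makes the dependency checks of the next paragraph a matter of simple set intersections.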

Instruction dependencies are checked at the decode stage, which stalls the pipeline when the problem cannot be solved by bypassing. For example, LD [R1],R2 followed by ADD R2,R3,R4 : the CPU must wait for the end of the memory access before starting the ADD.
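The stall decision can be sketched as a pure function. This is a hedged model of the load-use check, under the assumption that register operands are represented as plain Python sets; the real VHDL compares the cat fields of the DECODE and EXECUTE stages.

```python
def must_stall(dec_reads, exe_is_load, exe_writes):
    """True when DECODE has to wait a cycle instead of relying on bypass:
    a load result is not back from memory in time to be forwarded."""
    return exe_is_load and bool(dec_reads & exe_writes)

# LD [R1],R2 in EXECUTE, ADD R2,R3,R4 in DECODE -> must stall.
assert must_stall({"r2", "r3"}, exe_is_load=True, exe_writes={"r2"})
# ADD R1,R2,R3 then ADD R3,R4,R5 -> no stall, bypassing suffices.
assert not must_stall({"r3", "r4"}, exe_is_load=False, exe_writes={"r3"})
```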

EXECUTE

The EXECUTE stage does all the calculations of integer instructions. The central part is the ALU (op_alu() in iu_pack.vhd), which is complemented by a separate integer multiplication/division unit (iu_muldiv.vhd). The pipeline is stalled while a multicycle multiplication or division is being calculated. To manage dependent instructions, the execute stage also has bypass hardware.
Registers are updated at the WRITE stage. If two consecutive dependent instructions are fetched, for example ADD R1,R2,R3 followed by ADD R3,R4,R5, then without bypassing the CPU would have to wait at least 3 cycles until the first instruction completes and R3 is updated. With bypassing, these two instructions can be executed back to back.
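The bypass itself is essentially an operand multiplexer. Here is an illustrative sketch (not the actual hardware): the freshest in-flight result for a register wins over the stale register-bank read; `pending` is an assumed list of (destination, value) pairs, newest stage first.

```python
def read_operand(reg, bank, pending):
    """Select an operand: forward an in-flight result if one targets `reg`,
    otherwise fall back to the register bank."""
    for dst, value in pending:      # e.g. [EXECUTE result, MEMORY result]
        if dst == reg:
            return value            # bypassed: no stall needed
    return bank[reg]                # no in-flight writer: use the bank

bank = {"r3": 0, "r4": 7}
# ADD R1,R2,R3 just produced 12 in EXECUTE; ADD R3,R4,R5 reads R3 now:
assert read_operand("r3", bank, pending=[("r3", 12)]) == 12
assert read_operand("r4", bank, pending=[("r3", 12)]) == 7
```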

MEMORY

The MEMORY stage performs data accesses, just as the FETCH stage performs instruction accesses. For instructions that do not access data memory, well, there is nothing to do.

(Complex arithmetic/logic instructions like multiply, divide and shifts could be extended to the memory stage for higher throughput.)

For this simple CPU, a memory access is an unrecoverable action. Before starting any data access, the CPU checks that the instruction at the WRITE stage does not trap. Without that check, spurious data reads or writes could be triggered.
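That safety rule reduces to a one-line gate. A hedged sketch, with assumed boolean inputs (the real design reads the valid and trap fields of the pipeline records):

```python
def may_start_access(mem_valid, wri_valid, wri_trap):
    """Gate for the MEMORY stage: hold the data access while an older
    instruction, currently at WRITE, may still trap."""
    return mem_valid and not (wri_valid and wri_trap)

# Older instruction completes cleanly -> the access may start.
assert may_start_access(True, wri_valid=True, wri_trap=False)
# Older instruction traps -> the (unrecoverable) access is held back.
assert not may_start_access(True, wri_valid=True, wri_trap=True)
```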

WRITE

This stage updates the registers and manages traps.
The final register values of completed instructions are stored here, including the special registers PSR, RY and PC. When a trap occurs, the last known good values are restored from the WRITE stage bookkeeping: the CPU state just before the offending instruction is recovered.

A few important signals

All the pipelined signals are kept in VHDL records: pipe_dec, pipe_exe, pipe_mem, pipe_wri. Not all the signals defined in the record are used at every level of the pipeline; the unused ones are automatically pruned during logic synthesis.

  • The .v field (pipe_dec.v, pipe_exe.v, pipe_mem.v…) indicates that the corresponding level of the pipeline is not empty.
  • The .trap.t field indicates that a trapping condition has been detected for that instruction; the trap code is in .trap.tt.
  • The cycle signal is used for sequencing multicycle instructions (store, load double, store double, swap).
  • Some signals are propagated from one level to the next, others are kept, even if there is a stall cycle in between, with .v=0.
  • The as_xxx_c… signals push instructions through the pipeline: as_exe_c from DECODE to EXECUTE, etc. As for IU_SEQ, as_dec_c transfers instructions from FETCH to DECODE while na_c pushes new addresses from DECODE to FETCH.
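The valid-bit handshake can be modelled in a few lines. An illustrative sketch, assuming a stage is a plain dict with an `insn` payload and a `v` flag (the field names mimic the VHDL records, nothing more):

```python
def step(stage, upstream, stall):
    """One clock edge for a pipeline level: on a stall the payload is kept
    but the valid bit drops to 0 (a bubble); otherwise the as_xxx_c
    handshake accepts the upstream instruction."""
    if stall:
        return {**stage, "v": False}   # kept, with .v=0
    return dict(upstream)              # pushed from the previous level

dec = {"insn": "add", "v": True}
exe = step({"insn": None, "v": False}, dec, stall=False)
assert exe == {"insn": "add", "v": True}
bubble = step(exe, dec, stall=True)
assert bubble["v"] is False and bubble["insn"] == "add"
```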

FPU synchronisation

The FPU pipeline is separate from, yet synchronised with, the integer pipeline. “Real” mathematical instructions are handled autonomously by the FPU. Some other instructions, particularly the memory accesses (LDF, LDDF, STF, STDF), need direct synchronisation between the floating point register accesses and the integer pipeline. FPU instructions are fed from the DECODE stage and are handled simultaneously by the integer and floating point units. The FPU must wait until the instruction reaches the WRITE stage of the integer part before updating its registers, because a preceding instruction may trap, in which case the pending FP instruction would be cancelled. “fpu_val” commits FPU instructions while “fpu_flush” discards them.
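The commit/discard protocol amounts to a small in-order queue. A hedged model (the class and method bodies are assumptions; only the fpu_val/fpu_flush semantics come from the design):

```python
from collections import deque

class FpuQueue:
    """FP instructions issued at DECODE wait here until their integer
    twin reaches the WRITE stage."""
    def __init__(self):
        self.pending = deque()
        self.committed = []

    def issue(self, insn):
        self.pending.append(insn)

    def fpu_val(self):        # integer twin completed WRITE without a trap
        self.committed.append(self.pending.popleft())

    def fpu_flush(self):      # integer twin was cancelled by an older trap
        self.pending.popleft()

q = FpuQueue()
q.issue("fadds")
q.issue("ldf")
q.fpu_val()      # fadds commits: its registers may now be updated
q.fpu_flush()    # ldf is discarded: a trap intervened
assert q.committed == ["fadds"] and not q.pending
```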

Conditional branches on FP conditions (FBfcc instructions) also check that the floating point unit is synchronised with the integer unit.

SEQ

With PIPE5, most instructions last 5 cycles as they traverse each stage of the pipeline. For SEQ, ALU instructions last 2 cycles and branch instructions 1 cycle.
PIPE5 is faster because instructions overlap, approaching one cycle per instruction. SEQ can nevertheless beat PIPE5 in one situation: traps. Shorter pipelines provide shorter interrupt and trap latency. SEQ does not need to flush the pipe and discard partially completed instructions, nor wait for trapping instructions to reach the WRITE stage.
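A back-of-envelope comparison using the cycle counts above (stalls and branches ignored, so this is an idealised sketch, not a benchmark):

```python
def pipe5_cycles(n, depth=5):
    """Pipelined: first result after `depth` cycles, then one per cycle."""
    return depth + (n - 1)

def seq_cycles(n, cycles_per_insn=2):
    """Sequential: 2 cycles per ALU instruction, no overlap."""
    return n * cycles_per_insn

# On a long run of ALU instructions PIPE5 approaches 1 cycle/instruction:
assert pipe5_cycles(100) == 104
assert seq_cycles(100) == 200
```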

Conclusion

Many different architectures can be built from an instruction set definition. For many RISCs, the 4- or 5-stage single-issue pipeline is a very efficient implementation, particularly when memory latency is not too high and when there is no FPU (our FPU is larger than PIPE5).
Simpler (sequential) architectures are too slow for their size. More elaborate architectures like in-order superscalar require a lot of extra complexity to achieve a significant performance increase (diminishing returns).

…Our exploration of PIPE5 will continue in the next article.
