Let’s start with something simple, the IU_SEQ integer unit.
The IU_SEQ architecture is sequential: instructions are processed one by one, and the CPU waits for each instruction to complete before starting the next one.
Compared with the pipelined version, there is neither data bypassing nor stalls due to instruction dependencies. This architecture is smaller and reaches a higher clock frequency but, as most instructions take more than one cycle, overall performance is poor.
There is still some overlap between instruction fetch and execution, as the fetch engine runs one instruction ahead of the execution part. This kind of architecture is sometimes called a two-stage pipeline. A strictly non-pipelined SPARC is hardly feasible because of the “delayed branch” principle and the infamous RETT instruction (more about that later).
After several dysfunctional implementations, I eventually borrowed the fetcher from the pipelined IU_PIPE5 architecture. The two fetchers are therefore quite similar, and most explanations about that part of IU_SEQ also apply to IU_PIPE5.
The fetch part manages the PC and nPC registers and performs the memory accesses. An instruction fetch is started as soon as a new address is generated, and the instructions read are stored in a two-entry FIFO until they are executed. The following signals control the FETCH block:
as_c: Indicates that the instruction can be discarded from the queue.
na_c: Indicates that a new access shall be started at address npc_c.
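The two-entry instruction queue can be pictured with the minimal sketch below. It is not the actual IU_SEQ fetcher: the entity name, the push, din, dout and valid ports, and the active-low reset are assumptions made for illustration. Only as_c comes from the description above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-entry instruction queue (illustration only).
entity fetch_fifo2 is
  port (
    clk     : in  std_logic;
    reset_n : in  std_logic;                      -- assumed active-low reset
    push    : in  std_logic;                      -- an instruction word arrived from memory
    din     : in  std_logic_vector(31 downto 0);
    as_c    : in  std_logic;                      -- the head instruction can be discarded
    dout    : out std_logic_vector(31 downto 0);  -- head entry, next instruction to execute
    valid   : out std_logic                       -- '1' when the head entry holds an instruction
  );
end entity fetch_fifo2;

architecture rtl of fetch_fifo2 is
  signal d0, d1 : std_logic_vector(31 downto 0);
  signal lev    : natural range 0 to 2 := 0;      -- number of stored instructions
begin
  dout  <= d0;
  valid <= '1' when lev > 0 else '0';

  process (clk)
    variable d0_v, d1_v : std_logic_vector(31 downto 0);
    variable lev_v      : natural range 0 to 2;
  begin
    if rising_edge(clk) then
      if reset_n = '0' then
        lev <= 0;
      else
        d0_v  := d0;
        d1_v  := d1;
        lev_v := lev;
        -- Pop first: the second entry moves to the head.
        if as_c = '1' and lev_v > 0 then
          d0_v  := d1_v;
          lev_v := lev_v - 1;
        end if;
        -- Then push into the first free entry (dropped if the queue is full).
        if push = '1' and lev_v < 2 then
          if lev_v = 0 then
            d0_v := din;
          else
            d1_v := din;
          end if;
          lev_v := lev_v + 1;
        end if;
        d0  <= d0_v;
        d1  <= d1_v;
        lev <= lev_v;
      end if;
    end if;
  end process;
end architecture rtl;
```

A pop and a push can happen in the same cycle, which is what lets the queue keep pace with an execution part that absorbs one instruction while one new word comes back from memory.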
For most instructions, the access can occur during the first cycle. The new address is either nPC+4 or the branch destination for conditional branches. For “JMPL” indirect branches, two integer registers must be read, and the na_c signal is generated one cycle later (in the EXECUTE state instead of the DECODE state).
The SPARC architecture uses delayed branches. When an instruction is executed, it determines the address of the second instruction after it. After a trap (or a RESET), two consecutive instructions are automatically loaded from the trap vector. After that, the CPU iteratively absorbs one instruction (as_c) and produces one address (na_c), which maintains the interlocking of the fetch engine with the delay slot.
Because of the pipelined nature of the PLOMB bus, the FETCH engine can push up to two addresses before getting the first data, which can make traces a bit difficult to read.
A finite state machine controls instruction execution. The main states are DECODE, EXECUTE, ADRS and TRANS:
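The delayed-branch rule boils down to a simple update of the PC/nPC pair, sketched below. The package and procedure names are hypothetical, and the annul bit of SPARC branches is deliberately ignored.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper, not taken from iu_pack.vhd.
package delay_slot_sketch is
  procedure advance_pc (
    pc, npc  : in  unsigned(31 downto 0);
    taken    : in  boolean;                  -- current instruction is a taken branch
    target   : in  unsigned(31 downto 0);    -- branch destination
    pc_next  : out unsigned(31 downto 0);
    npc_next : out unsigned(31 downto 0));
end package delay_slot_sketch;

package body delay_slot_sketch is
  procedure advance_pc (
    pc, npc  : in  unsigned(31 downto 0);
    taken    : in  boolean;
    target   : in  unsigned(31 downto 0);
    pc_next  : out unsigned(31 downto 0);
    npc_next : out unsigned(31 downto 0)) is
  begin
    -- The instruction at nPC (the delay slot after a branch) always runs next.
    pc_next := npc;
    -- The executing instruction only chooses the address after that one.
    if taken then
      npc_next := target;
    else
      npc_next := npc + 4;
    end if;
  end procedure advance_pc;
end package body delay_slot_sketch;

-- Example: a taken branch at 0x100 towards 0x200
--   (PC, nPC) = (0x100, 0x104) -> (0x104, 0x200) -> (0x200, 0x204)
-- The instruction at 0x104 is the delay slot and is executed before the target.
```

The pc input is unused by the update itself; it is kept in the interface only to show that the new pair depends on nPC and on the branch decision, never on the current PC.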
- Relative branches don’t leave the DECODE state.
- ALU operations go through the DECODE and EXECUTE states.
- Load/Store instructions go through the DECODE, EXECUTE (for effective address calculation), ADRS (for address generation) and TRANS (for data read) states. Load Double, Store Double and atomic Read/Write instructions pass twice through the EXECUTE, ADRS and TRANS states.
Trap handling uses the TRAP, TRAP2 and TRAP3 states for saving registers. Traps are like very special branches: they are immediate (not delayed), so when an instruction or an external interrupt triggers a trap, the instruction fetch queue must be emptied (TRAP2 state) before jumping to the trap vector address.
Many parts of this CPU are described in VHDL functions and procedures (iu_pack.vhd) instead of concurrent code. ALU operations are done in the “alu_op” function, and Load/Store operations are described in the “lsu_op” function. This simplifies the IU_SEQ file and enables code reuse: the same ALU is used by the PIPE5 and the SIM architectures.
Integer multiply and divide instructions do not fit within the combinatorial ALU, so they are implemented in the IU_MULDIV entity. When a MUL or DIV is encountered, the CPU stays in the EXECUTE state for several cycles; the muldiv_sel and muldiv_ack signals indicate the beginning and the end of the calculation.
The floating point unit is far more complex. This unit has its own set of registers, decodes instructions independently and executes them asynchronously from the integer part. Instructions are forwarded to the FPU during the DECODE state. The FPU is largely autonomous, except for FP load and store instructions, which require cooperation between the IU and the FPU (signals fpu_do_ack, fpu_di_maj, fpu_do, fpu_di): addresses are driven by the IU while data are controlled by the FPU.
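The three paths listed above can be summarised as a next-state function. This is only a sketch: the sEXEC and sADRS names appear in the IU_SEQ source, but sDECODE, sTRANS, the kind_t type and the second_pass flag are assumptions, and JMPL, multi-cycle MUL/DIV and the trap states are left out.

```vhdl
-- Hypothetical package, for illustration only.
package iu_seq_fsm_sketch is
  type phase_t is (sDECODE, sEXEC, sADRS, sTRANS);
  type kind_t  is (K_BRANCH, K_ALU, K_LS, K_LS_DOUBLE);

  function next_phase (
    phase       : phase_t;
    kind        : kind_t;
    second_pass : boolean)   -- true during the second half of a double access
    return phase_t;
end package iu_seq_fsm_sketch;

package body iu_seq_fsm_sketch is
  function next_phase (
    phase       : phase_t;
    kind        : kind_t;
    second_pass : boolean)
    return phase_t is
    variable n : phase_t;
  begin
    n := sDECODE;
    case phase is
      when sDECODE =>
        -- Relative branches complete without leaving DECODE.
        if kind /= K_BRANCH then
          n := sEXEC;
        end if;
      when sEXEC =>
        -- ALU operations end here, memory accesses continue.
        if kind = K_LS or kind = K_LS_DOUBLE then
          n := sADRS;
        end if;
      when sADRS =>
        n := sTRANS;
      when sTRANS =>
        -- Double and atomic accesses loop once more through EXECUTE.
        if kind = K_LS_DOUBLE and not second_pass then
          n := sEXEC;
        end if;
    end case;
    return n;
  end function next_phase;
end package body iu_seq_fsm_sketch;
```

With this model, a relative branch costs one state, an ALU operation two, a simple load or store four, and a double access seven (DECODE, then EXECUTE, ADRS and TRANS twice), ignoring any memory wait cycles.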
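One possible way to hold the EXECUTE state during a multiply or divide is sketched below. The exact protocol of IU_MULDIV is not described here, so this sketch assumes that muldiv_sel is a one-cycle start pulse and that muldiv_ack is asserted for one cycle when the result is ready. The entity name and the in_exec, is_muldiv and exec_done ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical handshake wrapper (illustration only).
entity muldiv_wait_sketch is
  port (
    clk        : in  std_logic;
    reset_n    : in  std_logic;   -- assumed active-low reset
    in_exec    : in  std_logic;   -- the state machine is in EXECUTE
    is_muldiv  : in  std_logic;   -- the decoded instruction is a MUL or DIV
    muldiv_ack : in  std_logic;   -- assumed: end of calculation, one cycle
    muldiv_sel : out std_logic;   -- assumed: start of calculation, one cycle
    exec_done  : out std_logic    -- EXECUTE may be left at the end of this cycle
  );
end entity muldiv_wait_sketch;

architecture rtl of muldiv_wait_sketch is
  signal busy : std_logic := '0';  -- a calculation has been started and is pending
begin
  -- Start the calculation on the first EXECUTE cycle only.
  muldiv_sel <= in_exec and is_muldiv and not busy;

  -- Plain ALU operations finish at once, MUL/DIV wait for the acknowledge.
  exec_done <= in_exec and ((not is_muldiv) or muldiv_ack);

  process (clk)
  begin
    if rising_edge(clk) then
      if reset_n = '0' then
        busy <= '0';
      elsif in_exec = '1' and is_muldiv = '1' then
        -- Stay busy until the acknowledge arrives.
        busy <= not muldiv_ack;
      end if;
    end if;
  end process;
end architecture rtl;
```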
General purpose (windowed) integer registers are stored in FPGA embedded RAM blocks: IU_REGS_2R1W. For 8 windows, 132 32-bit registers are needed. The register file can perform up to two reads and one write simultaneously. Two reads are needed for ALU operations and dual-index memory accesses (e.g. LD [R1+R2],R3). This register bank entity is reused by the pipelined version; IU_SEQ alone could instead alternate reads and writes.
Stores are a bit special as they sometimes need three registers: ST R1,[R2+R3]. A second register read access is therefore done in the EXECUTE state, after the DECODE state, reusing the first port of the register bank.
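A common way to obtain two read ports and one write port from FPGA block RAM is to keep two identical copies of the memory, written together and read separately. The sketch below illustrates that general technique; it is not the actual IU_REGS_2R1W code, and the entity name, port names and the ABITS generic are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 2-read / 1-write register bank (illustration only).
entity regs_2r1w_sketch is
  generic (
    ABITS : natural := 8           -- 2**8 = 256 entries, enough for 132 registers
  );
  port (
    clk    : in  std_logic;
    raddr1 : in  unsigned(ABITS-1 downto 0);
    raddr2 : in  unsigned(ABITS-1 downto 0);
    rdata1 : out std_logic_vector(31 downto 0);
    rdata2 : out std_logic_vector(31 downto 0);
    we     : in  std_logic;
    waddr  : in  unsigned(ABITS-1 downto 0);
    wdata  : in  std_logic_vector(31 downto 0)
  );
end entity regs_2r1w_sketch;

architecture rtl of regs_2r1w_sketch is
  type ram_t is array (0 to 2**ABITS-1) of std_logic_vector(31 downto 0);
  -- Two copies holding the same contents, one per read port.
  signal ram_a, ram_b : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- A write updates both copies, so they never diverge.
      if we = '1' then
        ram_a(to_integer(waddr)) <= wdata;
        ram_b(to_integer(waddr)) <= wdata;
      end if;
      -- Synchronous reads: data is available one cycle after the address.
      -- A read of the address being written returns the old value.
      rdata1 <= ram_a(to_integer(raddr1));
      rdata2 <= ram_b(to_integer(raddr2));
    end if;
  end process;
end architecture rtl;
```

The cost is twice the memory, which is usually acceptable because a single block RAM is often larger than the whole register file.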
-- Load/Store in the EXECUTE or ADRS state: read port 1 is re-addressed with rd
IF (phase=sEXEC OR phase=sADRS) AND cat.unit=LS THEN
  n_rs1_v:=n_rd;
END IF;
Special purpose registers like PSR, TBR or RY are inferred in the sequential process at the end of IU_SEQ. Register updates are triggered by a set of “_maj” signals, based on instruction decoding, which is done by the “decode()” procedure (IU_PACK). The “cat” structure sorts instructions into categories and records their effects and dependencies.
As-is, the instruction and data busses are not fully utilised and can be connected together (with a PLOMB_MUX) with minimal loss of performance. This architecture would be a bit saner with a single merged instruction/data bus.
From a software point of view, IU_SEQ behaves the way software expects a CPU to behave: instructions are executed one by one, in program order and independently of each other, and when a trap occurs, the instruction is simply interrupted.
Most of the emergent complexity of the other architectures deals with imitating sequential behaviour with parallel blocks.
Finally, IU_SEQ is useful for comparing behaviours with IU_PIPE5 when trying to boot a recalcitrant operating system.