Before digging into the pipelined version, let’s examine the SPARC instruction set.
SPARC is a descendant of the early RISC designs made at the University of California, Berkeley (1980 to 1984).
A CPU implementation is sometimes confused with its instruction set architecture. Very different implementations can be built for a given instruction set: x86 is a good example, as it debuted in simple multicycle microcoded CPUs and is now implemented by advanced speculative, superscalar, out-of-order, pipelined architectures. The misleading claim that “there is a RISC deep inside modern x86” adds to the confusion, as RISC is about instruction sets, not micro-operations and internal encodings.
Ideally, an instruction set should not mandate any particular implementation. Alas, some hardware details always surface. For example, delayed branches are a burden for software; they are a side effect of the implementation of simple pipelined microprocessors.
RISC used to mean “Reduced Instruction Set Computer”, in opposition to “Complex…”: seldom-used instructions were removed to reduce chip size and reach higher clock frequencies.
RISC instruction sets are also developed with pipelining in mind: [almost] all instructions should have comparable complexity and should be divisible into a few simple elementary steps (simple = low propagation time = fast clock). Instruction decode, for example, is one of these steps; it is kept simple by using a fixed-width, regular instruction format.
When a sequenced CPU (for example our own IU_SEQ) executes an instruction, not all parts of the CPU are used simultaneously: fetch, execute and memory accesses alternate.
With pipelining, almost all parts of the CPU can be used simultaneously (well, when the cache manages to provide instructions and data…). This optimal use of hardware resources and the reduction of the number of cycles per instruction is what allowed the RISCs to cream the CISCs in the late 1980s. This performance gap was dramatically reduced when CISC microprocessors were eventually pipelined (i486, MC68040…).
SPARC Instruction formats
- ADD R1,R2,R3 R3=R1+R2
- ADD R1,Simm,R3 R3=R1+Signed Immediate value (13 bits)
Operations are ADD, SUB, MUL, DIV, OR, AND, NOT… Each instruction requires one or two register reads and one register write. Register R0 is hardwired to 0.
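To make the "fixed width and regular format" point concrete, here is a minimal sketch of how the two ADD forms above can be split into fields. The bit positions follow the SPARC V8 format-3 layout (op, rd, op3, rs1, i, then rs2 or simm13); the function names `sign_extend` and `decode_format3` are made up for this illustration and are not taken from the TEMLIB sources.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def decode_format3(word):
    """Split a 32-bit format-3 instruction word into its fields."""
    fields = {
        "op": (word >> 30) & 0x3,    # 2 = arithmetic/logic, 3 = load/store
        "rd": (word >> 25) & 0x1F,   # destination register
        "op3": (word >> 19) & 0x3F,  # operation selector (0 = ADD)
        "rs1": (word >> 14) & 0x1F,  # first source register
        "i": (word >> 13) & 0x1,     # 0 = register operand, 1 = immediate
    }
    if fields["i"]:
        fields["simm13"] = sign_extend(word, 13)  # 13-bit signed immediate
    else:
        fields["rs2"] = word & 0x1F               # second source register
    return fields


print(decode_format3(0x86004002))  # encoding of ADD R1,R2,R3
print(decode_format3(0x86007FFF))  # encoding of ADD R1,-1,R3
```

Every field sits at a fixed position whatever the operation, which is what lets a hardware decoder read the register numbers before it even knows which instruction it is looking at.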
Updating the NZVC condition codes (Negative, Zero, oVerflow, Carry) is optional: ADD/ADDcc, SUB/SUBcc…
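As an illustration of what the "cc" variants compute, the following sketch models ADDcc and SUBcc on 32-bit values. It is a behavioural model written for this article, not the IU code; the helper names are hypothetical. Note that for subtraction the C flag represents a borrow.

```python
MASK32 = 0xFFFFFFFF


def addcc(a, b):
    """Return (result, (N, Z, V, C)) for a 32-bit ADDcc."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = result >> 31
    z = int(result == 0)
    # Overflow: operands have the same sign and the result sign differs.
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    c = full >> 32                      # carry out of bit 31
    return result, (n, z, v, c)


def subcc(a, b):
    """Return (result, (N, Z, V, C)) for a 32-bit SUBcc (a - b)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31
    z = int(result == 0)
    # Overflow: operands have different signs and the result sign differs from a.
    v = (((a ^ b) & (a ^ result)) >> 31) & 1
    c = int(b > a)                      # borrow, as unsigned comparison
    return result, (n, z, v, c)


print(addcc(0x7FFFFFFF, 1))  # signed overflow case
print(subcc(1, 2))           # borrow case
```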
SAVE and RESTORE instructions are like ADD, but the register window is changed during execution: the source registers are read before the change and the destination is written after the change. Typical use is with the O6/I6 registers, which are used (as an ABI convention) as Stack Pointer and Frame Pointer. SAVE copies the stack pointer and reserves the area for locals on the stack (like, for x86, the ENTER instruction or the PUSH EBP / MOV EBP,ESP / SUB ESP,xxx prologue).
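The "read in the old window, write in the new one" behaviour is easier to see in a small model. The sketch below assumes 8 windows (the real count is implementation-defined), only handles an immediate second operand, and ignores window overflow/underflow traps entirely; the class and method names are invented for the example.

```python
NWINDOWS = 8          # assumption: the window count is implementation-defined
MASK32 = 0xFFFFFFFF
SP, FP = 14, 30       # O6 and I6


class WindowedRegisters:
    def __init__(self):
        self.globals = [0] * 8
        self.windowed = [0] * (NWINDOWS * 16)
        self.cwp = 0                      # current window pointer

    def _slot(self, reg):
        if reg < 16:                      # outs: R8..R15
            slot = self.cwp * 16 + (reg - 8)
        elif reg < 24:                    # locals: R16..R23
            slot = self.cwp * 16 + 8 + (reg - 16)
        else:                             # ins: R24..R31 = outs of window cwp+1
            slot = (self.cwp + 1) * 16 + (reg - 24)
        return slot % len(self.windowed)

    def read(self, reg):
        if reg < 8:
            return self.globals[reg]      # globals[0] is never written, stays 0
        return self.windowed[self._slot(reg)]

    def write(self, reg, value):
        if reg == 0:
            return                        # R0 is hardwired to 0
        if reg < 8:
            self.globals[reg] = value & MASK32
        else:
            self.windowed[self._slot(reg)] = value & MASK32

    def save(self, rs1, imm, rd):
        value = self.read(rs1) + imm          # source read in the OLD window
        self.cwp = (self.cwp - 1) % NWINDOWS
        self.write(rd, value)                 # destination written in the NEW window

    def restore(self, rs1, imm, rd):
        value = self.read(rs1) + imm          # source read in the OLD window
        self.cwp = (self.cwp + 1) % NWINDOWS
        self.write(rd, value)                 # destination written in the NEW window


regs = WindowedRegisters()
regs.write(SP, 0x1000)
regs.save(SP, -96, SP)       # prologue: 96 bytes is an arbitrary frame size here
print(hex(regs.read(SP)), hex(regs.read(FP)))
regs.restore(0, 0, 0)        # epilogue: back to the caller's window
print(hex(regs.read(SP)))
```

After the SAVE, the caller's stack pointer is visible as the callee's frame pointer without any copy: the caller's "out" registers and the callee's "in" registers are the same physical registers.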
- SETHI Imm22,R1
This instruction sets the high 22 bits of a register and clears the low 10 bits. Why 22 bits? Because this is what the instruction encoding allowed.
The canonical form for initializing a register with a 32-bit value is to place an OR after a SETHI instruction: SETHI %hi(value),R1 followed by OR R1,%lo(value),R1.
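The arithmetic behind the SETHI/OR pair can be sketched as follows. The `hi` and `lo` helpers mimic what the assembler's %hi() and %lo() operators compute (upper 22 bits and lower 10 bits); the function names are chosen for this example only.

```python
MASK32 = 0xFFFFFFFF


def hi(value):
    """Upper 22 bits of a 32-bit value, as used by SETHI."""
    return (value & MASK32) >> 10


def lo(value):
    """Lower 10 bits of a 32-bit value, small enough for an immediate."""
    return value & 0x3FF


def sethi(imm22):
    """SETHI: place imm22 in bits 31..10 and clear bits 9..0."""
    return (imm22 & 0x3FFFFF) << 10


def load_constant(value):
    """Model of SETHI followed by OR on the same register."""
    register = sethi(hi(value))
    register |= lo(value)
    return register


print(hex(load_constant(0xDEADBEEF)))
```

The low 10 bits fit comfortably in the 13-bit signed immediate of the OR instruction, so two instructions are always enough for any 32-bit constant.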
Loads and stores
- LD [R1+R2],R3 ST R3,[R1+R2]
- LD [R1+Imm],R3 ST R3,[R1+Imm]
Loads and stores can move 8, 16, 32 or 64 bits between memory and registers.
Accesses must be aligned (otherwise a trap is triggered, and the access can then be emulated…).
The 64-bit variants can be implemented with 64-bit datapaths or split over two cycles. The same goes for 32-bit and 64-bit transfers to/from FPU registers.
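The alignment rule above boils down to "the address must be a multiple of the access size". A minimal sketch, in which a Python exception stands in for the trap and memory is a big-endian byte array (SPARC is big-endian); the names `load` and `MemAddressNotAligned` are illustrative:

```python
class MemAddressNotAligned(Exception):
    """Stands in for the trap taken on a misaligned access."""


def load(memory, address, size):
    """Load `size` bytes (1, 2, 4 or 8) from big-endian memory."""
    if size not in (1, 2, 4, 8):
        raise ValueError("unsupported access size")
    if address % size != 0:
        raise MemAddressNotAligned(hex(address))
    return int.from_bytes(memory[address:address + size], "big")


memory = bytearray(range(16))
print(hex(load(memory, 4, 4)))   # aligned 32-bit load
try:
    load(memory, 2, 4)           # 32-bit load at an address that is not a multiple of 4
except MemAddressNotAligned as trap:
    print("trap at", trap)
```

In hardware the check is just a test of the low address bits, which is why aligned-only accesses keep the memory stage simple.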
- Bicc Destination
Conditional branches are based on the condition code flags (NZVC).
These branches are delayed: the instruction following the branch is executed before jumping to the destination address.
Similar FBfcc instructions provide the floating-point conditional branches.
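Delayed branches are usually described with two program counters, PC and nPC: a taken branch only changes nPC, so the instruction already pointed to by nPC (the delay slot) still executes. The sketch below models that for a handful of Bicc conditions. It is a simplification made for this article: the annul bit is ignored, and the program is a hypothetical dictionary of addresses rather than real encodings.

```python
# Each condition takes the (N, Z, V, C) flags and says whether the branch is taken.
CONDITIONS = {
    "ba": lambda n, z, v, c: True,
    "be": lambda n, z, v, c: z == 1,
    "bne": lambda n, z, v, c: z == 0,
    "bl": lambda n, z, v, c: (n ^ v) == 1,
    "bge": lambda n, z, v, c: (n ^ v) == 0,
    "bgu": lambda n, z, v, c: (c | z) == 0,
    "bleu": lambda n, z, v, c: (c | z) == 1,
}


def run(program, icc, start, steps):
    """Return the list of addresses executed, in order."""
    pc, npc = start, start + 4
    trace = []
    for _ in range(steps):
        instruction = program[pc]
        trace.append(pc)
        next_npc = npc + 4
        if instruction[0] in CONDITIONS:
            target = instruction[1]
            if CONDITIONS[instruction[0]](*icc):
                next_npc = target        # only nPC changes: the delay slot still runs
        pc, npc = npc, next_npc
    return trace


program = {
    0x00: ("be", 0x20),   # branch if Z is set
    0x04: ("nop",),       # delay slot, always executed
    0x08: ("nop",),
    0x0C: ("nop",),
    0x20: ("nop",),       # branch target
    0x24: ("nop",),
}

print([hex(a) for a in run(program, icc=(0, 1, 0, 0), start=0, steps=3)])  # Z set
print([hex(a) for a in run(program, icc=(0, 0, 0, 0), start=0, steps=3)])  # Z clear
```

In both runs the instruction at 0x04 is executed; only the third address differs.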
- CALL relative address
CALL is an unconditional delayed branch which also saves the current PC value into the R15=O7 register.
Bicc, FBfcc and CALL instructions can be decoded with minimal effort and do not need to read any integer register. It is therefore possible to process these instructions directly during instruction decode, minimizing branch latency.
The JMPL [R1+R2],R3 instruction jumps to the address R1+R2 and saves the current PC into R3. It is used for many purposes:
as a register-indirect branch, as a return from subroutine, or during trap exit for going back to the interrupted instruction.
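The address arithmetic of a CALL and of the matching return can be sketched as follows. CALL carries a 30-bit word displacement relative to its own address; the usual return is a JMPL to O7+8, which skips both the CALL and its delay slot. The function names are illustrative, and the delay slots themselves are not modelled here.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def call(pc, disp30):
    """Return (branch target, value saved into O7)."""
    target = (pc + 4 * sign_extend(disp30, 30)) & MASK32
    return target, pc


def jmpl(pc, rs1_value, operand):
    """Return (branch target, value saved into the destination register)."""
    target = (rs1_value + operand) & MASK32
    return target, pc


target, o7 = call(0x1000, 0x40)               # CALL located at 0x1000
return_target, _ = jmpl(target + 0x10, o7, 8)  # return: JMPL [O7+8],R0
print(hex(target), hex(o7), hex(return_target))
```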
Special registers are read and updated by a few special instructions: WRPSR, WRY, WRWIM…
To simplify pipelining, the update of special registers may be delayed by up to three cycles, so these instructions must be followed either by NOPs or by instructions that do not depend on the updated register.
(Of course, imposing such constraints with words like “undefined behaviour” in the standard is a bad idea.)
The floating-point instruction set includes the classical ADD/SUB/MUL/DIV/Convert operations on the floating-point registers, in either single or double precision. No direct transfer is possible between integer and floating-point registers: everything must go through memory accesses. This is something of a legacy from the time when the floating-point coprocessor was on a separate chip.
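As an illustration of the "through memory" path, converting an integer register to a float takes three steps: store the integer, load the same word into a floating-point register, then convert it (FiTOs). The sketch below models this with a scratch memory word and raw 32-bit register contents; the helper names and the scratch address are assumptions made for the example.

```python
import struct

memory = bytearray(16)   # scratch area, big-endian like SPARC
f_regs = [0] * 32        # floating-point registers, held as raw 32-bit patterns


def st(value, address):
    """Store a 32-bit integer register value to memory."""
    memory[address:address + 4] = (value & 0xFFFFFFFF).to_bytes(4, "big")


def ldf(address, fd):
    """Load a 32-bit word from memory into a floating-point register."""
    f_regs[fd] = int.from_bytes(memory[address:address + 4], "big")


def fitos(fs, fd):
    """Convert the signed integer held in f_regs[fs] to single precision."""
    raw = f_regs[fs]
    signed = raw - (1 << 32) if raw & 0x80000000 else raw
    f_regs[fd] = struct.unpack(">I", struct.pack(">f", float(signed)))[0]


def int_to_float_bits(value, scratch=0):
    st(value, scratch)   # integer register -> memory
    ldf(scratch, 0)      # memory -> floating-point register
    fitos(0, 1)          # integer bit pattern -> single-precision value
    return f_regs[1]


print(hex(int_to_float_bits(3)))
```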
This is almost everything. The SPARC32 instruction set is pretty straightforward. There are no special instructions for managing the cache and MMU registers; everything is done through memory accesses with special ASI codes (see the TEMLIB manual). MMU registers are not part of the IU.
Many other similar simple RISC instruction sets are still used nowadays: MIPS, DLX, OpenRISC, MicroBlaze, NIOS, MICO… They allow straightforward implementation in a scalar pipeline, like IU_PIPE5, which I will describe in following articles. The main downside of flat (fixed-width) instruction encoding is code density, which, in SPARC's case, is not good. The modern solution is to provide variable-length instruction encodings (ARM Thumb, MIPS MicroMIPS, PowerPC VLE…).