Our CPU tries to be compatible with the 32-bit SPARC Version 8 standard.
The specifications can be downloaded there :
SPARC V8 is a 32-bit RISC architecture. Version 9 and later editions describe the 64-bit variant adopted by current CPUs from Sun/Oracle (UltraSPARC, SPARC Tx) and Fujitsu (SPARC64).
Versions 1 to 6 of the standard remained drafts; the first SPARC (MB86900) and a few others were V7. The most notable change between V7 and V8 is the introduction of integer multiply and divide instructions (compilers provide options for enabling their use).
Our design strives to follow the specification, in order to run existing software, compilers and operating systems. Nevertheless, no guarantee is provided that the CPU is actually compliant with any standard.
SPARCs are ‘traditional’ RISC CPUs (similar to MIPS, PowerPC…). Their most notable quirks are the windowed integer registers and delayed branches, which are described below.
(Let’s suppose you have already read or written some assembly code.)
Instructions are all 32 bits wide.
There are 32 integer registers directly accessible.
Load/Store instructions are distinct from arithmetic and logic instructions.
Memory is byte addressable but accesses should be aligned.
There is an optional floating point unit based on the IEEE 754 standard.
The floating point unit has 32 single precision registers, which can be combined into 16 double precision registers. They are independent from the integer part.
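To make the fixed 32-bit width concrete, here is a minimal C sketch that splits an instruction word into the fields of the SPARC V8 ‘format 3’ layout (used by arithmetic, logic and load/store instructions). With 32 addressable registers, each register field is 5 bits wide. The names sparc_fmt3, decode_fmt3 and is_word_aligned are invented for this example and are not taken from our design.

```c
#include <stdint.h>

/* Fields of a SPARC V8 "format 3" instruction word. */
typedef struct {
    unsigned op;      /* bits 31..30: major opcode group */
    unsigned rd;      /* bits 29..25: destination register */
    unsigned op3;     /* bits 24..19: operation within the group */
    unsigned rs1;     /* bits 18..14: first source register */
    unsigned i;       /* bit 13: 1 = immediate operand, 0 = register operand */
    unsigned rs2;     /* bits 4..0: second source register (used when i == 0) */
    int32_t  simm13;  /* bits 12..0: sign-extended immediate (used when i == 1) */
} sparc_fmt3;

/* Hypothetical helper: extract every field with shifts and masks. */
sparc_fmt3 decode_fmt3(uint32_t insn)
{
    sparc_fmt3 f;
    f.op  = (insn >> 30) & 0x3;
    f.rd  = (insn >> 25) & 0x1f;
    f.op3 = (insn >> 19) & 0x3f;
    f.rs1 = (insn >> 14) & 0x1f;
    f.i   = (insn >> 13) & 0x1;
    f.rs2 = insn & 0x1f;
    /* Sign-extend the 13-bit immediate to 32 bits. */
    f.simm13 = (int32_t)(insn & 0x1fff);
    if (f.simm13 & 0x1000)
        f.simm13 -= 0x2000;
    return f;
}

/* A 32-bit word access is aligned when its address is a multiple of 4. */
int is_word_aligned(uint32_t addr)
{
    return (addr & 3u) == 0;
}
```

For instance, the word 0x84006001 decodes as op = 2, op3 = 0 (ADD), rs1 = 1, rd = 2, with an immediate of 1.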
Windowed integer registers
Instructions include one or several 5-bit fields for selecting among 32 registers. Seven of these registers are global and always accessible, 24 belong to the current window of a circular buffer that holds several register sets, and R0 always reads as zero. Most implementations have 8 windows, each contributing 16 registers of its own because the remaining 8 are shared with a neighbouring window, plus the 7 globals: 8 × 16 + 7 = 135 general purpose integer registers.
The OUT registers of a window are the IN registers of the next one, whereas the LOCAL registers are unique.
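The overlap between windows can be sketched as a mapping from a register number to an index in the physical register file. This is only an illustration under stated assumptions: 8 windows, the usual SPARC numbering (r8 to r15 are OUT, r16 to r23 LOCAL, r24 to r31 IN), and a SAVE that decrements the window pointer as in SPARC V8. The function name phys_reg and the physical layout are invented for the example; a real implementation is free to organise its register file differently.

```c
#define NWINDOWS  8                 /* assumed window count (the common default) */
#define NWINDOWED (NWINDOWS * 16)   /* each window adds 16 registers of its own */

/*
 * Map register r (0..31), seen from window cwp, to a physical index.
 * Indices 0..7 hold the globals (index 0 always reads as zero);
 * indices 8 and above form the circular windowed file.
 */
int phys_reg(int cwp, int r)
{
    if (r < 8)
        return r;   /* globals are shared by all windows */
    /* r - 8 is 0..7 for OUT, 8..15 for LOCAL, 16..23 for IN */
    return 8 + (cwp * 16 + (r - 8)) % NWINDOWED;
}
```

With this layout, phys_reg(3, 24) and phys_reg(4, 8) give the same index: the first IN register of window 3 is the first OUT register of window 4, which is the caller's window when SAVE moves from 4 to 3. The modulo makes the file circular, so window 7 overlaps window 0 in the same way.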
Instead of using the stack for passing parameters across function calls, the parameters are passed through the IN/OUT registers and up to 8 temporary local variables are readily available.
The SAVE instruction, executed at the beginning of a procedure, reserves a new register window, while the RESTORE instruction at the end does the inverse. Simple leaf functions can avoid saving and restoring registers altogether.
Programs can nest procedure calls at will. SAVE triggers the WINDOW_OVERFLOW trap when no free register set remains, and RESTORE triggers the WINDOW_UNDERFLOW trap when all the sets held in registers have already been freed. The operating system copies register contents to and from memory and hides the actual number of register windows from application software.
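The trap conditions can be modelled in a few lines of C. The sketch below assumes the SPARC V8 mechanism: a current window pointer (CWP) that SAVE decrements and RESTORE increments, and a window invalid mask (WIM) with one bit per window that the operating system sets to mark the boundary. The type and function names are invented for the example, and the trap handlers that spill and refill windows are not modelled.

```c
#define NWINDOWS 8   /* assumed window count */

typedef struct {
    unsigned cwp;   /* current window pointer, 0..NWINDOWS-1 */
    unsigned wim;   /* window invalid mask, one bit per window */
} win_state;

/* SAVE: returns 1 if it would trap (window overflow), otherwise
 * moves to the next window and returns 0. */
int do_save(win_state *s)
{
    unsigned next = (s->cwp + NWINDOWS - 1) % NWINDOWS;
    if (s->wim & (1u << next))
        return 1;
    s->cwp = next;
    return 0;
}

/* RESTORE: returns 1 if it would trap (window underflow), otherwise
 * moves back to the previous window and returns 0. */
int do_restore(win_state *s)
{
    unsigned next = (s->cwp + 1) % NWINDOWS;
    if (s->wim & (1u << next))
        return 1;
    s->cwp = next;
    return 0;
}

/* Count how many nested SAVEs succeed before the first overflow.
 * Returns -1 if no window is marked invalid (the count would be unbounded). */
int calls_before_overflow(win_state s)
{
    int n = 0;
    if ((s.wim & ((1u << NWINDOWS) - 1)) == 0)
        return -1;
    while (!do_save(&s))
        n++;
    return n;
}
```

Starting from cwp = 0 with only window 1 marked invalid, six SAVEs succeed and the seventh would trap; a RESTORE from the starting window would trap immediately, since the window it should return to is the invalid one.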
The windowed registers behave like a cache of the integer stack. The 7 global registers store temporary values and application-global constants and pointers.
When the operating system switches between applications, it must save all windows at once.
Our implementation can be configured with any number of windows between 2 and 32. Alas, as operating systems must be adapted, it is usually simpler to keep the default of 8 windows.
Traps automatically do a SAVE operation and get a fresh set of registers (no ‘manual’ register saving is necessary for simple trap processing routines). There must therefore be at least one free window at all times.
The concept of windowed registers was borrowed from the RISC I and II projects at Berkeley, which were SPARC’s ancestors. Other CPU families have used this concept, for example the AMD 29000 and the Intel i960.
Delayed branches
Unlike previous generations of CISC CPUs, which used microcode and executed instructions in several cycles (VAX, MC68K…), the RISC CPUs were conceived (in the 1980s) around the constraints of pipelining.
Most instructions should have comparable complexity and should be divided into sub-operations which can be executed concurrently: FETCH, DECODE, EXECUTE…
When an instruction is in the DECODE stage, the previous one is in the EXECUTE stage and the next one is in the FETCH stage (when the instructions are fetched from the cache without wait states).
On SPARCs, branches are very simple instructions which can be handled at the DECODE stage (no integer register is accessed, only the flags are).
In an architecture where branches are not delayed, the instruction following the branch, already fetched, must be discarded from the pipeline if the branch is taken, which wastes one cycle.
With delayed branches, this instruction (said to be in the delay slot) is kept and executed, and the branch takes effect one cycle later.
In assembly code, delay slot instructions are usually indented to the right (for example, see the Linux source, in /arch/sparc/ .S files).
The SPARCs introduced an additional refinement called ‘annulling’. Depending on the opcode and the branch result, the instruction in the delay slot is either executed or annulled (i.e. it acts as a NOP).
for (i=1,j=1;i<10;i++) j=j*i;
Can be compiled as:
    OR   %R0,1,%L0
    OR   %R0,1,%L1
    MULS %L0,%L1,%L1
    CMP  %L0,10
    BL,A -2
     ADD %L0,1,%L0
In this loop, the last increment is annulled when the branch is not taken. (Strictly speaking, the condition should also be tested before the first iteration.)
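To check how the listing behaves, here is a small C model of it. It is a sketch, not a SPARC simulator: instructions are identified by their position (0 to 5) instead of real encodings, MULS is treated as a plain multiply, and the flags are reduced to a single ‘less than’ bit. It does follow the way SPARC defines delayed branches, with a PC and a next PC (nPC), and it applies the rule described above for this branch: the delay slot is annulled when the branch is not taken. The names run_loop and loop_result are invented for the example.

```c
typedef struct {
    long i;   /* final value of %L0 */
    long j;   /* final value of %L1 */
} loop_result;

/* Execute the six-instruction listing, comparing i against `limit`. */
loop_result run_loop(long limit)
{
    long i = 0, j = 0;
    int less = 0;          /* stands in for the flags set by CMP */
    int pc = 0, npc = 1;   /* current and next instruction positions */
    int annul = 0;         /* 1 when the next instruction must be skipped */
    loop_result r;

    while (pc < 6) {
        int next_npc = npc + 1;     /* default: sequential execution */
        if (annul) {
            annul = 0;              /* annulled instruction acts as a NOP */
        } else {
            switch (pc) {
            case 0: i = 1; break;                /* OR   %R0,1,%L0 */
            case 1: j = 1; break;                /* OR   %R0,1,%L1 */
            case 2: j = j * i; break;            /* MULS %L0,%L1,%L1 */
            case 3: less = (i < limit); break;   /* CMP  %L0,limit */
            case 4:                              /* BL,A -2 */
                if (less)
                    next_npc = pc - 2;  /* taken: target follows the delay slot */
                else
                    annul = 1;          /* not taken: annul the delay slot */
                break;
            case 5: i = i + 1; break;            /* ADD  %L0,1,%L0 (delay slot) */
            }
        }
        pc = npc;          /* the delay slot runs before the branch target */
        npc = next_npc;
    }
    r.i = i;
    r.j = j;
    return r;
}
```

Under this model, run_loop(10) ends with i = 10 and j = 3628800 (10!). The listing compares i before incrementing it, so the multiply runs once more than in the C loop, which stops at 9! = 362880. A faithful translation would need a slightly different comparison or instruction order.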
A few other CPU families adopted delayed branches: for example, MIPS has one branch delay slot, and the HP PA-RISC had one slot and a ‘nullification’ mechanism similar to SPARC’s annulling.
Much like windowed registers, delayed branches are an optimization that is only effective on very simple implementations and becomes a burden in more modern, complex designs (with dynamic branch prediction, speculation, out of order execution…).
I’ll probably write more detailed articles on these two subjects later.