Some instructions last more than one cycle, even on RISC CPUs.
Sometimes external circumstances, for example slow memory, prevent the instruction from continuing. Sometimes the instruction is split into simpler operations because of limitations in the hardware, for example 64-bit loads and stores. Sometimes the instruction is too complex to fit in one cycle, for example divisions.
– Due to the limited number of read and write ports of the register file and/or the 64-bit memory accesses, the LDD[A], ST[A], STD[A], SWAP[A] and LDSTUB[A] instructions are split in the decode stage. They occupy two or three pipeline stages, and the cycle_dec signal counts the cycles.
64-bit loads and stores are split into two 32-bit accesses, starting with the most significant word at the lower address. These instructions could have been avoided, as they differ from the genuine 64-bit SPARC V9 load and store instructions, which use one 64-bit register instead of a pair of 32-bit registers. They are useful for block copies and register window save/restore traps on CPUs with internal caches having 64-bit buses (which PIPE5 is not).
– SWAP[A] and LDSTUB[A] generate a load followed by a store. The swapped register content is saved in the pipeline, between the EXE and MEM stages. These instructions are scheduled like a double store, and the data that will be stored is already present on the PLOMB bus during the load access.
– The integer multiply and divide instructions last several cycles, but instead of being split into sub-operations, they stall the EXECUTE stage.
– The floating-point instructions can also last several cycles but, as the FP pipeline is separate, they do not stall the EXECUTE stage. Instead, when the FPU is busy and unable to accept a new instruction, it stalls the IU at the DECODE level. Interleaving FP and integer instructions, instead of clustering them, improves performance.
– There is one last type of multi-cycle instruction: JMPL (and RETT). For most instructions, the next instruction address is PC+4 or, for branches and calls, some arithmetic combination of the current PC and an immediate value.
For JMPL/RETT, the next instruction address is the result of an addition that depends on integer register values. That addition is performed by the ALU during the EXECUTE stage. JMPL lasts two cycles, but in a way that differs from the other multi-cycle instructions: it forces a stall to enable the injection of the new PC from the EXECUTE stage. That stall is not necessary if the following instruction is also a JMPL (or a RETT).
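The difference between the address computations can be sketched in a few lines of Python. This is only an illustrative model, not the PIPE5 logic: the function names are hypothetical, the field widths (disp22, disp30, simm13) are those of the SPARC V8 instruction formats, and delay slots and the nPC register are deliberately ignored. The point is that only the JMPL target needs a register value, which is why it has to wait for the ALU.

```python
MASK32 = 0xFFFFFFFF  # addresses are 32 bits wide


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def target_sequential(pc):
    # Default case: the next instruction follows the current one.
    return (pc + 4) & MASK32


def target_branch(pc, disp22):
    # Branch: PC plus a sign-extended word displacement (immediate only).
    return (pc + 4 * sign_extend(disp22, 22)) & MASK32


def target_call(pc, disp30):
    # CALL: PC plus a 30-bit word displacement (immediate only).
    return (pc + 4 * sign_extend(disp30, 30)) & MASK32


def target_jmpl(regs, rs1, operand2):
    # JMPL/RETT: register plus second operand (r[rs2] or sign-extended simm13).
    # The register read is what makes this depend on the EXECUTE stage.
    return (regs[rs1] + operand2) & MASK32
```

The first three functions need only the PC and bits of the instruction word, so their result can be known early. `target_jmpl` takes the register file as an input, which mirrors the dependency described above.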
In traditional CISC processors, most instructions lasted several cycles and the decoder used a ROM table to generate the successive sub-operations. In PIPE5, there is neither a sequencer nor microcode, only a few multiplexers, counters and toggles.
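As a rough illustration of how a small counter can replace microcode, the Python sketch below models the split of LDD, STD and SWAP into 32-bit accesses. It is a behavioural sketch under stated assumptions, not the PIPE5 implementation: the function names, the dictionary used as word-addressed memory and the loop variable standing in for a cycle counter are all hypothetical, and rd is assumed to be even for the double-word cases.

```python
def ldd(memory, addr, regs, rd):
    """Load double: two 32-bit loads, most significant word (low address) first."""
    for cycle in range(2):                    # stands in for a 1-bit cycle counter
        regs[rd + cycle] = memory[addr + 4 * cycle]


def std(memory, addr, regs, rd):
    """Store double: two 32-bit stores, in the same order as ldd."""
    for cycle in range(2):
        memory[addr + 4 * cycle] = regs[rd + cycle]


def swap(memory, addr, regs, rd):
    """SWAP: a load followed by a store to the same address."""
    saved = regs[rd]          # register content kept aside while the load runs
    regs[rd] = memory[addr]   # first access: load
    memory[addr] = saved      # second access: store the saved value


# Usage: copy a 64-bit value through the register pair r8/r9.
memory = {0x100: 0x11223344, 0x104: 0x55667788, 0x200: 0, 0x204: 0}
regs = [0] * 32
ldd(memory, 0x100, regs, 8)
std(memory, 0x200, regs, 8)
```

Each function is just a fixed sequence of one or two simple accesses selected by a counter value, which is the kind of control that a few multiplexers and a toggle can provide without a ROM table.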