Faster stores for IU_PIPE5

Just a detail from the latest revision.

As IU_PIPE5 has only two read ports on the register bank, store instructions like ST R1,[R2+R3] take two cycles because three registers are read:

  • During the first cycle, the address is generated by the ALU : R2+R3
  • During the second cycle, the data to store is retrieved : R1

To accelerate stores and save one cycle, the easiest method is to add a read port to the register file. Here is, for example, the Fujitsu MB86830, an embedded SPARC with 3 read ports, one of which is dedicated to stores:

MB86830 (figure copied from the datasheet)

But that’s not fun.

There are two other methods:

(A) Many store instructions need only two registers, for example ST R1,[R2].
These stores could be handled in 1 cycle. Floating point stores also require at most two integer register accesses.
(B) Re-use some of the values available through the pipeline bypass as stores are often placed after instructions modifying the same registers. The pipeline is used as a cache of recently modified registers.

For now, we will try (A) only. (B) is used a lot in wide superscalar CPUs, precisely to limit the number of ports of the register file.

The following addressing modes are available:

(1) ST RD, [RS1 + RS2] Normal dual index access, 3 register reads.
(2) ST RD, [R0 + RS2] Not used.
(3) ST RD, [RS1 + R0] Normal single register access, written as ST RD, [RS].
(4) ST RD, [RS1 + 0(immediate)] Not used.
(5) ST RD, [RS1 + Immediate] Normal single register access + offset.
(6) ST RD, [R0 + Immediate] Page zero addressing. Not used (maybe useful in embedded systems; page zero is often marked as invalid by the MMU to detect NULL pointer accesses).

(R0 is hardwired = 0. “not used” means that compilers do not naturally generate this configuration.)

Case (1) can only be optimized through bypassing or with another register read port.
By optimizing immediates and R0 as the second register operand, cases (3),(4),(5) and (6) will be optimized. Only (2) will not be optimized, but it doesn’t matter.
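The list of cases above can be summarized as a small decision rule. The following Python sketch is only an illustrative model, not the actual hardware description: the name `store_issue_cycles` and its parameters are hypothetical, and it assumes, as stated above, that only the second address operand is checked for an immediate or R0.

```python
from typing import Optional

R0 = 0  # index of the register hardwired to zero


def store_issue_cycles(rs1: int,
                       rs2: Optional[int] = None,
                       imm: Optional[int] = None) -> int:
    """Cycles spent by the IU on ST RD,[RS1 + RS2] or ST RD,[RS1 + imm].

    Illustrative model of optimization (A) with two register read ports:
    RD and RS1 are always read, so the store fits in one cycle only when
    the second address operand does not need a register read.
    """
    if (rs2 is None) == (imm is None):
        raise ValueError("give exactly one of rs2 or imm")
    # rs1 is deliberately not tested against R0: case (2) stays unoptimized.
    second_operand_needs_port = rs2 is not None and rs2 != R0
    return 2 if second_operand_needs_port else 1


# The six cases from the list, in order.
cases = {
    1: store_issue_cycles(rs1=2, rs2=3),    # [RS1 + RS2]
    2: store_issue_cycles(rs1=R0, rs2=3),   # [R0 + RS2]
    3: store_issue_cycles(rs1=2, rs2=R0),   # [RS1 + R0]
    4: store_issue_cycles(rs1=2, imm=0),    # [RS1 + 0]
    5: store_issue_cycles(rs1=2, imm=8),    # [RS1 + Immediate]
    6: store_issue_cycles(rs1=R0, imm=8),   # [R0 + Immediate]
}
```

In this model, cases (1) and (2) take two cycles and cases (3) to (6) take one.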

The MCU still needs two cycles for stores (during the first cycle, the MCU reads cache tags, then possibly updates cache data during the second cycle), so the IU is stalled for one cycle when a store is immediately followed by a load instruction or by another store.
This optimization works for random code, not for block copies with back-to-back memory accesses.
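To see why back-to-back memory accesses do not benefit, here is a rough cycle-count model in Python. It is a sketch under assumptions that are not stated in the original: ALU instructions and loads are taken to issue in one cycle, every store is assumed to use an optimizable addressing form, and an unoptimized two-cycle store is assumed to cover the MCU's second cycle so that no extra stall follows it. The name `total_cycles` is hypothetical.

```python
def total_cycles(program, fast_stores=True):
    """Count IU cycles for a sequence of 'alu', 'load' and 'store' instructions.

    Illustrative model only: a one-cycle store leaves the MCU busy for one
    more cycle, so a memory access issued right after it stalls for a cycle.
    """
    cycles = 0
    prev_was_fast_store = False
    for kind in program:
        if kind not in ("alu", "load", "store"):
            raise ValueError(f"unknown instruction kind: {kind}")
        # Stall if the MCU is still finishing the previous one-cycle store.
        if prev_was_fast_store and kind in ("load", "store"):
            cycles += 1
        if kind == "store" and not fast_stores:
            cycles += 2  # unoptimized store: address cycle, then data cycle
            prev_was_fast_store = False
        else:
            cycles += 1
            prev_was_fast_store = (kind == "store")
    return cycles


# Store followed by unrelated ALU work: the optimization saves a cycle.
mixed_fast = total_cycles(["store", "alu"], fast_stores=True)
mixed_slow = total_cycles(["store", "alu"], fast_stores=False)

# Store immediately followed by a load: the saved cycle is lost to the stall.
copy_fast = total_cycles(["store", "load"], fast_stores=True)
copy_slow = total_cycles(["store", "load"], fast_stores=False)
```

Under these assumptions, a store followed by an ALU instruction saves one cycle, while a store followed by a load costs the same with or without the optimization.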

The floating point instructions STF and STDF are also sometimes faster, but the STF FX,[RS1+RS2] and STDF FX,[RS1+RS2] forms are still not optimized even though they use only two integer registers.

This tweak seems to give a few percent of extra performance.

2 thoughts on “Faster stores for IU_PIPE5”

  1. I’m a bit stuck figuring out RAM access on the Cyclone V GX... it’s not a part of an FPGA I’m familiar with using (I’ve only done general slow IO).

    I did get Windows installed on my computer so I could run the Windows versions of ISE (Xilinx’s drivers weren’t compatible with my kernel and would have required some hacking around to get working on Linux, unfortunately) and Quartus.

    • DDRAM controllers are complex, and Altera’s seems worse than Xilinx’s.
      Sorry for the delay.
      I have been busy lately trying to fix the unsupported operating systems, and found a few bugs.
