FIFOs, buffers, and pipelines

Things you ought to know!

PLOMB_FIFO

The PLOMB bus is fine but, sometimes, there are long combinatorial paths between initiators, targets, multiplexers and selectors. To cut these paths, one can place D flip-flops (DFFs) across the busses. Obviously, that adds latency but, with the help of pipelining, burst transactions and interleaved accesses from several initiators, throughput can still be increased.

We place a lock on the canal:

[Figure: FIFO_1]

The buffer is best described as a FIFO. Input accesses are stored while the FIFO is not full and output accesses are retrieved while the FIFO is not empty.

[Figure: FIFO_2]

It can be used for the plomb_w pipe with the .req/.ack signals and for the plomb_r pipe with the .dreq/.dack signals. One can choose to place the delay on one side, on the other, or on both.

The initial goal was not to store many pending accesses in a FIFO (although being able to store a burst can be useful sometimes) but to break combinatorial paths. The smallest possible FIFO has one cell, so FIFO.full = NOT FIFO.empty. We get something like this:

[Figure: FIFO_3]

Such a FIFO has the following behaviour:

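[Figure: FIFO_5]

This is not great. Even if the MEM side is always ready to accept data (MEM.ACK tied to 1), half the bandwidth is wasted: once the single cell holds a word, the FIFO reports full and refuses the next word during the cycle in which the stored one is handed to MEM, so at best one access goes through every two cycles.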
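As an illustration, here is a minimal sketch of such a one-cell buffer. It is not the actual PLOMB_FIFO code: the entity name, the port names (i_* on the initiator side, o_* on the target side) and the active-low reset are assumptions made for the example, and a transfer is assumed to happen on a clock edge where req and ack are both high.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical one-cell buffer; names are illustrative, not PLOMB's.
ENTITY fifo1_sync IS
  GENERIC (WIDTH : positive := 32);
  PORT (
    clk     : IN  std_logic;
    reset_n : IN  std_logic;
    -- Initiator (CPU) side
    i_req   : IN  std_logic;
    i_data  : IN  std_logic_vector(WIDTH-1 DOWNTO 0);
    i_ack   : OUT std_logic;
    -- Target (MEM) side
    o_req   : OUT std_logic;
    o_data  : OUT std_logic_vector(WIDTH-1 DOWNTO 0);
    o_ack   : IN  std_logic);
END ENTITY fifo1_sync;

ARCHITECTURE rtl OF fifo1_sync IS
  SIGNAL full : std_logic;  -- single cell: full = NOT empty
  SIGNAL cell : std_logic_vector(WIDTH-1 DOWNTO 0);
BEGIN
  -- All outputs come straight from flip-flops: no input-to-output path.
  i_ack  <= NOT full;
  o_req  <= full;
  o_data <= cell;

  PROCESS (clk, reset_n)
  BEGIN
    IF reset_n = '0' THEN
      full <= '0';
    ELSIF rising_edge(clk) THEN
      IF full = '0' THEN
        IF i_req = '1' THEN     -- accept a word from the initiator
          cell <= i_data;
          full <= '1';
        END IF;
      ELSIF o_ack = '1' THEN    -- the target took the stored word
        full <= '0';
      END IF;
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;
```

With i_req and o_ack both held at '1', full toggles on every clock edge, which is the "one access every two cycles" behaviour described above.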

With a two-cell FIFO, the bus can be used efficiently:

[Figure: FIFO_6]

Because of the use of sequential elements, the feedback path (ACK) lags the forward path (REQ & data) by one cycle. That delay is mitigated by being able to store up to two accesses. In the figure above, a stall of “D4” on the MEM side causes a stall of “D6” on the CPU side.
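A two-cell buffer without any combinatorial path between its two sides could be sketched as follows. Again, this is only an illustration under assumptions: the entity and port names are made up, reset is assumed active low, and a transfer is assumed to happen when req and ack are both high on a clock edge.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical two-cell buffer; names are illustrative, not PLOMB's.
ENTITY fifo2_sync IS
  GENERIC (WIDTH : positive := 32);
  PORT (
    clk     : IN  std_logic;
    reset_n : IN  std_logic;
    i_req   : IN  std_logic;
    i_data  : IN  std_logic_vector(WIDTH-1 DOWNTO 0);
    i_ack   : OUT std_logic;
    o_req   : OUT std_logic;
    o_data  : OUT std_logic_vector(WIDTH-1 DOWNTO 0);
    o_ack   : IN  std_logic);
END ENTITY fifo2_sync;

ARCHITECTURE rtl OF fifo2_sync IS
  TYPE cells_t IS ARRAY (0 TO 1) OF std_logic_vector(WIDTH-1 DOWNTO 0);
  SIGNAL cells     : cells_t;               -- cells(0) is the newest word
  SIGNAL count     : natural RANGE 0 TO 2;  -- number of stored words
  SIGNAL push, pop : boolean;
BEGIN
  -- Handshake outputs depend on the registered count only,
  -- never on i_req or o_ack.
  i_ack  <= '1' WHEN count /= 2 ELSE '0';
  o_req  <= '1' WHEN count /= 0 ELSE '0';
  -- The oldest word sits in cells(1) when both cells are used.
  o_data <= cells(1) WHEN count = 2 ELSE cells(0);

  push <= (i_req = '1') AND (count /= 2);
  pop  <= (o_ack = '1') AND (count /= 0);

  PROCESS (clk, reset_n)
  BEGIN
    IF reset_n = '0' THEN
      count <= 0;
    ELSIF rising_edge(clk) THEN
      IF push THEN
        cells(1) <= cells(0);
        cells(0) <= i_data;
      END IF;
      IF push AND NOT pop THEN
        count <= count + 1;
      ELSIF pop AND NOT push THEN
        count <= count - 1;
      END IF;
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;
```

In steady state, count stays at 1 with one push and one pop per cycle. A single stall on the target side raises count to 2, and the initiator sees i_ack drop only on the following cycle.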

With a three-cell FIFO, it is possible to hide one stall cycle:

[Figure: FIFO_7]

Of course, a deeper FIFO can store more accesses and absorb more delays.

With a combinatorial path on ACK, you save one FIFO level:

[Figure: FIFO_4]

The combinatorial path saves one cycle; a wait state on “D4” generates a wait state on “D5”.
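A minimal sketch of a one-cell buffer with a combinatorial ACK path is shown below. The names are hypothetical and the handshake convention (transfer when req and ack are both high on a clock edge) is an assumption; the only point is that i_ack now depends directly on o_ack, so the cell can be refilled in the same cycle in which it is emptied.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical one-cell buffer with a combinatorial ACK path.
ENTITY fifo1_comb IS
  GENERIC (WIDTH : positive := 32);
  PORT (
    clk     : IN  std_logic;
    reset_n : IN  std_logic;
    i_req   : IN  std_logic;
    i_data  : IN  std_logic_vector(WIDTH-1 DOWNTO 0);
    i_ack   : OUT std_logic;
    o_req   : OUT std_logic;
    o_data  : OUT std_logic_vector(WIDTH-1 DOWNTO 0);
    o_ack   : IN  std_logic);
END ENTITY fifo1_comb;

ARCHITECTURE rtl OF fifo1_comb IS
  SIGNAL full   : std_logic;
  SIGNAL cell   : std_logic_vector(WIDTH-1 DOWNTO 0);
  SIGNAL accept : std_logic;
BEGIN
  -- Combinatorial path from o_ack to i_ack: a new word is accepted
  -- when the cell is empty OR is being emptied in this very cycle.
  accept <= (NOT full) OR o_ack;
  i_ack  <= accept;

  -- The forward path (REQ & data) is still fully registered.
  o_req  <= full;
  o_data <= cell;

  PROCESS (clk, reset_n)
  BEGIN
    IF reset_n = '0' THEN
      full <= '0';
    ELSIF rising_edge(clk) THEN
      IF i_req = '1' AND accept = '1' THEN
        cell <= i_data;         -- load, or replace the word just taken
        full <= '1';
      ELSIF o_ack = '1' THEN
        full <= '0';            -- word taken, nothing to replace it
      END IF;
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;
```

The trade-off is that the long ACK path is not cut any more: only the REQ and data paths are broken by flip-flops.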

[Figure: FIFO_8]

The actual PLOMB_FIFO entity is quite flexible and not as readable as I would have liked. Both the Write and the Read paths are configurable as:

  • COMB: FIFO with combinatorial path. Configurable depth.
  • SYNC: FIFO without combinatorial path. Configurable depth.
  • DIRECT: No FIFO, just wiring.

FIFO implementation

There are many, many small FIFOs in the design.
Apart from the one in the Ethernet MAC, which can store a 64-byte Ethernet frame and uses a dual-port memory with read and write pointers, most FIFOs are much smaller: sometimes only 1 or 2 cells, sometimes deep enough to store a whole burst transfer.
These FIFOs are usually implemented like this:

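 -- Inside a clocked process. push and pop are assumed to be BOOLEANs,
 -- FIFO is assumed to be declared as an array (0 TO DEPTH-1).
 IF push THEN
     -- Shift the new word in at index 0; older words move up by one.
     FIFO<=datain & FIFO(0 TO DEPTH-2);
 END IF;
 IF push AND NOT pop THEN
     IF lv='1' THEN
         lev<=lev+1;  -- not empty: the oldest word moved up by one
     END IF;
     lv<='1';
 ELSIF pop AND NOT push THEN
     IF lev/=0 THEN
         lev<=lev-1;
     ELSE
         lv<='0';     -- the last word was popped
     END IF;
 END IF;

 -- Outside the process: lev indexes the oldest word.
 dataout<=FIFO(lev);
 fifo_is_not_empty <= lv;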

This is sometimes called a variable-length shift register.
There are many other ways to implement a synchronous FIFO. This version is specifically tailored for Xilinx FPGAs, using SRL16 primitives:

[Figure: FIFO_9]

These primitives use LUTs in a special way to assemble a 16-bit shift register with a multiplexer. This way, a 2-deep FIFO has the same area as a 16-deep FIFO.
I do not know how it is synthesised on Altera, Lattice and other devices, or whether alternative implementations would be preferable. If need be, these FIFOs can be implemented with discrete FFs, which is usually better than true dual-port memory for very small FIFOs.

VHDL-2008 introduced generic types, allowing the creation of generic FIFOs. For now, the language used is still VHDL’93 and each FIFO is described separately…
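For reference, here is a sketch of what such a generic FIFO could look like in VHDL-2008, reusing the shift-register scheme shown earlier. It is not part of the design: the entity name, the ports and the synchronous reset are assumptions, and it needs a simulator or synthesiser that supports generic types.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- VHDL-2008 sketch; entity and port names are hypothetical.
ENTITY generic_fifo IS
  GENERIC (
    TYPE data_t;               -- element type, chosen at instantiation
    DEPTH : positive := 16);
  PORT (
    clk               : IN  std_logic;
    reset             : IN  std_logic;  -- synchronous, active high
    push, pop         : IN  boolean;
    datain            : IN  data_t;
    dataout           : OUT data_t;
    fifo_is_not_empty : OUT std_logic);
END ENTITY generic_fifo;

ARCHITECTURE rtl OF generic_fifo IS
  TYPE fifo_t IS ARRAY (0 TO DEPTH-1) OF data_t;
  SIGNAL fifo : fifo_t;
  SIGNAL lev  : natural RANGE 0 TO DEPTH-1;  -- index of the oldest word
  SIGNAL lv   : std_logic;                   -- '1' when not empty
BEGIN
  dataout           <= fifo(lev);
  fifo_is_not_empty <= lv;

  -- The caller must not push when the FIFO is full,
  -- nor pop when it is empty.
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      IF push THEN
        -- Shift register without reset, to stay SRL-friendly.
        FOR i IN DEPTH-1 DOWNTO 1 LOOP
          fifo(i) <= fifo(i-1);
        END LOOP;
        fifo(0) <= datain;
      END IF;
      IF push AND NOT pop THEN
        IF lv = '1' THEN
          lev <= lev + 1;
        END IF;
        lv <= '1';
      ELSIF pop AND NOT push THEN
        IF lev /= 0 THEN
          lev <= lev - 1;
        ELSE
          lv <= '0';
        END IF;
      END IF;
      IF reset = '1' THEN        -- last assignment wins
        lev <= 0;
        lv  <= '0';
      END IF;
    END IF;
  END PROCESS;
END ARCHITECTURE rtl;
```

The element type is then given in the generic map, for example a constrained std_logic_vector subtype or a record holding address, data and control fields. Some tools may require a named subtype there.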

Other small FIFOs and buffers:

  • IU: Stores fetch accesses, the PC+nPC pipe.
  • IU: Stores data read back on the instruction and data busses
  • MCU: Buffers external accesses from the instruction and data busses. Write back buffer.
  • PLOMB_MUX: Stores port numbers
  • PLOMB_SEL: Stores port numbers
  • LANCE: Stores/prepares burst transfers
  • LANCE: Data read/write storage
  • ESP: Memory transfers
  • VID: Burst transfers and cross clock domain buffers.

etc…

Beyond the PLOMB_FIFO case, understanding flow regulation and pipelining is essential. I first encountered this problem a long time ago, when trying to implement a PCI [parallel] interface: you need a FIFO and a way to hold data when the current access may be delayed or cancelled.
(A lecture somewhat related to this subject: http://www-inst.eecs.berkeley.edu/~cs250/fa10/lectures/lec07.pdf)
