First, there are two kinds of multipliers and dividers in our CPU, the integer and the floating point ones. They are a bit different.
- Integer multiplication and division must support both signed and unsigned numbers whereas floating point use unsigned scalars, sign is processed separately.
- Precision is different: 32×32→64 for integer (or 64×64→64 on many 64bits CPUs) instead of 23×23→23 (simple) or 52×52→52 (double) (or 52×52→104 as an intermediate result for CPUs having multiply-add instructions).
- Floating point multiplication and division is done on very narrow range numbers (for example 1≤x<2), and cannot overflow. Operands are pre-normalised.
- Floating point division is often coupled with square root. Square root is [almost] useless for integers.
Despite all these differences, some CPUs shared the multiplier and/or divider for integer and floating point. A SPARC example: The Sun/TI SuperSparc.
Actual multiplier implementations are about Booth encodings, Wallace trees and carry-save adders. I will not explain it here (maybe later). If you really want to learn now about low level details, there are plenty of lectures online.
For the multiplier, there is not much to say about our implementation. There is either a slow and small iterative multiplier or a multiplier built from the hardwired blocks available in large modern FPGAs.
The divider is also a quite classic “non restoring” divider. I will nevertheless detail it as it is often badly (or wrongly) explained.
One bit at a time, shift and add. Like a paper and pencil multiplication.
Early SPARC CPUs (V7) had no dedicated multiplication and division hardware. There was only the MULScc instruction for elementary one bit multiplication step.
(see SparcV8 standard, Annex E, page 205 for multiplications based on MULScc)
Our iterative multiplier is not optimized, it requires 32 cycles for all multiplications.
To the contrary, a software library using MULScc can adapt the number of iterations to the operand size. ‘int16’ or ‘char’ multiplications should be faster than ‘int32’ ones.
Our multiplier implements both signed and unsigned multiplications; it is derived from the “Pezaris” multiplier structure.
Combination of multiplier blocks
Xilinx FPGAs (and Alteras, and all large FPGAs) have plenty of signed 17×17 or 18×18 bits multipliers. We can take four of them to build a 32×32bits multiplier, or use iteratively one multiplier and a few multiplexers :
X = XH × 216 + XL
Y = YH × 216 + YL
X × Y = XH × YH × 232 + (XH × YL + XL × YH) × 216 + XL × YL
Of course, unsigned multiplication is done with signed multipliers with the MSB tied to 0.
Other solutions like Karatsuba or Toom-Cook are possible, but not worth the effort as many hardware multiplers are available.
The integer divider uses the « non restoring » algorithm. The basic principle is trivial. The tricky part is dealing with signed numbers. There are two methods:
- Cheat. Convert everything to positive then invert the result as needed. Because unsigned division is easy.
- Make an actual signed non restoring divider, and properly handle all the corner cases. This is the right way. This is what we do.
All division algorithms are based on numeric suites which converge towards the quotient. Many algorithms exists, they can be roughly divided into the additive (bit iterative) or multiplicative (functional) methods.
Our integer divider uses a bit iterative method, the “non restoring division”, which is able to deliver one quotient bit per cycle.
What is a division ?
Dividend / Divisor ≈ Quotient
Quotient × Divisor + Remainder = Dividend
For integer divisions, rounding is always done towards zero both for positive and negative numbers:
- 7 / 2 = 3.5
- 7 div 2 = 3, remainder = 1
- -7 div 2 = -3, remainder = -1
- 7 div -2 = -3, remainder = 1
- -7 div -2 = 3, remainder = -1
The remainder should always have the same sign as the dividend, or be equal to zero.
Note that integer divisions by powers of two are often replaced by right shifts. This is false for negative numbers:
- 7 >> 1 = 3
- −7 >> 1 = -4
SPARC divisions do not provide the remainder (contrary to x86, for example). It offers more freedom for the implementation. Still, divisions are used very often for converting from binary to decimal, whenever a number is printed. That conversion is done with repeated divisions and modulo operations, so the remainder can be sometimes quite useful.
Natural numbers can be decomposed into a sum of products:
X = x0 × 20 + x1 × 21 + x2 × 22 + x3 × 23 ….
The most obvious decomposition uses for x0…xn the digits 0 and 1. This is a plain binary number. (For negative numbers, use two’s complement, multiply by -1 the MSB)
It is also possible to use the digits +1 and -1 (often written as 1). Transformation into the binary form is straightforward.
We generate the vector a which replaces -1 by 0. X = a – not(a) = a + a + 1 = 2a+1.
(Two’s complement : -a = not(a) + 1).
Conversion from (+1/-1) to (1/0), e.g. binary is just shifting left, placing 1 into the LSB and inverting the MSB.
[Many other decompositions are possible, for example -1/0/+1, which has its uses, for another time…]
Non Restoring Division
Dividing is about keeping the invariant :
Divisor × Quotienti + Remainderi = Dividend
We start with R0=Dividend and Q0=0.
Then, we minimise the remainder, from MSB to LSB, by subtracting the divisor when it is positive and adding when it is negative. (And, for signed division, when the divisor is negative, do the inverse…)
Si = sign(Ri-1)
Ri = Ri-1 – Divisor × Si × 232-i
Qi = Qi-1 + 232-i × Si
R is 64 bits wide at first, then it looses one bit every cycle (except when there is an overflow, but in that case the result will be wrong anyway)
Instead of shifting right the divisor (X×Si×232-i) and have a 64bit add/sub, it is better to shift left the remainder and keep the divisor aligned left, using only a 32bits addition.
We get something like :
R’i = 2 × R’i-1 – Divisor × Si × 232
After the final iteration, the remainder sign may be wrong : have a sign opposite to the dividend. In that case, the remainder must be ‘corrected’ by adding or subtracting the divisor and, correspondingly, subtract or add 1 to the quotient. Just before that, the quotient has been converted from [+1/-1] into binary form by shifting left and placing 1 into the LSB.
For unsigned division, a correction is necessary when the remainder is negative : It is corrected by adding the dividend to the remainder and, correspondingly, decrement the quotient. Decrementing the quotient is simple because we have just placed a 1 in the LSB. Unsigned divisions are easy!
For signed division, an adder is necessary for correcting negative results which are incremented.
Well, not exactly :
There are a few important details to address :
– What is the sign of zero ?
– Division by zero
Sign of zero
In the equations above, there is the sign(Ri) operator
The remainder may be positive, negative or equal to zero.
WHAT IS THE SIGN OF ZERO?
Ideally, zero should have the same sign as the dividend.
– Let’s try first with unsigned division.
If the [intermediate] remainder is equal to zero, it shall be considered as positive:
In that case, we set the quotient bit to +1 and the remainder becomes negative as we subtract X * 232-i. During the following iterations, the quotient digit will remain -1 and the remainder will stay negative. For example, starting from bit 4 downto bit 0:
– X × 24 + X × 23 + X × 22 + X × 21 + X × 20 = – X
As the remainder has a wrong sign, it must be corrected after the last iteration : The quotient is decremented and the final remainder becomes equal to zero :
“…+1-1-1…-1-1” → “…100…001” (conversion to binary) → “…100…000” (correction)
For unsigned divisions, sign(R) is :
IF (R>=0) THEN sign=1 ELSE sign=-1
With two’s complement numbers, (R>=0) is just checking the MSB. Easy !
– Signed division.
With signed division, there is a problem when the dividend is negative.
To get similar results as for unsigned division, a zero remainder should be considered as negative when the dividend is negative. The test becomes “IF (R<=0)”. Alas, this is far more difficult than “R>=0” with two’s complement numbers, a zero comparator must be added.
As the remainder is a 64bits value (at first), you need a pretty large comparator.
There is an alternative, testing (Ri<0) = NOT (Ri>=0).
Let’s take a negative dividend and positive divisor.
If the remainder is equal to zero, we will subtract the divisor and get R=-X, then, like for unsigned division, it will keep the same sign and value.
During the last iteration, if the sign is wrong, the quotient will be corrected as for unsigned division (‘decreased’ quotient because too much was removed from the remainder).
If the sign is correct but the remaider equal to the divisor, the quotient must also be corrected but in the opposite direction (‘increased’ quotient because the remainder is too large).
We have to compare the remainder with zero, then with the divisor which can be transformed into another comparison with zero, by using the divider’s adder/subtractor, with an additional multiplexer for disabling the left shift.
What are the benefits of this latter method ?
- The zero comparator is 32bits wide instead of 64bits
- The comparator is not in the iteration, allowing shorter cycle delay
- The remainder low bits and shift register can be reused to store the quotient
The downsides are :
- There is more work during the last iteration
- There is an additional 2→1 multiplexer
This second option is the one currently implemented in iu_muldiv.vhd
Divisions can overflow. Easily.
What is an overflow ?
For unsigned divisions, there is an overflow if the quotient is larger than or equal to 0x100000000 = 232, as we implement a 64bits / 32bits = 32bits division.
Checking overflows is testing Dividend – (Divisor × 232) ≥ 0
…which is straightforward, hardware for subtracting and shifting the divisor is already there. This is done during the first cycle. If the division does not overflow, the remainder is always negative during the second cycle.
For signed divisions, there are positive and negative overflows.
Positive overflows begin with 231. During the first cycle, the divisor is shifted left 32 times and overflows beyond 232 are detected, during the second cycle, the divisor is shifted left 31 times and overflows between 231 and 232 are detected.
Negative numbers are more complex, because of the value 0x80000000 = -231.
During the first and second cycles, the divider detects overflows like for positive numbers, checking -232 and -231.
At the end of the division sequence, if the result happens to be exactly equal to 0x80000000, overflow is cancelled. The divider detects iteratively the sequence Q = 01111…1111 = 0x7FFFFFFF, followed by a correction step.
There is a final weird overflow, when both the dividend and divisor are negative and the quotient is exactly equal to +231. We also get Q=0x7FFFFFFF before correction. This is the only overflow which is detected at the end of the division.
When dividend and divisor have the same size (e.g. 32bits/32bits or 64bits/64bits divisions), MIN_INT/-1 is the only possible overflow (unless you consider divide by zero as an overflow…).
On SPARCs, overflows are signalled with the “V” status bit. The result is set to the maximum positive or negative number : 0xFFFFFFFF for unsigned, 0x7FFFFFFF or 0x80000000 for signed. Some other CPU architectures don’t force the divider output with overflows (PowerPC for example), divisions can generate sometimes different results.
Overflow handling may be different among langages. The C langage does not check for overflows on divisions, it is the programmer’s job to check operands or invoke specialised constructs (like inline assembly) to get that information
SPARCs does not trap on division overflows, some other CPU architectures can do that, for example the x86 which have a common trap for division by zero and division overflow.
Divide by zero
Another zero comparator.
Detecting a division by zero and triggering a trap is not usually an urgent matter as the application will be probably killed after such an error. It is possible, for example, to do it iteratively.
The problem may instead arise from the rest of the CPU. Most integer instructions never trap, and the ones who does (for example software exception instructions) do it quickly, in one cycle.
Validating the divider was done with test benchs, random numbers and tests of pathological cases. Difficult cases occur mostly with exact divisions, which appear rarely with random numbers as operands.
As a deluxe premium ultimate website, here is a spreadsheet which shows signed non restoring division. Try random divisions. Try exact divisions and watch the differences in the remainder values between the two available dividers when the dividend and divisor signs are changed.
Making a correct divider and understanding how it works has been incredibly difficult, for me.
There are numerous onlines sources, lectures about division from famous universities, and most of them are wrong. I know a book about “digital arithmetics” where the chapter dedicated to signed non restoring division misses all the the problems related to zero remainder. As if being able to divide correctly -100/10 or -256/16 didn’t matter. They do not care about overflow either.
They certainly never tried the stuff they are teaching.
Baltringues. Minables. Ratés.
This discussion about multiplication and division will continue in a future article about the floating point unit, which can also calculate square roots (gasp !).