Computing floating-point division from an accurate reciprocal [WiP]

This article addresses the problem of computing $\circ(\frac{x}{y})$ from $\circ(\frac{1}{y})$, targeting single- and double-precision floating-point formats.

New method description


The method we use is based on the final steps of a Newton-Raphson division iteration [1, 2]. The iteration starts from an accurate approximation $r = \circ(\frac{1}{y})$ of the reciprocal and from the initial quotient approximation $a_0 = \circ(x \times r)$.

Let $\delta_0$ denote the absolute error of $a_0$ and $\epsilon_0$ its relative error, with $r = \frac{1}{y} (1 + \epsilon_r)$ and $\circ(x \times r) = x \times r \times (1 + \epsilon_{mult})$:
$\delta_0 = | \frac{x}{y} - \circ(x \times r)|$
$\delta_0 = | \frac{x}{y} - x \times \frac{1}{y} (1 + \epsilon_r) (1 +\epsilon_{mult}) |$
$\delta_0 = | \frac{x}{y} | \times | \epsilon_r + \epsilon_{mult} + \epsilon_r \epsilon_{mult} | $
$\epsilon_0 = \frac{\delta_0}{|\frac{x}{y}|} \le | \epsilon_r | + | \epsilon_{mult} | + | \epsilon_r \epsilon_{mult} |$

$r$ is assumed to be a correctly rounded approximation of $\frac{1}{y}$, and $a_0$ is also implemented as a correctly rounded multiplication, so $|\epsilon_{mult}|$ and $|\epsilon_r|$ are both bounded by $2^{-53}$ (in double precision), which means:

$\epsilon_0 \le 2^{-52} + 2^{-106}$: $a_0$'s relative error is slightly larger than one $ulp$. Our goal is to reach a final error below one half of one $ulp$ (i.e. a correctly rounded result).

Our method is based on the following (standard) refinement iteration, where each step is evaluated with a fused multiply-add (FMA) so that it incurs a single rounding:
$ e_1 = x - y \times a_0 $
$ a_1 = a_0 + e_1 \times r $
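For illustration, here is one refinement step on concrete binary64 values in plain Python (math.fma, which exposes a correctly rounded fused multiply-add, requires Python >= 3.13; the inputs are arbitrary):

import math

x, y = 1.0, 3.0
r = 1.0 / y                  # correctly rounded reciprocal
a0 = x * r                   # initial approximation of x / y
e1 = math.fma(-y, a0, x)     # residual x - y * a0, with a single rounding
a1 = math.fma(e1, r, a0)     # refined quotient a0 + e1 * r
print(a1 == x / y)           # True: a1 is the correctly rounded quotient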

A piece of pythonsollya code available at the end of this article can help measure the accuracy of this method in a few random test cases.

Without special care this method is not valid for all inputs:

  • $r$ overflows when $y$ is small enough (e.g. $0 < |y| \le 2^{-128}$ in binary32), even though $\frac{x}{y}$ may still be representable (see case (A) below)
  • computing $r$ raises DivideByZero when $y = \pm 0$ even if $x$ is $NaN$, a case where $\frac{x}{y}$ itself raises no flag (the snippet below demonstrates this)
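The second point can be checked with numpy's binary32 scalars; numpy and the Python warnings machinery are used here merely as a stand-in for reading the IEEE flags:

import warnings
import numpy as np

x, y = np.float32(np.nan), np.float32(0.0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _ = x / y                  # qNaN: NaN / 0 raises no flag
    _ = np.float32(1.0) / y    # +inf: raises DivideByZero, spurious here
print([str(w.message) for w in caught])  # only the divide-by-zero warning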

Special-case management


Once the numerical behavior is proven, we can focus on the behavior in the special cases.

| $x$ | $y$ | $\frac{1}{y}$ | $\frac{x}{y}$ | $a_0$ | $e_1$ | $a_1$ |
|---|---|---|---|---|---|---|
| $0$ | $0$ | $\infty$ DBZ | $qNaN$ IO | $qNaN$ IO | $qNaN$ | $qNaN$ |
| $0$ | Num | Num | $0$ | $0$ | $0$ | $0$ |
| $0$ | $\infty$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| $0$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $\infty$ | $0$ | $\infty$ DBZ | $\infty$ | $\infty$ | $qNaN$ IO | $qNaN$ |
| $\infty$ | Num | Num | $\infty$ | $\infty$ | $qNaN$ IO | $qNaN$ IO |
| $\infty$ | $\infty$ | $0$ | $qNaN$ IO | $qNaN$ IO | $qNaN$ | $qNaN$ |
| $\infty$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| Num | $0$ | $\infty$ DBZ | $\infty$ DBZ | $\infty$ | $qNaN$ IO | $qNaN$ |
| Num | Num | Num (A) | Num | Num (A) | Num | Num |
| Num | $\infty$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| Num | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | $0$ | $\infty$ DBZ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | Num | Num (A) | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | $\infty$ | $0$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |

(The last three columns are the successive intermediate values of the new method. DBZ and IO denote the DivideByZero and InvalidOperation flags, Num stands for a finite non-zero number, and (A) marks the special case discussed below.)

As illustrated by the table above, most cases behave similarly if we first compute $\frac{1}{y}$ as a single floating-point operation and then use the iteration to compute $\frac{x}{y}$. Here, behaving similarly means returning the same result and raising the same exception flags.
   However, there are a few cases where a spurious flag can be raised or an incorrect result returned. We have to carefully consider the flags raised during each step of the new method, $\circ(\frac{1}{y})$, $a_0$, $e_1$, $a_1$, as these sticky flags accumulate during the computation.
    Case (A) requires specific attention. An overflow in $\frac{1}{y}$ can occur (with or without the IEEE overflow flag) while $\frac{x}{y}$ remains in the range of normal numbers, for example if $y = 2^{-149}$ (the smallest binary32 subnormal) and $x = 2^{-126}$ (the smallest binary32 normal): $\circ_{RN}(\frac{1}{y}) = +\infty$ but $\circ_{RN}(\frac{x}{y}) = 2^{23}$.
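This case can be reproduced with numpy's binary32 scalars (a quick observational check, not part of the method):

import numpy as np

x = np.float32(2.0 ** -126)    # smallest normal binary32 number
y = np.float32(2.0 ** -149)    # smallest subnormal binary32 number
with np.errstate(over='ignore'):
    r = np.float32(1.0) / y    # overflows to +inf
print(r, x / y)                # inf 8388608.0, i.e. x / y = 2**23 is normal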

To avoid intermediate overflows, underflows, or spurious exceptions, we modify the method as follows:

$ y', s = normalize(y) $ (silent)
$ x', t = normalize(x) $ (silent)
$ r = \circ(\frac{1}{y'})$ (silent)
$ - = fsdiv(x, y) $ (result discarded)
$ a_0 = \circ(x' \times r)$
$ e_1 = x' - y' \times a_0 $ (silent)
$ a_1 = a_0 + e_1 \times r $ (silent)
$ n, m = 1, 1 $ if $ |x'| > |y'| $ else $ 0.5, 2 $
$ R = (a_1 \times m) \times (n \times s \times t) $
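The following plain-Python sketch models the value flow of this sequence in binary64. It cannot model the (silent) annotations, the flags, or fsdiv; normalize is emulated with math.frexp, the scaling factors are carried as exponents playing the role of $s$ and $t$, and math.fma requires Python >= 3.13:

import math

def normalize(v):
    """Return (mantissa, e) with v = mantissa * 2**e and 1 <= |mantissa| < 2.
    Zeros, infinities and NaNs pass through unscaled, as in the text."""
    if v == 0.0 or math.isinf(v) or math.isnan(v):
        return v, 0
    m, e = math.frexp(v)         # v = m * 2**e with 0.5 <= |m| < 1
    return 2.0 * m, e - 1

def scaled_division(x, y):
    """Value-level sketch of the modified method for finite non-zero inputs."""
    xp, ex = normalize(x)
    yp, ey = normalize(y)
    r = 1.0 / yp                 # reciprocal of the normalized divisor
    a0 = xp * r
    e1 = math.fma(-yp, a0, xp)   # x' - y' * a0, single rounding
    a1 = math.fma(e1, r, a0)     # a0 + e1 * r
    # keep a1 * m in [1, 2[ and fold n into the exponent so that the
    # final scaling is a single correctly rounded ldexp
    if abs(xp) > abs(yp):
        m, k = 1.0, ex - ey
    else:
        m, k = 2.0, ex - ey - 1
    return math.ldexp(a1 * m, k)

print(scaled_division(2.0 ** -1022, 2.0 ** -1074))   # 2**52, no intermediate overflow

math.ldexp performs the final power-of-two scaling with a single rounding, which is where a small result would be correctly rounded to a subnormal.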

Each operation annotated with (silent) raises no IEEE 754 flags or exceptions. The normalize function returns the floating-point mantissa of its argument together with a power-of-two scaling factor: for the divisor it returns $y'$ and $s$ such that $y' = s \times y$ with $1 \le |y'| < 2$, and for the dividend it returns $x'$ and $t$ such that $x = t \times x'$ with $1 \le |x'| < 2$. With these conventions $\frac{x}{y} = \frac{x'}{y'} \times s \times t$, which the final scaling step reconstructs.
If $y = \pm 0$, $y = \pm \infty$ or $y = NaN$ then $normalize(y) = y, 1.0$, and normalize never raises any IEEE 754 flags.

Assuming $x$ and $y$ are finite non-zero numbers, $1 \le |x'| < 2$ and $1 \le |y'| < 2$, which means $\frac{1}{2} < | \frac{x'}{y'} | < 2$.

The floating-point seed-for-division function, $fsdiv(x, y)$, is a specific instruction which raises the special-case IEEE flags as if it computed $\circ(\frac{x}{y})$, but does not compute the actual result (nor detect and raise flags for numerical overflow or underflow). It can be implemented easily and cheaply, and already exists in some architectures.

Managing overflow and underflow

We must ensure that no loss of precision occurs during the intermediate iteration and that the final result is correctly rounded (even if subnormal).
To achieve this we rely on the normalized inputs $x'$ and $y'$ and on the normalization factors $s$ and $t$:
$s = 2^p$ and $t = 2^q$, with $e_{min} \le p, q \le e_{max}$ (e.g. $e_{min} = -149$, $e_{max} = 127$ for the $fp_{32}$ format).
Although $a_1$ remains within $[\frac{1}{2}, 2]$, the product $s \times t$ may overflow or underflow.
Thus we introduce the extra scaling factors $n$ and $m$, which bring $a_1 \times m$ into $[1, 2[$;
if $f = n \times s \times t$ overflows then $R$ overflows, and conversely if $f$ is smaller than the smallest normal number and $a_1$ is inexact then $R$ underflows.
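A small binary64 illustration of why the ordering of the products matters; the values are arbitrary, chosen so that $s \times t$ overflows on its own while the final result is representable:

a1, s, t = 0.75, 2.0 ** 900, 2.0 ** 124
n, m = 0.5, 2.0                     # a1 < 1, so m brings a1 * m into [1, 2[
naive = a1 * (s * t)                # inf: s * t overflows by itself
safe = (a1 * m) * ((n * s) * t)     # 1.5 * 2**1023, representable
print(naive, safe)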


The following code, based on pythonsollya [3], implements a quick and dirty test to measure the method's error.


# -*- coding: utf-8 -*-
import sollya
import random

sollya.settings.display = sollya.hexadecimal

def sollya_fma(x, y, z, precision, rndmode=sollya.RN):
    """ implement fused multiply and add using sollya.
        The internal precision must be enough so that x * y
        is computed exactly """
    return sollya.round(x * y + z, precision, rndmode)

def sollya_mul(x, y, precision, rndmode=sollya.RN):
    """ implement multiply using sollya """
    return sollya.round(x * y, precision, rndmode)

def iteration(x, y, approx_div, approx_recp, precision=sollya.binary64):
    """ implementation of refinement iteration """
    e_n = sollya_fma(y, -approx_div, x, precision)
    r_n = sollya_fma(approx_recp, e_n, approx_div, precision)
    return r_n


def my_division(x, y, precision=sollya.binary64, rndmode=sollya.RN):
    """ full division routine """
    approx_recp = sollya.round(1.0 / y, precision, rndmode)
    approx_div = sollya_mul(approx_recp, x, precision)
    NUM_ITER = 1
    for i in range(NUM_ITER):
        approx_div = iteration(x, y, approx_div, approx_recp, precision)
    return approx_div


NUM_TEST = 1000000
PRECISION = sollya.binary64
RNDMODE = sollya.RN
num_error = 0

for i in range(NUM_TEST):
    input_value_x = sollya.round(random.random(), PRECISION, RNDMODE)
    input_value_y = sollya.round(random.random(), PRECISION, RNDMODE)
    expected = sollya.round(input_value_x / input_value_y, PRECISION, RNDMODE)
    result = my_division(input_value_x, input_value_y, PRECISION, RNDMODE)
    error = abs(expected - result)
    rel_error = abs(error / expected)
    if error != 0.0:
        num_error += 1
        print("{} / {} = {}, result is {}, error = {}".format(input_value_x, input_value_y, expected, result, error))


print("{} error(s) encountered".format(num_error))

References:

  • [1] Newton-Raphson Algorithms for Floating-Point Division Using an FMA, J.-M. Muller et al. (pdf)
  • [2] Proving the IEEE Correctness of Iterative Floating-Point Square Root, Divide, and Remainder Algorithms, M. Cornea-Hassegan (pdf)
  • [3] pythonsollya's gitlab (Python wrapper for the Sollya library)

Accelerating Erasure Coding with Bit Matrix Multiply


Introduction

In this article we present a method for the fast implementation of Erasure Coding primitives based on Galois Field arithmetic. The Jerasure library manual [4] (pdf) is a good introduction to the erasure-coding method considered in this article.

    For our purpose it suffices to say that we have $r$ partial stripes. Each stripe consists of $k$ $w$-bit data words, and we want to compute the $m$ $w$-bit parity words that complete each stripe.

Each parity word is computed as the dot product of the data stripe with a constant vector whose values depend on the index of the parity word being computed (this index ranges from $0$ to $m-1$): $P_j = \bigoplus_{i=0}^{k-1} D_i \times C_{j,i}$. The vector elements and dot-product results are in $GF(2^w)$.

Multiplication by a constant through vector by matrix multiplication


We will consider the primitive operation on a $w$-bit data word $D$, which must be multiplied by a $w$-bit constant $C$ to get a $w$-bit parity word $P$: $D \times C = P$

The method, due to Mastrovito (detailed in [5]), expands a multiplication by a constant and the modulo reduction into a multiplication by a bit matrix as follows:
$$
\begin{pmatrix}
data\ word
\end{pmatrix}
\times
\begin{pmatrix}
C \cdot X^0 \ [g] \\
C \cdot X^1 \ [g] \\
\vdots \\
C \cdot X^{w-1} \ [g]
\end{pmatrix}
=
\begin{pmatrix}
parity\ word
\end{pmatrix}
$$

The multiplicand matrix is built by expanding $C \cdot X^i \ [g]$ into a row of bits for each row $i$ of the matrix.
The matrix multiplication then accumulates the partial products $d_i \cdot (C \times X^i \ [g])$ to build the final result:
\begin{align}
 P & = D \times C \ [g] \\
   & = (\oplus_{i=0}^{w-1} d_i \cdot X^i) \times C \ [g] \\
   & = (\oplus_{i=0}^{w-1} d_i \cdot C \times X^i) \ [g] \\
   & = \oplus_{i=0}^{w-1} d_i \cdot (C \times X^i \ [g])
\end{align}

Written out with explicit bit vectors, the product takes the following shape:

$$
\begin{pmatrix} d_0 & d_1 & \dots & d_{w-1} \end{pmatrix}
\times
\begin{pmatrix}
c_{0,0} & c_{0,1} & \dots & c_{0,w-1} \\
c_{1,0} & c_{1,1} & \dots & c_{1,w-1} \\
\vdots  &         &       & \vdots \\
c_{w-1,0} & c_{w-1,1} & \dots & c_{w-1,w-1} \\
\end{pmatrix}
=
\begin{pmatrix}
p_0 & p_1 & \dots & p_{w-1}
\end{pmatrix}
$$
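The construction is easy to express in a few lines of Python. As an illustrative sketch we assume $w = 8$ and the reduction polynomial $g = X^8 + X^4 + X^3 + X^2 + 1$ (0x11d, common in erasure coding), and we pack each matrix row into an integer whose bit $j$ is $c_{i,j}$:

W = 8
G = 0x11D  # bit representation of the reduction polynomial g

def xtime(v):
    """Multiply v by X modulo g in GF(2**W)."""
    v <<= 1
    return v ^ G if v >> W else v

def mastrovito_rows(c):
    """Row i of the expanded matrix is the bit expansion of C * X**i mod g."""
    rows = []
    for _ in range(W):
        rows.append(c)
        c = xtime(c)
    return rows

def gf_mul_const(d, rows):
    """Vector-by-bit-matrix product over GF(2): xor row i when bit i of d is set."""
    p = 0
    for i in range(W):
        if (d >> i) & 1:
            p ^= rows[i]
    return p

print(hex(gf_mul_const(0x02, mastrovito_rows(0x80))))  # 0x1d: X * X**7 = X**8 mod g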

Multiplication parallelization

We have seen how a vector-by-matrix multiplication computes the product of one data word by a constant, yielding part of a parity word (those parts still have to be accumulated to get the actual parity word). We can extend the left-hand-side vector to a full matrix and implement multiple multiplications by the same expanded constant $C$, realizing a Single Instruction Multiple Data (SIMD) operation.
With a single operation implementing a multiplication between $w \times w$ bit matrices, we implement $w$ parallel multiplications by the constant $C$.
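Reusing gf_mul_const and mastrovito_rows from the sketch above, the emulated bit-matrix multiply applies the same expanded constant to $w$ data words; a dedicated instruction would perform this in a single operation:

def bmat_mul(data_words, rows):
    """Emulated w x w bit-matrix multiply: entry j of the result is
    data_words[j] multiplied by the constant whose expansion is `rows`."""
    return [gf_mul_const(d, rows) for d in data_words]

rows = mastrovito_rows(0x1D)
print([hex(p) for p in bmat_mul(list(range(1, 9)), rows)])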

Erasure code operation


The macro operation for computing erasure-coding parity is a matrix multiplication whose right-hand-side matrix elements are constants in $GF(2^w)$. This operation is decomposed into vector-by-constant-vector multiplications, themselves decomposed into multiplications by constants and $xor$ accumulations.
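Putting the pieces together with the helpers above (the 2-by-3 coding matrix here is an arbitrary illustration, not a real erasure code):

def encode_stripe(data, coding_matrix):
    """Compute the m parity words of a stripe: parity j is the GF(2**W)
    dot product of the k data words with row j of the coding matrix."""
    parities = []
    for constants in coding_matrix:      # one row of constants per parity word
        acc = 0
        for d, c in zip(data, constants):
            acc ^= gf_mul_const(d, mastrovito_rows(c))   # multiply, xor-accumulate
        parities.append(acc)
    return parities

print(encode_stripe([0x01, 0x02, 0x03], [[1, 1, 1], [1, 2, 4]]))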

References:

  • [1] Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions (pdf)
  • [2] Improving the coding speed of erasure codes with polynomial ring transforms (pdf)
  • [3] A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields, C. Paar (pdf)
  • [4] Jerasure library manual (pdf)
  • [5] Fast and Secure Finite Field Multipliers (pdf) by Pamula and Tisserand (contains description of Mastrovito's method)