Introduction to polynomial approximation (WiP)

The goal is to find the best polynomial to approximate a given function $f$ on a given interval $I$.
"Best" generally means the polynomial which minimizes the error (absolute or relative) with respect to the input function. This error is often measured in the $L^\infty$ norm (the maximum of the absolute difference between the function and the approximating polynomial).

Remez:

   The Remez algorithm finds the solution to a linear system of $n+2$ equations $\sum_i{a_i x_j^i} + (-1)^j \epsilon = f(x_j)$ for $j$ in $[0, n+1]$. It starts by setting the $x_j$ values to the extrema of the degree $n+1$ Chebyshev polynomial (mapped to the input interval through affine scaling). Solving the linear system provides a first set of $a_i$ coefficients. Then the extrema of $|P - f|$ are found and used as a new set of $x_j$, and the process is iterated. Theoretically, the process stops when the approximation errors at the $x_j$ are all equal in magnitude and of alternating sign; the returned $P$ is then the minimax approximation polynomial.
During the iterations, a reachable lower bound for the approximation error is the minimum of the absolute values of the extrema of the difference $P - f$ (de la Vallée Poussin theorem).
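
Below is a minimal numpy sketch (written for this post, not taken from the repository linked below) of the initialization and linear solve of a single Remez step; the exchange of extremal points needed to iterate is left out.

import numpy as np

def remez_step(f, n, a, b):
    """Build and solve the (n+2) x (n+2) Remez system at Chebyshev nodes of [a, b]."""
    j = np.arange(n + 2)
    # extrema of the degree n+1 Chebyshev polynomial, mapped from [-1, 1] to [a, b]
    x = (a + b) / 2 + (b - a) / 2 * np.cos(j * np.pi / (n + 1))
    # row j encodes: a_0 + a_1*x_j + ... + a_n*x_j^n + (-1)^j * eps = f(x_j)
    A = np.hstack([np.vander(x, n + 1, increasing=True), ((-1.0) ** j)[:, None]])
    sol = np.linalg.solve(A, f(x))
    return sol[:-1], sol[-1], x   # coefficients a_i, levelled error eps, nodes

# example: degree-5 approximation of exp on [0, log(2)]
coeffs, eps, nodes = remez_step(np.exp, 5, 0.0, np.log(2.0))
print("coefficients:", coeffs)
print("levelled error of the first iteration:", eps)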

Using Euclidean Lattices

Educational example of polynomial approximation: https://github.com/nibrunie/YourOwnPersonnalRemez

References

- Efficient L-inf polynomial approximations (https://hal.archives-ouvertes.fr/inria-00119513v1)

Testing metalibm from the demo Docker image

A docker image is available with a demonstration version of metalibm:
docker run -ti nibrunie/metalibm:demo

The basic metalibm commands can be tested within this image:
cd /home/metalibm
export PYTHONPATH=$PWD:$PYTHONPATH
export ML_SRC_DIR=$PWD
python3 metalibm_functions/ml_exp.py --auto-test 1000 --execute

The Dockerfile used to build this image is:
FROM ubuntu:latest

# updating repo list
RUN apt-get update

# installing basic packages (build, repo management, ...)
RUN apt-get install -y build-essential
RUN apt-get install -y git
RUN apt-get install -y wget

# install sollya's dependencies
RUN apt-get install -y libmpfr-dev libmpfi-dev libfplll-dev libxml2-dev

RUN apt-get install -y python3 python3-setuptools libpython3-dev python3-pip

# install sollya from its current weekly release (which implements sollya_lib_obj_is_external_data,
# contrary to the sollya 7.0 release)
WORKDIR  /home/
RUN wget http://sollya.gforge.inria.fr/sollya-weekly-08-18-2019.tar.gz
RUN tar -xzf sollya-weekly-08-18-2019.tar.gz
WORKDIR /home/sollya-weekly-08-18-2019/
RUN mkdir -p /app/local/python3/
RUN ./configure --prefix /app/local
RUN make -j8
RUN make install


# installing pythonsollya
WORKDIR /home/
ENV PATH=/app/local/bin:$PATH
ENV SOLLYA_DIR=/app/local/
ENV PREFIX=/app/local/
ENV INSTALL_OPTIONS="-t /app/local/"

# installing pythonsollya from gitlab version
RUN git clone https://gitlab.com/metalibm-dev/pythonsollya /home/new_pythonsollya/

# for python 3
WORKDIR /home/new_pythonsollya/
RUN make SOLLYA_DIR=/app/local/ PREFIX=/app/local/ INSTALL_OPTIONS="-t /app/local/python3/" PYTHON=python3 PIP=pip3 install

# checking pythonsollya install
RUN apt-get install python3-six
RUN LD_LIBRARY_PATH="/app/local/lib" python3 -c "import sollya"

# installing gappa
WORKDIR /home/
RUN apt-get install -y libboost-dev
RUN wget https://gforge.inria.fr/frs/download.php/file/37624/gappa-1.3.3.tar.gz
RUN tar -xzf gappa-1.3.3.tar.gz
WORKDIR /home/gappa-1.3.3/
RUN ./configure --prefix=/app/local/
RUN ./remake
RUN ./remake install

# downloading metalibm
WORKDIR /home/
RUN git clone https://github.com/kalray/metalibm.git
WORKDIR /home/metalibm/

# testing metalibm install
ENV LD_LIBRARY_PATH=/app/local/lib/
ENV PYTHONPATH=/app/local/python3/
RUN PYTHONPATH=$PWD:$PYTHONPATH  ML_SRC_DIR=$PWD python3 valid/soft_unit_test.py

Computing floating-point division from an accurate reciprocal [WiP]

This article addresses the problem of computing $\circ(\frac{x}{y})$ from $\circ(\frac{1}{y})$, targeting single and double floating-point precision.

New method description


The method we use is based on the end of a Newton-Raphson division iteration [1, 2]. The iteration starts with an accurate approximation $r = \circ(\frac{1}{y})$ of $\frac{1}{y}$ and $a_0 = \circ(x \times r)$.

Let $\delta_0$ be the absolute error and $\epsilon_0$ the relative error of $a_0$:
$\delta_0 = | \frac{x}{y} - \circ(x \times r)|$
$\delta_0 = | \frac{x}{y} - x \times \frac{1}{y} (1 + \epsilon_r) (1 +\epsilon_{mult}) |$
$\delta_0 = | \frac{x}{y} \times (\epsilon_r + \epsilon_{mult} + \epsilon_r . \epsilon_{mult}) | $
$\epsilon_0 = \frac{\delta_0}{|\frac{x}{y}|} \le | \epsilon_r + \epsilon_{mult} + \epsilon_r . \epsilon_{mult} |$

$r$ is assumed to be a correctly rounded approximation of $\frac{1}{y}$ and $a_0$ is also implemented as a correctly rounded multiplication, so $\epsilon_{mult}$ and $\epsilon_r$ are both bounded by $2^{-53}$ (for binary64), which means:

$\epsilon_0 \le 2^{-52} + 2^{-106}$: $a_0$'s error is a little greater than one $ulp$. Our goal is to get a final error of less than one half of one $ulp$ (i.e. a correctly rounded result).

Our method is based on the following (standard) iteration:
$ e_1 = x - y \times a_0 $
$ a_1 = a_0 + e_1 \times r $

A piece of pythonsollya code available at the end of this article can help measure the accuracy of this method in a few random test cases.

Without special care this method is not valid for all inputs:

  • $r$ overflows if $y \le 2^{-128}$ (binary32), even when $\frac{x}{y}$ itself is representable
  • computing $r$ raises DivideByZero when $y = \pm 0$, even in cases where $\frac{x}{y}$ would not raise it (e.g. when $x$ is $NaN$)

Special cases management


Once the numerical behavior is proven we can focus on the behavior on special cases.

(The $a_0$, $e_1$, $a_1$ columns are the intermediate values of the new method; DBZ = DivideByZero flag, IO = Invalid Operation flag.)

| $x$ | $y$ | $\frac{1}{y}$ | $\frac{x}{y}$ | $a_0$ | $e_1$ | $a_1$ |
| --- | --- | --- | --- | --- | --- | --- |
| $0$ | $0$ | $\infty$ DBZ | $qNaN$ IO | $qNaN$ IO | $qNaN$ | $qNaN$ |
| $0$ | Num | Num | $0$ | $0$ | $0$ | $0$ |
| $0$ | $\infty$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| $0$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $\infty$ | $0$ | $\infty$ DBZ | $\infty$ | $\infty$ | $qNaN$ IO | $qNaN$ |
| $\infty$ | Num | Num | $\infty$ | $\infty$ | $qNaN$ IO | $qNaN$ IO |
| $\infty$ | $\infty$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| $\infty$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| Num | $0$ | $\infty$ DBZ | $\infty$ DBZ | $\infty$ | $qNaN$ IO | $qNaN$ |
| Num | Num | Num (A) | Num (A) | Num | Num | Num |
| Num | $\infty$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| Num | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | $0$ | $\infty$ DBZ | $\infty$ DBZ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | Num | Num (A) | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | $\infty$ | $0$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |
| $NaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ | $qNaN$ |

As illustrated by the table above, most cases behave similarly whether we compute $\frac{x}{y}$ directly or first compute $\frac{1}{y}$ as a single floating-point operation and then use the iteration. Here behaving similarly means returning the same results and raising the same exception flags.
   However, there are a few cases where a spurious flag can be raised or an incorrect result returned. We have to carefully consider the flags raised during each step of the new method, $\circ(\frac{1}{y})$, $a_0$, $e_1$, $a_1$, as those sticky flags accumulate during the computation.
    The case (A) requires specific attention. An overflow in $\frac{1}{y}$ can occur (with or without the IEEE overflow flag) while $\frac{x}{y}$ remains in the range of normal numbers: for example if $y=2^{-149}$ (the smallest subnormal number) and $x = 2^{-126}$ (the smallest normal number), $\circ_{RN}(\frac{1}{y}) = +\infty $ but $\circ_{RN}(\frac{x}{y}) = 2^{23}$.

To avoid intermediary overflows or underflows or spurious exceptions we modify the method as follows:

$ y', s = normalize(y) $ (silent)
$ x', t = normalize(x) $ (silent)
$ r = \circ(\frac{1}{y'})$ (silent)
$ - = fsdiv(x, y) $ (result discarded)
$ a_0 = \circ(x' \times r)$
$ e_1 = x' - y' \times a_0 $ (silent)
$ a_1 = a_0 + e_1 \times r $ (silent)
$ n, m = 1, 1 $ if $ x' > y' $, else $ n, m = 0.5, 2 $
$ R = (a_1 \times m) \times (n \times s \times t) $

Each operation annotated with (silent) raises neither IEEE-754 flags nor exceptions. For a number $y$, the normalize function returns the floating-point mantissa $y'$ of $y$ and the reciprocal $s$ of the scaling factor, such that $y'$ and $s$ are floating-point numbers, $y' = s \times y$, $1 \le y' < 2$, and $s$ is a power of 2.
If $y=\pm 0$, $y = \pm \infty$ or $y=NaN$ then $normalize(y) = y, 1.0 $.
normalize does not raise any IEEE-754 flags.
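
As an illustration, here is a minimal Python sketch, for binary64 inputs, of such a normalize function built on math.frexp; the flag-silencing aspect is not modeled, and zero, infinity and NaN follow the convention above.

import math

def normalize(y):
    """Return (y', s) with y' = s * y, 1 <= |y'| < 2 and s a power of 2.
       Zero, infinity and NaN are returned unchanged with s = 1.0."""
    if y == 0.0 or math.isinf(y) or math.isnan(y):
        return y, 1.0
    m, e = math.frexp(y)           # y = m * 2**e with 0.5 <= |m| < 1
    y_prime = m * 2.0              # bring the mantissa into [1, 2)
    s = math.ldexp(1.0, -(e - 1))  # s = 2**-(e-1), so that y' = s * y
    return y_prime, s

# quick check
for y in (3.5, -0.0625, 2.0**-149):
    y_prime, s = normalize(y)
    print(y, y_prime, s, y_prime == s * y)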

Assuming $x$ and $y$ are non-zero finite numbers, $1 \le |x'| < 2$ and $1 \le |y'| < 2$, which means $ \frac{1}{2} < | \frac{x'}{y'} | < 2 $.

The floating-point seed for division, $fsdiv(x, y)$, is a specific instruction which raises all the special-case IEEE flags as if it computed $\circ(\frac{x}{y})$ but does not compute the actual result (nor detect and raise flags for numerical overflow or underflow). It can be implemented easily and cheaply (and already exists in some architectures).

Managing overflow and underflow

We must ensure that no loss of precision occurs during the intermediate iteration, while the final result is correctly rounded (even if subnormal).
To implement this we rely on the normalized inputs $x'$ and $y'$ and on the normalization factors $s$ and $t$:
$s = 2^p$ and $t = 2^q$, with $e_{min} \le p, q \le e_{max} $ (e.g. $ e_{min} = -149$, $e_{max}=127$ for the binary32 format).
While $a_1$ remains within $[\frac{1}{2}, 2]$, the product $s \times t$ may overflow or underflow.
Thus we introduce the extra scaling factors $n$ and $m$, which bring $a_1 \times m$ into $[1, 2[$;
if $f = n \times s \times t$ overflows then $R$ overflows, and reciprocally, if $f$ is less than the smallest normal number and $a_1$ is inexact then $R$ underflows.


The following code, based on pythonsollya, implements a quick and dirty test to check the method error.


# -*- coding: utf-8 -*-
import sollya
import random

sollya.settings.display = sollya.hexadecimal

def sollya_fma(x, y, z, precision, rndmode=sollya.RN):
    """ implement fused multiply and add using sollya.
        The internal precision must be enough so that x * y
        is computed exactly """
    return sollya.round(x * y + z, precision, rndmode)

def sollya_mul(x, y, precision, rndmode=sollya.RN):
    """ implement multiply using sollya """
    return sollya.round(x * y, precision, rndmode)

def iteration(x, y, approx_div, approx_recp, precision=sollya.binary64):
    """ implementation of refinement iteration """
    e_n = sollya_fma(y, -approx_div, x, precision)
    r_n = sollya_fma(approx_recp, e_n, approx_div, precision)
    return r_n


def my_division(x, y, precision=sollya.binary64, rndmode=sollya.RN):
    """ full division routine """
    approx_recp = sollya.round(1.0 / y, precision, rndmode)
    approx_div = sollya_mul(approx_recp, x, precision)
    NUM_ITER = 1
    for i in range(NUM_ITER):
        approx_div = iteration(x, y, approx_div, approx_recp, precision)
    return approx_div


NUM_TEST = 1000000
PRECISION = sollya.binary64
RNDMODE = sollya.RN
num_error = 0

for i in range(NUM_TEST):
    input_value_x = sollya.round(random.random(), PRECISION, RNDMODE)
    input_value_y = sollya.round(random.random(), PRECISION, RNDMODE)
    expected = sollya.round(input_value_x / input_value_y, PRECISION, RNDMODE)
    result = my_division(input_value_x, input_value_y, PRECISION, RNDMODE)
    error = abs(expected - result)
    rel_error = abs(error / expected)
    if error != 0.0:
        num_error += 1
        print("{} / {} = {}, result is {}, error = {}".format(input_value_x, input_value_y, expected, result, error))


print("{} error(s) encountered".format(num_error))

References:

  • [1] Newton-Raphson Algorithms for Floating-Point Division Using an FMA, JM Muller et al. (pdf)
  • [2] Proving the IEEE Correctness of Iterative Floating-Point Square Root, Divide, and Remainder Algorithms, M Cornea-Hassegan (pdf)
  • [3] pythonsollya's gitlab (Python wrapper for the Sollya library)

Accelerating Erasure Coding with Bit Matrix Multiply


Introduction

In this article we will present a method for fast implementation of Erasure Coding primitives based on Galois Field arithmetic. The Jerasure library manual (pdf) is a good introduction to the erasure coding method considered in this article.

    For our purpose it suffices to say that we have $r$ partial stripes. Each stripe consists of $k$ $w$-bit data words, and we want to compute the $m$ $w$-bit parity words that complete each stripe.

Each parity word is computed as a dot-product of the data stripe with a constant vector whose values depend on the index of the parity word we want to compute (this index ranges from
$0$ to $m-1$). The vector elements and dot-product results are in $GF(2^w)$.

Multiplication by a constant through vector by matrix multiplication


We will consider the primitive operation on a $w$-bit data word $D$, which must be multiplied by a $w$-bit constant $C$ to get a $w$-bit parity word $P$: $D \times C = P$

The method, due to Mastrovito (detailed in [5]), expands the multiplication by a constant and the modulo reduction into a multiplication by a bit matrix as follows:
$$
\begin{pmatrix}
data word
\end{pmatrix}
\times
\begin{pmatrix}
C . X^0 [g] \\
C . X^1 [g] \\
\vdots \\
C . X^{w-1} [g]
\end{pmatrix}
=
\begin{pmatrix}
parity word
\end{pmatrix}
$$

The multiplicand matrix is built by expanding $C . X^i \ [g] $ for each row $i$ of the matrix.
The matrix multiplication then accumulates the partial products $d_i . (C \times X^i \ [g]) $ to build the final result:
\begin{align}
 P & = D \times C [g] \\
    & = (\oplus_{i=0}^{w-1} d_i \times X^i) \times C  [g] \\
    & = (\oplus_{i=0}^{w-1} d_i . C \times X^i)   [g] \\
    & = \oplus_{i=0}^{w-1} d_i . (C \times X^i [g] )
\end{align}

$$
\begin{pmatrix} d_0 & d_1 & \dots & d_{w-1} \end{pmatrix}
\times
\begin{pmatrix}
c_{0,0} & c_{0,1} & \dots & c_{0,w-1} \\
c_{1,0} & c_{1,1} & \dots & c_{1,w-1} \\
\vdots    &               &          &  \vdots \\
c_{w-1,0} & c_{w-1,1} & \dots & c_{w-1,w-1} \\
\end{pmatrix}
=
\begin{pmatrix}
p_0 & p_1 & \dots & p_{w-1}
\end{pmatrix}
$$
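
To make the construction concrete, here is a small Python sketch for $GF(2^8)$; the field polynomial $g = x^8+x^4+x^3+x^2+1$ (0x11d) and all helper names are assumptions made for this illustration. It builds the bit matrix of a constant $C$, multiplies a data byte by XOR-ing the rows selected by its bits, and checks the result against a direct carry-less multiplication followed by reduction.

GF_POLY = 0x11d  # assumed field polynomial g for GF(2^8)
W = 8

def gf_mul(a, b, poly=GF_POLY, w=W):
    """Reference multiplication in GF(2^w): carry-less product then reduction by g."""
    result = 0
    for i in range(w):
        if (b >> i) & 1:
            result ^= a << i
    for i in range(2 * w - 2, w - 1, -1):   # reduce modulo g
        if (result >> i) & 1:
            result ^= poly << (i - w)
    return result

def mastrovito_matrix(c, poly=GF_POLY, w=W):
    """Row i of the matrix is C * X^i [g], following the construction above."""
    rows = []
    row = c
    for _ in range(w):
        rows.append(row)
        row <<= 1                # multiply by X
        if row >> w:             # reduce modulo g when degree reaches w
            row ^= poly
    return rows

def mastrovito_mul(d, matrix, w=W):
    """Multiply data word d by the constant via the vector-by-bit-matrix product."""
    p = 0
    for i in range(w):
        if (d >> i) & 1:         # bit d_i selects row C * X^i [g]
            p ^= matrix[i]
    return p

# quick consistency check against the reference multiplication
C = 0x53
M = mastrovito_matrix(C)
for d in range(256):
    assert mastrovito_mul(d, M) == gf_mul(d, C)
print("bit-matrix multiplication matches gf_mul for all 256 data bytes")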

Multiplication parallelization

We have seen how to use a vector by matrix multiplication to compute the product of a data word by a constant, yielding part of a parity word (we still have to accumulate those parts to get the actual parity word). We can extend the left-hand-side vector to a full matrix and implement multiple multiplications by the same expanded constant $C$, realizing a Single Instruction Multiple Data (SIMD) operation.
With a single operation implementing a multiplication between $w \times w$ matrices, we implement $w$ parallel multiplications by the constant $C$.

Erasure code operation


The macro operation to compute erasure coding parity is a matrix multiplication where the right-hand-side matrix elements are constants in $GF(2^w)$. This operation is decomposed into vector by constant-vector multiplications, themselves decomposed into multiplications by constants and $xor$ accumulations.

References:

  • [1] Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions (pdf)
  • [2] Improving the coding speed of erasure codes with polynomial ring transforms (pdf)
  • [3] A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields, C. Paar (pdf)
  • [4] Jerasure library manual (pdf)
  • [5] Fast and Secure Finite Field Multipliers (pdf) by Pamula and Tisserand (contains description of Mastrovito's method)

Computing floating-point approximation of log(1+x)

The log1p / log1pf standard functions are part of any math library. These functions compute respectively a double precision and a single precision approximation of $log(1+x)$ for their double precision (resp. single precision) input $x$. Such functions exist in the standard C library, and also in C++ ([1]), Python ([2]) and numerous other bindings.

log1p is similar (in essence) to expm1: it offers a more accurate implementation of a composition of a logarithm (resp. exponential) and a linear function which are regularly used together. But it exhibits new challenges: the implementation must be very accurate around 0 to ensure the expected numerical properties (e.g. faithful rounding).

Let us first study the possible approximation on two easy intervals before considering a more generic approximation.

Approximation of log(1+x) around 0

In an interval close to 0, say $[-\epsilon, \epsilon]$, $log(1+x)$ can be accurately approximated by a small polynomial. This polynomial can be obtained with the Remez algorithm, for example.


Using the following piece of code (relying on pythonsollya), we can see that only a degree 6 polynomial is required to get an approximation of $\frac{log(1+x)}{x}$ accurate to 53 bits over $[-2^{-7}, 2^{-7}]$. $\frac{log(1+x)}{x}$ is targeted rather than $log(1+x)$ because the latter is very close to $x$ around 0; dividing by $x$ allows searching for a polynomial whose first coefficient is non-zero.


>>> import sollya
>>> sollya.guessdegree(sollya.log1p(sollya.x)/sollya.x, sollya.Interval(-2**-7, 2**-7), 2**-53)
[6;6]
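
As a hedged follow-up sketch (assuming pythonsollya exposes the Sollya commands remez and dirtyinfnorm with the same arguments as in the Sollya language), one can ask for the degree-6 minimax polynomial itself and evaluate its approximation error, which should be around $2^{-53}$ or below:

>>> f = sollya.log1p(sollya.x) / sollya.x
>>> I = sollya.Interval(-2**-7, 2**-7)
>>> p = sollya.remez(f, 6, I)
>>> sollya.dirtyinfnorm(p - f, I)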

Approximation of log(1+x) for values less than -0.5

On $]-1, -0.5]$, it is easy to see (using Sterbenz's lemma) that the floating-point sum $1+x$ is exact, thus one can fall back to $\circ(log(\circ(1+x)))$ to implement log1p(x) accurately, as $\circ(1+x) = 1+x$.
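
A quick empirical check of this exactness claim (not a proof), comparing the rounded sum with the exact rational sum for random inputs in $]-1, -0.5]$:

import random
from fractions import Fraction

# for x in ]-1, -0.5], the rounded sum 1.0 + x should equal the exact sum
for _ in range(100000):
    x = -random.uniform(0.5, 1.0)
    assert Fraction(1.0 + x) == 1 + Fraction(x)
print("1 + x was exact for every sampled x")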

Generic approximation

The floating-point addition $\circ(1+x)$ is not exact for most values of $x$. In fact, for $x$ between 2 and $2^{p-1}$, it is only inexact for the few $p$-bit values matching the pattern $2^k - 1 + r$, with $ 2 \le k < p -1 $, $r < 1$.

But we will set this fact aside and focus on a more generic approximation, using the floating-point decomposition of $x$. We will assume the availability of a fast reciprocal operation $fast\_rcp$ which returns an approximation of the reciprocal accurate to $f$ bits. We will use the standard decomposition $x = m \cdot 2^{e}$, where $m$ is $x$'s (signed) mantissa and $e$ is its exponent.

$1+x = 1 + m \cdot 2^e = 2^e \cdot 2^{-e} + m \cdot 2^e = 2^e \times (m + 2^{-e})$ (1)

$ log(1+x) = e \times log(2) + log (m + 2^{-e}) $ (2)

Let us rename $t = m + 2^{-e}$ and let us consider $r = fast\_rcp(t)$, which means $r \sim \frac{1}{m + 2^{-e}} $

$m + 2^{-e} = \frac{(m + 2^{-e}) \times r}{r}$ (3)
$ log (m + 2^{-e}) = log(t.r) - log(r) $ (4)

$ r \times 2^{-e}$ is exact, because it is a multiplication by a power of $2$.

$ log(1+x) = e \times log(2) + log(t.r) - log(r) $  (5)

We would like to tabulate $-log(r)$; $log(2)$ is a constant and $log(t.r)$ can be approximated by a polynomial.
As $r \sim \frac{1}{t}$, $r.t$ is close to one, with an error corresponding to the error of the reciprocal approximation: $ | 1 - r.t |  < 2^{-f} $.

However, as $-e$ can become large, the range of $t$ is quite wide: it is not possible to limit the table for $r$ to a few hundred entries. We suggest rewriting the equation as follows:

$t = m_t \times 2^{e_t}$, $r' = fast\_rcp(m_t)$ (6)
$ log(t) = log(m_t \times 2^{e_t} \times \frac{r'}{r'}) $ (7)
$ log(t) = e_t \times log(2) + log(m_t \times r') - log(r')  $ (8)

With $m_t \times r' \sim 1$ and $m_t \in [1, 2[$ we get $r' \in [0.5, 1]$. The values of $r'$ correspond to a few discrete values across $[0.5, 1]$, determined by the implementation of $fast\_rcp$.
As those values are known statically, it becomes possible to tabulate $log(r')$ easily. For example, if $fast\_rcp$ provides an approximation based on the first $f$ bits of the mantissa of $m_t$, the table has $2^f$ entries and can be indexed by those first $f$ bits of $m_t$'s mantissa (as the mapping between those bits and $r'$ is deterministic, it is easy to tabulate $log(r')$ for every possible value of $r'$). This is the case, for example, on Intel's IA architecture, where $f=12$.
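
To illustrate the reduction, here is a toy Python sketch of equations (1) to (8) in binary64 host arithmetic. It is only a sketch: the hypothetical $fast\_rcp$ below truncates its argument to its first $f$ fraction bits before inverting it, and the table size, the helper names and the direct evaluation of $log(m_t \times r')$ (in place of a polynomial) are all assumptions made for this illustration, not the metalibm implementation.

import math

F = 7  # number of mantissa bits used by the hypothetical fast_rcp

def fast_rcp(m_t):
    # hypothetical reciprocal seed: it only looks at the first F fraction bits
    # of m_t (with m_t in [1, 2)), so r' is a deterministic function of those bits
    m_trunc = math.floor(m_t * 2**F) / 2**F
    return 1.0 / m_trunc

# table of log(r') indexed by the first F fraction bits of m_t
LOG_R_TABLE = [math.log(fast_rcp(1.0 + i / 2**F)) for i in range(2**F)]
LOG2 = math.log(2.0)

def my_log1p(x):
    # (1)-(2): with x = m * 2^e, 1 + x = 2^e * (m + 2^-e)
    m, e = math.frexp(x)               # x = m * 2^e, 0.5 <= |m| < 1
    t = m + 2.0 ** -e
    # (6): t = m_t * 2^e_t with m_t in [1, 2)
    m_t, e_t = math.frexp(t)
    m_t, e_t = 2.0 * m_t, e_t - 1
    index = math.floor((m_t - 1.0) * 2**F)
    r = fast_rcp(m_t)                  # equals the tabulated r' for this index
    # (8): log(t) = e_t*log(2) + log(m_t * r') - log(r');
    # log(m_t * r') is evaluated directly here instead of with a polynomial
    log_t = e_t * LOG2 + math.log(m_t * r) - LOG_R_TABLE[index]
    # (2): log(1 + x) = e*log(2) + log(t)
    return e * LOG2 + log_t

for x in (1e-3, 0.75, 13.0, -0.25):
    print(x, my_log1p(x), math.log1p(x))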

   
Once the different terms of Equation (8) are obtained, we can reconstruct the final $log(t)$. We have to be careful with catastrophic cancellation, which is the main pitfall to watch for in floating-point approximations. In our current implementation, this is solved by using multi-word arithmetic for the final operations, before a final rounding to the target precision. But this could be further optimized.

Source code

The meta-implementation of this method can be found in the metalibm code generator function library (metalibm_functions/ml_log1p.py). Below are two command line examples to generate single and double precision source code:


python3 metalibm_functions/ml_log1p.py --precision binary32
python3 metalibm_functions/ml_log1p.py --precision binary64


Conclusion

There is nothing new in this method, but listing the details here can be useful to understand how a floating-point approximation can be built. Comments are welcome.

Thank you to Hugo B. for proofreading this article. This article, initially published on March 19th 2019, was updated on March 30th 2019 (with typo fixes and clarifications).

References:
     The use of $fast\_rcp$ for logarithm implementation was first described in P. Markstein's book IA-64 and Elementary Functions: Speed and Precision.

Docker Power

Docker hype is gone, long live docker ...

   A coworker of mine told me just that when I explained to him that I had just re-discovered the existence of Docker. To be clear, I think I missed the train on that one by a long shot. Docker has been around and growing for a few years now, since its initial public release in March 2013.

What is docker ?

  Let us start with what it is not. It is not a virtual machine system (not per se), and it is not a lazy way for lazy developers to package a simple application and its dependencies (or is it?). It is a lightweight, elegant (does it matter?) and interesting way to package / contain (in the sense of separating, restraining and controlling) and deploy / distribute applications. The word application should be understood as some large runtime with very specific interactions and dependencies that you wish to deploy on a certain number of servers. I do not think it should be used to package a basic application with very basic dependencies that is supposed to run once.

Resources for using Docker on Ubuntu

A simple tutorial can be found here: https://docs.docker.com/install/linux/docker-ce/ubuntu/. The next stop is reading about Dockerfiles (a kind of Makefile for a docker image): https://docs.docker.com/engine/reference/builder/#usage . Finally, you should have a look at Docker's hub (https://hub.docker.com/), which stores and indexes pre-built docker images. You can browse through it to look for the basis for your own image.

My day to day use of docker

I now use docker on a daily basis for metalibm (https://github.com/kalray/metalibm) and pythonsollya (https://gitlab.com/metalibm-dev/pythonsollya) integration. Those two tools have a lot of dependencies which are not available as standard Linux packages. It was very easy to build a docker container, use the gitlab registry, and fire it up every time I want to validate a change.

References

  1. Docker official website https://www.docker.com/
  2. Wikipedia's page (always interesting) https://en.wikipedia.org/wiki/Docker_(software)
  3. Docker's hub (registry for docker images): https://hub.docker.com/


Determining numerical bound for floating-point reciprocal overflow


This article is both an example of pythonsollya use and a small study of a floating-point property of the reciprocal function.

Context and objectives

Our goal is to find a bound $b$ such that computing $\frac{1}{x}$ for any $x \lt b $ raises an IEEE overflow exception (and returns the value corresponding to an overflow in the given rounding mode). $b^*$ denotes the exact value and $b$ the floating-point number of precision $p$ which verifies the above property. Let us note that $b$ is not necessarily equal to $b^*$ rounded in a pre-defined rounding mode.

We will use the following notations: $\circ$ is the rounding operation, which can be specialized by a rounding mode; $RNE$ is rounding to nearest (ties-to-even), $RU$ is rounding up (towards $+\infty$) and $RD$ is rounding down (towards $-\infty$); $next(x)$ is the minimal floating-point number strictly greater than $x$ and $prev(x)$ is the maximal floating-point number strictly lower than $x$.

Let us first define an overflow bound:

Definition 1: $b$ (respectively $b^*$) is defined as the floating-point (resp. exact) overflow bound of the reciprocal function if it is the minimal floating-point (resp. real) number which verifies: $ \circ(\frac{1}{b})$ does not overflow.

Section 7.4.0 of the IEEE-754 standard defines overflow as: "The overflow exception shall be signaled if and only if the destination format's largest finite number is exceeded in magnitude by what would have been the rounded floating-point result (...) were the exponent range unbounded." This means that $b$ should depend on the rounding mode.

Let us first consider $b_{RNE}$, the bound associated with the round-to-nearest (ties-to-even) mode, defined as the maximal $x$ such that: $ \frac{1}{x} \ge B_{RNE} = \omega + \frac{1}{2} ulp(\omega) $.
$B_{RNE}$ is the mid-point between the largest floating-point number $\omega$ and the first value above it (defined with an unbounded exponent range) which will be rounded to $+\infty$ as per Section 7.4.
Note that the inequality does not need to be strict: since $\omega$ has an odd mantissa ($1 ... 1$), if $\frac{1}{x} = B_{RNE}$ it will be rounded to a value greater than $\omega$ because of the ties-to-even rule, and thus it will overflow.
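
A quick pythonsollya check of this tie-breaking argument (a standalone snippet that rebuilds $\omega$ and $B_{RNE}$ exactly as in the session further below); we expect the rounded result to be infty, since ties-to-even pushes the midpoint above $\omega$:

>>> import sollya
>>> S2 = sollya.SollyaObject(2)
>>> omega = S2**127 * (S2**24 - 1) * S2**-23
>>> B_RNE = omega + S2**103
>>> sollya.round(B_RNE, sollya.binary32, sollya.RN)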

$p$ is the format precision (e.g. $p=24$ for binary32 format).
$ b^*_{RNE} = \frac{1}{B_{RNE}} = \frac{1}{\omega + 2^{emax - p}} $
$ b^*_{RU} = \frac{1}{B_{RU}} = \frac{1}{\omega} $

Evaluating bound values

Let us compute (approximations to) $b^*_{RNE}$ and $b^*_{RU}$ using pythonsollya:

# importing pythonsollya
>>> import sollya
# forcing display setting to hexadecimal value
# (easier to read)
>>> sollya.settings.display = sollya.hexadecimal
# building 2.0 as a Sollya Object
>>> S2 = sollya.SollyaObject(2)
# building binary32's omega value (greatest floating-point number)
>>> omega = S2**127 * (S2**24 - 1) * S2**-23
>>> omega
0x1.fffffep127
>>> B_RU = omega
# omega + 1/2 ulp(omega)
>>> B_RNE = omega  + S2**103
# rounding 1 / B_RU with 100-bit precision
>>> sollya.round(1 / B_RU, 100, sollya.RD)
0x1.000001000001000001000001p-128
# rounding 1 / B_RNE with 100-bit precision
>>> sollya.round(1 / B_RNE, 100, sollya.RD)
0x1.0000008000004000002p-128
# rounding 1 / B_RNE to binary32 precision
>>> sollya.round(1 / B_RNE, sollya.binary32, sollya.RN)
0x1p-128
# rounding 1 / B_RU to binary32 precision
>>> sollya.round(1 / B_RU, sollya.binary32, sollya.RN)
0x1p-128

The first interesting values are the correctly rounded 100-bit approximations of $b_{RU} = \frac{1}{B_{RU}}$ and $b_{RNE} = \frac{1}{B_{RNE}}$ computed above. Let us first notice that those two values are greater than their counterparts rounded to binary32 (the last two results of the session).
Let us call $c_- = 2^{-128}$ and $c_+ = next(c_-) = 2^{-128} + 2^{-149} $; we have $c_- \lt b_{RU} \lt c_+ $ and $c_- \lt b_{RNE} \lt c_+ $, which means $ \frac{1}{c_-} \gt \frac{1}{b_{RNE}} \gt \frac{1}{c_+} $ and $ \frac{1}{c_-} \gt \frac{1}{b_{RU}} \gt \frac{1}{c_+} $, and $\forall x , x \le c_-, \frac{1}{x} \gt \frac{1}{b_{RNE}}$.

Claim 1: $ \forall mode, \circ_{mode}(\frac{1}{c_+}) $ is a normal value.
Claim 2: $ \forall mode, \circ_{mode}(\frac{1}{c_-}) $ overflows.

>>> c_minus = S2**-128
>>> c_plus = S2**-128 + S2**-149
>>> c_minus
0x1p-128
>>> c_plus
0x1.000008p-128
>>> for rnd_mode in [sollya.RN, sollya.RU, sollya.RD, sollya.RZ]:
...     print(sollya.round(1 / c_minus, sollya.binary32, rnd_mode), sollya.round(1 / c_plus, sollya.binary32, rnd_mode))
infty 0x1.fffffp127
infty 0x1.fffff2p127
0x1.fffffep127 0x1.fffffp127
0x1.fffffep127 0x1.fffffp127

$c_+ $ is greater than both $b^*_{RU}$ and $b^*_{RNE}$, and $c_-$ is strictly lower than both of them. Thus, as there is no floating-point number between $c_-$ and $c_+$, we can choose $c_-$ as the overflow bound (included) or $c_+$ (excluded). There is no need to distinguish several values of $c$ based on the rounding mode: $\forall \ x \ge c_+$, $\frac{1}{x}$ will not overflow, and $\forall \ x \le c_-$, $\frac{1}{x}$ will overflow in any rounding mode.

Let us now consider the remaining rounding modes (rounding down and towards zero). Without loss of generality we can reduce the study to positive overflow and thus to a single rounding mode, let us say rounding down.
$B_{RD} = next(\omega) = 2^{128} $ (assuming unbounded exponent range)


>>> B_RD = S2**128
>>> b_RD = S2**-128
>>> sollya.round(1 / B_RD, sollya.binary32, sollya.RD)
0x1p-128
>>> sollya.round(1 / B_RD, sollya.binary32, sollya.RU)
0x1p-128
>>> sollya.round(1 / c_minus, sollya.binary32, sollya.RD)
0x1.fffffep127
>>> sollya.round(1 / c_plus, sollya.binary32, sollya.RD)
0x1.fffffp127

$\frac{1}{B_{RD}}$ is exactly equal to the floating-point number $c_-$, and $\frac{1}{c_-}$ will indeed lead to an overflow even in round-down mode. This overflow returns the value $\omega$ as output.
$\circ_{RD}(\frac{1}{c_+}) = \omega - 2^{104} - 2^{105} - 2^{106} $, which is a normal number strictly lower than $\omega$. Thus we get the same behavior as for $RNE$ and $RU$: $c_-$ is the overflow bound (included) and $c_+$ is the excluded overflow bound $\square$.

References:

  • Sollya: a tool for manipulating arbitrary precision numbers, computing approximation polynomial ... (Sollya's website)
  • pythonsollya: a python wrapper for sollya (pythonsollya's gitlab)
  • Quick introduction to IEEE-754 floating-point arithmetic standard (web)