Determining numerical bound for floating-point reciprocal overflow


This article is both an example of pythonsollya use and a small study of a floating-point property of the reciprocal function.

Context and objectives

Our goal is to find a bound $b$ such that computing $\frac{1}{x}$ for any $x \lt b$ raises an IEEE overflow exception (and returns the value corresponding to an overflow in the given rounding mode). $b^*$ denotes the exact value and $b$ the floating-point number of precision $p$ that verifies the above property. Let us note that $b$ is not necessarily equal to $b^*$ rounded in a pre-defined rounding mode.

We will use the following notations: $\circ$ is the rounding operation, which can be specialized by a rounding mode, $RNE$ is rounding to nearest (tie-to-even), $RU$ is rounding up (to $+\infty$) and $RD$ is rounding downward (to $-\infty$), $next(x)$ is the minimal floating-point number strictly greater than $x$ and $prev(x)$ is the maximal floating-point number strictly lower than $x$.
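For concreteness, $next$ and $prev$ can be computed on binary32 by stepping the value's integer encoding. The helpers below are a stdlib-only Python sketch (the names are mine, and they are only valid for positive finite inputs):

```python
import struct

def f32_bits(x):
    """Integer encoding of x rounded to binary32."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b):
    """binary32 value corresponding to a 32-bit encoding."""
    return struct.unpack('<f', struct.pack('<I', b))[0]

def next_f32(x):
    """next(x): minimal binary32 strictly greater than x (positive x only)."""
    return bits_f32(f32_bits(x) + 1)

def prev_f32(x):
    """prev(x): maximal binary32 strictly lower than x (positive x only)."""
    return bits_f32(f32_bits(x) - 1)
```

This works because, for positive values, the binary32 encoding is monotonic when read as an unsigned integer, including across the subnormal range.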

Let us first define an overflow bound:

Definition 1: $b$ (respectively $b^*$) is defined as the floating-point (resp. exact) overflow bound of the reciprocal function if it is the minimal floating-point (resp. real) number which verifies: $ \circ(\frac{1}{b})$ does not overflow.

Section 7.4 of the IEEE-754 standard defines overflow as: "The overflow exception shall be signaled if and only if the destination format’s largest finite number is exceeded in magnitude by what would have been the rounded floating-point result (...) were the exponent range unbounded." This means that $b$ should depend on the rounding mode.

Let us first consider $b_{RNE}$, the bound associated with the round-to-nearest (tie-to-even) mode, defined as the maximal $x$ such that: $ \frac{1}{x} \ge B_{RNE} = \omega + \frac{1}{2} ulp(\omega) $.
$B_{RNE}$ is the mid-point between the largest floating-point number $\omega$ and its successor under an unbounded exponent range, the first value which is rounded to $+\infty$ as per Section 7.4.
Note that the inequality does not need to be strict: since $\omega$ has an odd mantissa ($1 \dots 1$), if $\frac{1}{x} = B_{RNE}$ it is rounded to a value greater than $\omega$ because of the tie-to-even rule, and thus overflows.
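The tie-to-even argument relies on $\omega$ having an odd stored significand. This can be checked directly on the binary32 encoding with a small standalone Python snippet (not part of the pythonsollya session below):

```python
import struct

# omega = (2^24 - 1) * 2^104, exactly representable in binary64
omega = (2**24 - 1) * 2.0**104

# reinterpret the binary32 encoding of omega as an unsigned integer
bits = struct.unpack('<I', struct.pack('<f', omega))[0]

# the low 23 bits are the stored significand: all ones (0x7FFFFF), hence odd,
# so a tie at B_RNE rounds away from omega, i.e. overflows
significand = bits & 0x7FFFFF
assert significand == 0x7FFFFF and significand % 2 == 1
```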

With $p$ the format precision ($p = 24$ for the binary32 format) and $emax$ the maximal exponent ($emax = 127$ for binary32), and noting that $B_{RU} = \omega$ (in round-up mode any result strictly greater than $\omega$ overflows to $+\infty$):
$ b^*_{RNE} = \frac{1}{B_{RNE}} = \frac{1}{\omega + 2^{emax - p}} $
$ b^*_{RU} = \frac{1}{B_{RU}} = \frac{1}{\omega} $
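Before turning to pythonsollya, these two exact bounds can be cross-checked with stdlib rational arithmetic. The following Python sketch (variable names are mine) verifies that both lie strictly between $2^{-128}$ and its binary32 successor $2^{-128} + 2^{-149}$:

```python
from fractions import Fraction

p, emax = 24, 127                                   # binary32 parameters
omega = (2**p - 1) * Fraction(2)**(emax - p + 1)    # largest finite binary32

B_RNE = omega + Fraction(2)**(emax - p)             # omega + 1/2 ulp(omega)
B_RU  = omega

b_star_RNE = 1 / B_RNE                              # exact rational bounds
b_star_RU  = 1 / B_RU

lo = Fraction(2)**-128                              # 2^-128 (a binary32 subnormal)
hi = lo + Fraction(2)**-149                         # its binary32 successor

# both exact bounds fall strictly between two consecutive binary32 numbers
assert lo < b_star_RNE < hi
assert lo < b_star_RU  < hi
```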

Evaluating bound values

Let us compute approximations to $b^*_{RNE}$ and $b^*_{RU}$ using pythonsollya:

# importing pythonsollya
>>> import sollya
# forcing display setting to hexadecimal value
# (easier to read)
>>> sollya.settings.display = sollya.hexadecimal
# building 2.0 as a Sollya Object
>>> S2 = sollya.SollyaObject(2)
# building binary32's omega value (greatest floating-point number)
>>> omega = S2**127 * (S2**24 - 1) * S2**-23
>>> omega
0x1.fffffep127
>>> B_RU = omega
# omega + 1/2 ulp(omega)
>>> B_RNE = omega  + S2**103
# rounding 1 / B_RU with 100-bit precision
>>> sollya.round(1 / B_RU, 100, sollya.RD)
0x1.000001000001000001000001p-128
# rounding 1 / B_RNE with 100-bit precision
>>> sollya.round(1 / B_RNE, 100, sollya.RD)
0x1.0000008000004000002p-128
# rounding 1 / B_RNE to binary32 precision
>>> sollya.round(1 / B_RNE, sollya.binary32, sollya.RN)
0x1p-128
# rounding 1 / B_RU to binary32 precision
>>> sollya.round(1 / B_RU, sollya.binary32, sollya.RN)
0x1p-128

The first interesting values are the two 100-bit correctly rounded approximations of $b^*_{RU} = \frac{1}{B_{RU}}$ and $b^*_{RNE} = \frac{1}{B_{RNE}}$. Let us first notice that those two values are strictly greater than their counterparts rounded to binary32 (both equal to 0x1p-128).
Let us call $c_- = 2^{-128}$ and $c_+ = next(c_-) = 2^{-128} + 2^{-149}$. We have $c_- \lt b^*_{RU} \lt c_+$ and $c_- \lt b^*_{RNE} \lt c_+$, which means $\frac{1}{c_-} \gt \frac{1}{b^*_{RNE}} \gt \frac{1}{c_+}$ and $\frac{1}{c_-} \gt \frac{1}{b^*_{RU}} \gt \frac{1}{c_+}$; in particular, $\forall x \le c_-$, $\frac{1}{x} \ge \frac{1}{c_-} \gt B_{RNE}$.

Claim 1: $ \forall mode, \circ_{mode}(\frac{1}{c_+}) $ is a normal value.
Claim 2: $ \forall mode, \circ_{mode}(\frac{1}{c_-}) $ overflows.
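These two claims can be partially checked without Sollya, for the round-to-nearest mode only, by casting a binary64 result down to a C float through ctypes. This is a stdlib sketch relying on the hardware's default rounding mode, not the article's pythonsollya method:

```python
import ctypes
import math

c_minus = 2.0**-128                 # exact in binary64
c_plus  = 2.0**-128 + 2.0**-149     # exact in binary64

def to_binary32(x):
    """Round a Python float (binary64) to binary32 via a C float cast
    (uses the current hardware rounding mode, normally round-to-nearest)."""
    return ctypes.c_float(x).value

# Claim 2, round-to-nearest case: 1/c_- = 2^128 overflows binary32
assert math.isinf(to_binary32(1.0 / c_minus))

# Claim 1, round-to-nearest case: 1/c_+ stays finite in binary32
assert math.isfinite(to_binary32(1.0 / c_plus))
```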

>>> c_minus = S2**-128
>>> c_plus = S2**-128 + S2**-149
>>> c_minus
0x1p-128
>>> c_plus
0x1.000008p-128
>>> for rnd_mode in [sollya.RN, sollya.RU, sollya.RD, sollya.RZ]:
...     print(sollya.round(1 / c_minus, sollya.binary32, rnd_mode), sollya.round(1 / c_plus, sollya.binary32, rnd_mode))
infty 0x1.fffffp127
infty 0x1.fffff2p127
0x1.fffffep127 0x1.fffffp127
0x1.fffffep127 0x1.fffffp127

$c_+$ is greater than both $b^*_{RU}$ and $b^*_{RNE}$, and $c_-$ is strictly lower than both of them. Thus, as there is no floating-point number between $c_-$ and $c_+$, we can choose $c_-$ as the overflow bound (included) or $c_+$ (excluded). There is no need to distinguish several values of the bound based on the rounding mode: $\forall \ x \ge c_+$, $\frac{1}{x}$ will not overflow and $\forall \ x \le c_-$, $\frac{1}{x}$ will overflow in any rounding mode.
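A practical consequence is that reciprocal overflow in binary32 can be predicted by a single integer comparison on the input's encoding, since $c_-$ is itself a binary32 subnormal (encoding 0x00200000). The helper below is an illustrative sketch for positive inputs (names are mine); zero is excluded, as $\frac{1}{0}$ raises division-by-zero rather than overflow:

```python
import struct

C_MINUS_BITS = 0x00200000   # encoding of c_- = 2^-128 (a binary32 subnormal)

def reciprocal_overflows(x):
    """True iff rounding 1/x overflows in every rounding mode,
    for x a positive value exactly representable in binary32."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return 0 < bits <= C_MINUS_BITS

assert reciprocal_overflows(2.0**-128)                   # x = c_-: overflows
assert not reciprocal_overflows(2.0**-128 + 2.0**-149)   # x = c_+: does not
```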

Let us now consider the remaining rounding modes (rounding down and towards zero). Without loss of generality we can reduce the study to positive overflow and thus to a single rounding mode, let us say rounding down.
$B_{RD} = next(\omega) = 2^{128} $ (assuming unbounded exponent range)


>>> B_RD = S2**128
>>> b_RD = S2**-128
# 1 / B_RD is exactly representable: it equals c_minus in both directions
>>> sollya.round(1 / B_RD, sollya.binary32, sollya.RD)
0x1p-128
>>> sollya.round(1 / B_RD, sollya.binary32, sollya.RU)
0x1p-128
# 1 / c_minus overflows in RD: the result saturates to omega
>>> sollya.round(1 / c_minus, sollya.binary32, sollya.RD)
0x1.fffffep127
# 1 / c_plus rounds down to a value strictly below omega
>>> sollya.round(1 / c_plus, sollya.binary32, sollya.RD)
0x1.fffffp127

$\frac{1}{B_{RD}}$ is exactly equal to the floating-point number $c_-$, and $\frac{1}{c_-}$ will indeed lead to an overflow even in round-down mode. This overflow returns the value $\omega$ as output.
$\circ_{RD}(\frac{1}{c_+}) = \omega - 2^{104} - 2^{105} - 2^{106}$, which is a normal number strictly lower than $\omega$. Thus we get the same behavior as for $RNE$ and $RU$: $c_-$ is the overflow bound (included) and $c_+$ is the excluded overflow bound $\square$.
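As a final cross-check, the value of $\circ_{RD}(\frac{1}{c_+})$ stated above can be recomputed with exact rational arithmetic: since $\frac{1}{c_+}$ lies in the binade $[2^{127}, 2^{128})$ where the binary32 ulp is $2^{104}$, rounding down amounts to flooring onto that grid. A stdlib Python sketch (variable names are mine):

```python
from fractions import Fraction

omega  = (2**24 - 1) * Fraction(2)**104          # largest finite binary32
c_plus = Fraction(2)**-128 + Fraction(2)**-149

exact = 1 / c_plus                 # exact reciprocal, a rational number
ulp = Fraction(2)**104             # binary32 ulp in the binade [2^127, 2^128)

# RD in binary32 here: floor the exact result onto the binade's grid
rd = (exact // ulp) * ulp

assert rd == omega - 2**104 - 2**105 - 2**106    # matches the text
assert rd < omega                                # strictly below omega: no overflow
```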

References:

  • Sollya: a tool for manipulating arbitrary-precision numbers, computing approximation polynomials ... (Sollya's website)
  • pythonsollya: a Python wrapper for Sollya (pythonsollya's GitLab)
  • Quick introduction to the IEEE-754 floating-point arithmetic standard (web)