Programming with RISC-V Vector extension: how to build and execute a basic RVV test program (emulation/simulation)

Update(s):

- Jan 16th, 2022: added a section on objdump


RISC-V Vector Extension (RVV) has recently been ratified in its version 1.0 (announcement and specification pdf). The 1.0 milestone is key: it means RVV has reached a stable state. Numerous commercial and free implementations of the standard are appearing, and software developers can now dedicate significant effort to porting and developing libraries on top of RVV without fear of having the specification rug pulled out from under their feet. In this article we will review how to build a version of the clang compiler compatible with RVV (v0.10) and how to develop, build and execute our first RVV program.

Building the compiler

Before building a compiler for RVV we need a basic riscv toolchain. This toolchain will provide the standard library and some basic tools required to build a functioning binary. The toolchain will be installed under the directory pointed to by the RISCV environment variable (feel free to adapt it to your setup).

# update to the your intended install directory
export RISCV=~/RISCV-TOOLS/

# downloading basic riscv gnu toolchain, providing:
# - runtime environment for riscv64-unknown-elf (libc, ...)
# - spike simulator
git clone https://github.com/riscv-collab/riscv-gnu-toolchain
cd riscv-gnu-toolchain
./configure --prefix=$RISCV
make -j3

Compiling for RVV requires a recent version of clang (this was tested with clang 14).

# downloading llvm-project source from github
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
# configuring build to build llvm and clang in Release mode
# using ninja
# and to use gold as the linker (less RAM required)
# limiting targets to RISCV, and using riscv-gnu-toolchain
# as basis for sysroot
cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;" \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_USE_LINKER=gold \
      -DDEFAULT_SYSROOT="$RISCV/riscv64-unknown-elf/" \
      -DGCC_INSTALL_PREFIX="$RISCV" \
      -S llvm -B build-riscv/ -DLLVM_TARGETS_TO_BUILD="RISCV"
# building clang/llvm using 4 jobs (can be tuned to your machine)
cmake --build build-riscv/ -j4


Building clang/llvm requires a large amount of RAM (8GB seems to be the bare minimum, 16GB is better) and will consume a lot of disk space. Those requirements can be reduced by selecting the Release build type (rather than the default Debug) and by using the gold linker, as done in the configuration above.

This process will generate the clang binary in llvm-project/build-riscv/bin/clang.

More information on how to download and build clang/llvm can be found on the project github page.
Development

The easiest way to develop software directly for RVV is to rely on the rvv intrinsics. This project offers intrinsics for most of the instructions of the extension. The documentation is accessible on github and support is appearing in standard compilers (most notably clang/llvm).

As a first exercise, we will use the SAXPY example rvv_saxpy.c from rvv-intrinsic-doc.
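For reference, the computational kernel of that example takes the following shape. This is a condensed sketch assuming the v0.10-era intrinsic names used by rvv-intrinsic-doc (vsetvl_e32m8, vle32_v_f32m8, ...); later versions of the intrinsics prefix these names with __riscv_:

#include <stddef.h>
#include <riscv_vector.h>

/* y[i] = a * x[i] + y[i], strip-mined with vsetvl: each pass processes
 * as many elements (vl) as the hardware grants for 32-bit elements with
 * LMUL=8 register grouping. */
void saxpy_rvv(size_t n, float a, const float *x, float *y) {
    for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
        vl = vsetvl_e32m8(n);                   /* elements this pass */
        vfloat32m8_t vx = vle32_v_f32m8(x, vl); /* load chunk of x */
        vfloat32m8_t vy = vle32_v_f32m8(y, vl); /* load chunk of y */
        vy = vfmacc_vf_f32m8(vy, a, vx, vl);    /* vy += a * vx */
        vse32_v_f32m8(y, vy, vl);               /* store result */
    }
}

The strip-mining structure (asking vsetvl for a legal vector length on every iteration) is what makes such code portable across implementations with different vector register widths.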

Building the simulator

Let's first build an up-to-date proxy kernel pk:

# downloading and install proxy-kernel for riscv64-unknown-elf
git clone https://github.com/riscv-software-src/riscv-pk.git
cd riscv-pk
mkdir build && cd build
../configure --prefix=$RISCV --host=riscv64-unknown-elf
make -j4 && make install

Let's now build the simulator (directly from the top of the master branch, why not!).

git clone https://github.com/riscv-software-src/riscv-isa-sim
cd riscv-isa-sim
mkdir build && cd build
../configure --prefix=$RISCV
make -j4 && make install

Building the program

RVV is supported as part of the experimental extensions of clang. Thus it must be enabled explicitly when invoking clang, and it must be associated with a version number: the current master of clang only supports v0.10 of the RVV specification.

clang -L $RISCV/riscv64-unknown-elf/lib/ --gcc-toolchain=$RISCV/ \
       rvv_saxpy.c -menable-experimental-extensions -march=rv64gcv0p10 \
      -target riscv64 -O3 -mllvm --riscv-v-vector-bits-min=256 \
       -o test-riscv-clang

Executing

To execute the program we are going to use the spike simulator and the riscv-pk proxy kernel.

Spike is part of the riscv-gnu-toolchain available at https://github.com/riscv-collab/riscv-gnu-toolchain ; riscv-pk is also available on github: https://github.com/riscv-software-src/riscv-pk .

The binary image of pk must be the first unnamed argument to spike, before the main elf:

$RISCV/bin/spike --isa rv64gcv $RISCV/riscv64-unknown-elf/bin/pk \
                  test-riscv-clang

NOTES: I also tried riscv-tools (https://github.com/riscv-software-src/riscv-tools), but it does not seem actively maintained and several issues popped up when I tried building it.

Objdump

Not every objdump build supports the RISC-V vector extension. If you have built llvm as indicated above, you should be able to use the llvm-objdump binary built alongside clang to disassemble a program containing vector instructions:

llvm-objdump -d --mattr=+experimental-v <binary_file>


Assisted assembly development for RISC-V RV32

In this post we will present how the assembly development environment tool (asmde) can ease assembly program development for the RISC-V ISA.

We will develop a basic floating-point vector add routine.

Introducing ASMDE

The ASseMbly Development Environment (asmde, https://github.com/nibrunie/asmde) is an open-source set of Python utilities to help the assembly developer. The main eponymous utility, asmde, is a register assignment script. It consumes a templatized assembly source file and fills in variable names with legal registers, removing the burden of register allocation from the developer.

Recently, alpha support for RV32 (the 32-bit version of RISC-V) was added to asmde. We are going to demonstrate how to use it in this post.

Vector-Add testbench

The example we chose to implement is a basic vector add.

/** Basic single-precision vector add
 *  @param dst destination array
 *  @param lhs left-hand side operand array
 *  @param rhs right-hand side operand array
 *  @param n vector size
 */
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n);

The program is split in two files:

- a test bench main.c

- an asmde template file vec_add.template.S

Review of the assembly template

The listing below presents the input template. It consists of a basic assembly source file extended with some asmde-specific constructs.

// testing for basic RISC-V RV32I program
// void vector_add(float* dst, float* src0, float* src1, unsigned n)
//#PREDEFINED(a0, a1, a2, a3)
        .option nopic
        .attribute arch, "rv32i2p0_m2p0_a2p0_f2p0_d2p0"
        .attribute unaligned_access, 0
        .attribute stack_align, 16
        .text
        .align  1
        .globl  my_vadd
        .type   my_vadd, @function
my_vadd:
        // check for early exit condition n == 0
        beq a3, x0, end
loop:
        // load inputs
        flw F(LHS), 0(a1)
        flw F(RHS), 0(a2)
        // operation
        fadd.s F(ACC), F(LHS), F(RHS)
        // store result
        fsw F(ACC), 0(a0)
        // update addresses
        addi a1, a1, 4
        addi a2, a2, 4
        addi a0, a0, 4
        // update loop count
        addi a3, a3, -1
        // branch if not finished
        bne x0, a3, loop
end:
        ret
        .size   my_vadd, .-my_vadd
        .section        .rodata.str1.8,"aMS",@progbits,1


ASMDE Macro

The leading comments are followed by an asmde macro, PREDEFINED.
This macro indicates to the asmde register allocator that the registers listed as arguments should be considered live when entering the function. It is often used to list function arguments.


ASMDE Variable

The second construct provided by asmde is assembly variables:
                flw F(LHS), 0(a1)
                flw F(RHS), 0(a2)
                // operation
                fadd.s F(ACC), F(LHS), F(RHS)
                // store result
                fsw F(ACC), 0(a0)

Those variables are of the form <specifier>(<varname>). In this example we use the specifier F for floating-point register variables; the specifiers X or I can be used for integer registers. These variables are used to manipulate (write to / read from) virtual registers: asmde performs the register assignment, taking into account the instruction semantics and the program structure.
Here, for example, we use the variable F(LHS) to load an element of the left-hand side vector, F(RHS) to load an element of the right-hand side vector, and F(ACC) to hold the sum of those two variables, which is later stored back into the destination array (asmde might, for instance, assign ft0, ft1 and ft2 to these three variables).


Assembly template translation

asmde can be invoked as follows to generate an assembly file with assigned registers:

python3 asmde.py -S --arch rv32 \
                 examples/riscv/test_rv32_vadd.S \
                --output vadd.S

Building and executing the test program

We can build our toy example alongside a small testbench:
#include <stdio.h>

#ifdef LOCAL_IMPLEMENTATION
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n){
    unsigned i;
    for (i = 0; i < n; ++i)
        dst[i] = lhs[i] + rhs[i];
}
#else
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n);
#endif


int main() {
    float dst[4];
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {4.0f, 3.0f, 2.0f, 1.0f};
    my_vadd(dst, a, b, 4);

    int i;
    for (i = 0; i < 4; ++i) {
        if (dst[i] != 5.0f) {
            printf("failure\n");
            return -1;
        }
    }

    printf("success\n");
    return 0;
}

And finally execute it (this requires an rv32 gnu toolchain and a 32-bit build of the proxy kernel pk):

# building test program
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -o test_vadd vadd.S test_vadd.c
# executing binary
$ spike --isa=RV32gc riscv32-unknown-elf/bin/pk  ./test_vadd


Conclusion

I hope this small example was useful to you and that you will be able to use asmde in your own project.
If you find issues (there are certainly some), you can report them on github: https://github.com/nibrunie/asmde/issues/new/choose . If you have feedback, do not hesitate to write a comment here.

Happy hacking with RISC-V.


References:

- asmde github page: https://github.com/nibrunie/asmde

- RISC-V unprivileged ISA specification

- GNU Toolchain for RISC-V

- Programming with RISC-V vector instructions

A brief history of open source libms (mathematical libraries)

In this article, we try to provide a brief history of the C mathematical library and its various implementations (focusing on open-source variants for now).

The C libm seems to be the de facto standard for mathematical functions.

 

Libm specification?

Some information can be found in the ISO standards, for example in the C99 standard (ISO/IEC 9899:[2011|1999|1990]) [6]. The main libm header, namely math.h, is described in Section 7.12 of this standard.

Section 5.2.4.2.2 covers the characteristics of floating types. In particular it states:

“The accuracy of the floating-point operations (+, -, *, /) and of the library functions in <math.h> and <complex.h> that return floating-point results is implementation-defined.” This means that there is no requirement on operation or function accuracy: not even faithful rounding is required!


Section 6.5-§8 of the standard covers the possibility of expression contraction. It states the following:

“A floating expression may be contracted, that is, evaluated as though it were an atomic operation, thereby omitting rounding errors implied by the source code and the expression evaluation method. The FP_CONTRACT pragma in <math.h> provides a way to disallow contracted expressions. Otherwise, whether and how expressions are contracted is implementation-defined.”

This seems to indicate that function composition simplification, a.k.a. function expression contraction, is allowed.
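As a minimal illustration, the pragma below (standard since C99) forbids the compiler from fusing the multiply and add into a single FMA; with the pragma removed or set to ON, contraction becomes permitted and the intermediate rounding of a * b may be omitted:

#include <math.h>

#pragma STDC FP_CONTRACT OFF

/* With contraction disabled, a * b + c must be evaluated as a rounded
 * multiplication followed by a rounded addition, never as a single
 * fused multiply-add. */
double dot_step(double a, double b, double c) {
    return a * b + c;
}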


The standard's Annex F.9 (which is normative) covers the list of functions and indicates which behaviours are expected in special cases.


Online libm documentation is available in [7, 8].


The libm is used as a basis for various other languages, e.g. C++ (https://en.cppreference.com/w/cpp/header/cmath).



Libm implementation

Most libm implementations originate either from fdlibm or from IBM's libultim [4].


OpenLibm: (https://openlibm.org/, https://github.com/JuliaMath/openlibm )

Supported by the JuliaProject

The OpenLibm code derives from the FreeBSD msun and OpenBSD libm implementations, which in turn derive from FDLIBM 5.3. Over and above that, OpenLibm itself has received a number of patches to make it platform independent and portable.


FDLibm: Freely Distributable Libm (http://www.netlib.org/fdlibm/)

Version 5.3, developed at Sun Microsystems, Inc.


Newlib (https://sourceware.org/newlib/)

C libraries for embedded systems


CRlibm: https://gforge.inria.fr/scm/browser.php?group_id=5929&extra=crlibm

Correctly rounded library (error less than or equal to half an ulp)


Glibc’s libm (https://sourceware.org/glibc/wiki/libm)

https://sourceware.org/git?p=glibc.git;a=tree;f=math;hb=HEAD


MUSL’s libm (https://git.musl-libc.org/cgit/musl/tree/src/math)


Sleef https://sleef.org/

Focus on fast SIMD implementations


APMathLibm, Libultim (https://www.math.utah.edu/cgi-bin/man2html.cgi?/usr/local/man/man3/libultim.3)


Sun Microsystems libMCR https://github.com/simonbyrne/libmcr

http://www.math.utah.edu/cgi-bin/man2html.cgi?/usr/local/man/man3/libmcr.3


The standard libm provides the most common mathematical functions. There are extensions for “special” functions: Mathematical Special Functions: https://en.cppreference.com/w/cpp/numeric/special_functions


Open questions:

  • How much code came from the original Sun Microsystems development?
  • What are the libraries' performances?
  • What are the libraries' accuracies? [crlibm, libultim and libmcr seem to fall in the correctly rounded category]
  • What are the libraries' specificities?

Intel MKL is another reference (though not open source and not portable).


Bibliography

Introduction to polynomial approximation (WiP)

The goal is to find the best polynomial to approximate a given function $f$ on a given interval $I$.
“Best” generally means the polynomial which minimizes the error (absolute or relative) with respect to the input function. This error is often measured in the $L^\infty$ norm (the maximum of the absolute difference between the function and the approximating polynomial).
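Written out, the $L^\infty$ error of a polynomial $P$ on $I$ is

$$ \| f - P \|_\infty = \max_{x \in I} | f(x) - P(x) | $$

and the minimax polynomial of degree $n$ is the $P$ minimizing this quantity.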

Remez:

   The Remez algorithm finds the solution of a linear system of $n+2$ equations, $\sum_i a_i x_j^i + (-1)^j \epsilon = f(x_j)$ for $j \in [0, n+1]$. It starts by setting the $x_j$ values to the extrema of the degree $n+1$ Chebyshev polynomial (mapped to the input interval through affine scaling). Solving the linear system provides a first set of $a_i$ coefficients. Then the extrema of $|P - f|$ are found and used as the new set of $x_j$, and the process is iterated. Theoretically, the process stops when the approximation errors at the $x_j$ are all equal and of alternating signs; the returned $P$ is then the minimax approximation polynomial.
During the iteration, a reachable lower bound for the approximation error is the minimum of the absolute values of the extrema of the difference $P - f$ (de la Vallée Poussin theorem).
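In matrix form, each iteration solves the following system for the coefficients $a_i$ and the levelled error $\epsilon$, using the current set of points $x_0 < x_1 < \dots < x_{n+1}$:

$$
\begin{pmatrix}
1 & x_0 & \dots & x_0^n & +1 \\
1 & x_1 & \dots & x_1^n & -1 \\
\vdots & & & & \vdots \\
1 & x_{n+1} & \dots & x_{n+1}^n & (-1)^{n+1}
\end{pmatrix}
\begin{pmatrix}
a_0 \\ a_1 \\ \vdots \\ a_n \\ \epsilon
\end{pmatrix}
=
\begin{pmatrix}
f(x_0) \\ f(x_1) \\ \vdots \\ f(x_{n+1})
\end{pmatrix}
$$

The alternating last column encodes the equioscillation constraint $P(x_j) - f(x_j) = \pm\epsilon$.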

Using Euclidean Lattices

Educational example of polynomial approximation: https://github.com/nibrunie/YourOwnPersonnalRemez

References

- Efficient L-inf polynomial approximations (https://hal.archives-ouvertes.fr/inria-00119513v1)

Testing metalibm from the demo Docker image

A docker image is available with a demonstration version of metalibm:
docker run -ti nibrunie/metalibm:demo

The basic metalibm commands can be tested within this image:
cd /home/metalibm
export PYTHONPATH=$PWD:$PYTHONPATH
export ML_SRC_DIR=$PWD
python3 metalibm_functions/ml_exp.py --auto-test 1000 --execute

The Dockerfile used to build this image is:
FROM ubuntu:latest

# updating repo list
RUN apt-get update

# installing basic packages (build, repo management, ...)
RUN apt-get install -y build-essential
RUN apt-get install -y git
RUN apt-get install -y wget

# install sollya's dependencies
RUN apt-get install -y libmpfr-dev libmpfi-dev libfplll-dev libxml2-dev

RUN apt-get install -y python3 python3-setuptools libpython3-dev python3-pip

# install sollya from its current weekly release (which implements sollya_lib_obj_is_external_data,
# contrary to the sollya 7.0 release)
WORKDIR  /home/
RUN wget http://sollya.gforge.inria.fr/sollya-weekly-08-18-2019.tar.gz
RUN tar -xzf sollya-weekly-08-18-2019.tar.gz
WORKDIR /home/sollya-weekly-08-18-2019/
RUN mkdir -p /app/local/python3/
RUN ./configure --prefix /app/local
RUN make -j8
RUN make install


# installing pythonsollya
WORKDIR /home/
ENV PATH=/app/local/bin:$PATH
ENV SOLLYA_DIR=/app/local/
ENV PREFIX=/app/local/
ENV INSTALL_OPTIONS="-t /app/local/"

# installing pythonsollya from gitlab version
RUN git clone https://gitlab.com/metalibm-dev/pythonsollya /home/new_pythonsollya/

# for python 3
WORKDIR /home/new_pythonsollya/
RUN make SOLLYA_DIR=/app/local/ PREFIX=/app/local/ INSTALL_OPTIONS="-t /app/local/python3/" PYTHON=python3 PIP=pip3 install

# checking pythonsollya install
RUN apt-get install -y python3-six
RUN LD_LIBRARY_PATH="/app/local/lib" python3 -c "import sollya"

# installing gappa
WORKDIR /home/
RUN apt-get install -y libboost-dev
RUN wget https://gforge.inria.fr/frs/download.php/file/37624/gappa-1.3.3.tar.gz
RUN tar -xzf gappa-1.3.3.tar.gz
WORKDIR /home/gappa-1.3.3/
RUN ./configure --prefix=/app/local/
RUN ./remake
RUN ./remake install

# downloading metalibm
WORKDIR /home/
RUN git clone https://github.com/kalray/metalibm.git
WORKDIR /home/metalibm/

# testing metalibm install
ENV LD_LIBRARY_PATH=/app/local/lib/
ENV PYTHONPATH=/app/local/python3/
RUN PYTHONPATH=$PWD:$PYTHONPATH  ML_SRC_DIR=$PWD python3 valid/soft_unit_test.py

Computing floating-point division from an accurate reciprocal [WiP]

This article addresses the problem of computing $\circ(\frac{x}{y})$ from $\circ(\frac{1}{y})$, targeting single and double floating-point precision.

New method description


The method we use is based on the end of a Newton-Raphson division iteration [1, 2]. The iteration starts from an accurate approximation $r$ of $\frac{1}{y}$ and from $a_0 = \circ(x \times r)$.

The absolute error of $a_0$ is $\delta_0$ and its relative error is $\epsilon_0$:
$\delta_0 = | \frac{x}{y} - \circ(x \times r)|$
$\delta_0 = | \frac{x}{y} - x \times \frac{1}{y} (1 + \epsilon_r) (1 +\epsilon_{mult}) |$
$\delta_0 = | \frac{x}{y} | \times | \epsilon_r + \epsilon_{mult} + \epsilon_r . \epsilon_{mult} | $
$\epsilon_0 = \frac{\delta_0}{|\frac{x}{y}|} \le | \epsilon_r | + | \epsilon_{mult} | + | \epsilon_r . \epsilon_{mult} |$

$r$ is assumed to be a correctly rounded approximation of $\frac{1}{y}$ and $a_0$ is also implemented as a correctly rounded multiplication, so $\epsilon_{mult}$ and $\epsilon_r$ are bounded by $2^{-53}$, which means:

$\epsilon_0 \le 2^{-52} + 2^{-106}$: $a_0$'s error is slightly larger than one $ulp$. Our goal is to reach a final error below one half of one $ulp$ (i.e. a correctly rounded result).

Our method is based on the following (standard) iteration:
$ e_1 = x - y \times a_0 $
$ a_1 = a_0 + e_1 \times r $
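Informally, assuming $e_1$ is computed exactly (which the fused multiply-add makes possible under mild conditions, see [1]) and substituting $r = \frac{1}{y}(1 + \epsilon_r)$, one can see why a single step suffices:

$ e_1 = x - y \times a_0 = y \times (\frac{x}{y} - a_0) $
$ a_1 = a_0 + e_1 \times r = a_0 + (\frac{x}{y} - a_0)(1 + \epsilon_r) = \frac{x}{y} + (\frac{x}{y} - a_0) \times \epsilon_r $

so before the final rounding the relative error is about $\epsilon_0 \times \epsilon_r \approx 2^{-105}$, comfortably below the half-ulp target (the rigorous argument, including the effect of the final rounding, is developed in [1, 2]).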

A piece of pythonsollya code available at the end of this article can help measure the accuracy of this method in a few random test cases.

Without special care this method is not valid for all inputs:

  • $r$ overflows for small enough $y$ (in binary32, $\frac{1}{y}$ rounds to $+\infty$ once $y \le 2^{-128}$)
  • computing $r$ raises DivByZero when $y = \pm 0$, even when $x$ is not a canonical number (a case where $\frac{x}{y}$ itself would not raise it)

Special cases management


Once the numerical behavior is proven, we can focus on the behavior in special cases.

The table below summarizes the behavior: the first four columns give $x$, $y$ and the reference results $\frac{1}{y}$ and $\frac{x}{y}$; the last three columns give the steps of the new method. DBZ and IO denote the DivByZero and Invalid Operation flags; (A) marks the case discussed below.

| $x$      | $y$      | $\frac{1}{y}$ | $\frac{x}{y}$ | $a_0$      | $e_1$      | $a_1$      |
|----------|----------|---------------|---------------|------------|------------|------------|
| $0$      | $0$      | $\infty$ DBZ  | $qNaN$ IO     | $qNaN$ IO  | $qNaN$     | $qNaN$     |
| $0$      | Num      | Num           | $0$           | $0$        | $0$        | $0$        |
| $0$      | $\infty$ | $0$           | $0$           | $0$        | $0$        | $0$        |
| $0$      | $qNaN$   | $qNaN$        | $qNaN$        | $qNaN$     | $qNaN$     | $qNaN$     |
| $\infty$ | $0$      | $\infty$ DBZ  | $\infty$      | $\infty$   | $qNaN$ IO  | $qNaN$     |
| $\infty$ | Num      | Num           | $\infty$      | $\infty$   | $qNaN$ IO  | $qNaN$ IO  |
| $\infty$ | $\infty$ | $0$           | $0$           | $0$        | $0$        | $0$        |
| $\infty$ | $qNaN$   | $qNaN$        | $qNaN$        | $qNaN$     | $qNaN$     | $qNaN$     |
| Num      | $0$      | $\infty$ DBZ  | $\infty$ DBZ  | $\infty$   | $qNaN$ IO  | $qNaN$     |
| Num      | Num      | Num (A)       | Num (A)       | Num        | Num        | Num        |
| Num      | $\infty$ | $0$           | $0$           | $0$        | $0$        | $0$        |
| Num      | $qNaN$   | $qNaN$        | $qNaN$        | $qNaN$     | $qNaN$     | $qNaN$     |
| $NaN$    | $0$      | $\infty$ DBZ  | $\infty$ DBZ  | $qNaN$     | $qNaN$     | $qNaN$     |
| $NaN$    | Num      | Num (A)       | $qNaN$        | $qNaN$     | $qNaN$     | $qNaN$     |
| $NaN$    | $\infty$ | $0$           | $qNaN$        | $qNaN$     | $qNaN$     | $qNaN$     |
| $NaN$    | $qNaN$   | $qNaN$        | $qNaN$        | $qNaN$     | $qNaN$     | $qNaN$     |

As illustrated by the table above, most cases behave similarly whether we compute $\frac{x}{y}$ as a single floating-point operation or first compute $\frac{1}{y}$ and then use the iteration. Here behaving similarly means returning the same result and raising the same exception flags.
   However, there are a few cases where a spurious flag can be raised or an incorrect result returned: for example the rows with $y = 0$ and $x \neq 0$, where $\frac{x}{y}$ is $\infty$ while the method ends with $qNaN$. We have to carefully consider the flags raised during each step of the new method ($\circ(\frac{1}{y})$, $a_0$, $e_1$, $a_1$), as those sticky flags will be accumulated during the computation.
    The case (A) requires specific attention. An overflow in $\frac{1}{y}$ can occur (with or without the IEEE overflow flag) while $\frac{x}{y}$ remains in the range of normal numbers. For example, if $y=2^{-149}$ (the smallest subnormal number) and $x = 2^{-126}$ (the smallest normal number), then $\circ_{RN}(\frac{1}{y}) = +\infty$ but $\circ_{RN}(\frac{x}{y}) = 2^{23}$.

To avoid intermediary overflows or underflows or spurious exceptions we modify the method as follows:

$ y', s = normalize(y) $ (silent)
$ x', t = normalize(x) $ (silent)
$ r = \circ(\frac{1}{y'})$ (silent)
$ - = fsdiv(x, y) $ (result discarded)
$ a_0 = \circ(x' \times r)$
$ e_1 = x' - y' \times a_0 $ (silent)
$ a_1 = a_0 + e_1 \times r $ (silent)
$ n, m = 1, 1 \text{ if } x' > y' \text{ else } 0.5, 2 $
$ R = (a_1 \times m) \times (n \times s \times t) $

Each operation annotated with (silent) does not raise IEEE754 flags or exceptions. For a number $y$, the normalize function returns the floating-point mantissa $y'$ of $y$ and the reciprocal $s$ of the scaling factor, such that $y'$ and $s$ are floating-point numbers, $y' = s \times y$, $1 \le y' < 2$, and $s$ is a power of 2.
If $y=\pm 0$, $y = \pm \infty$ or $y=NaN$, then $normalize(y) = y, 1.0$.
normalize does not raise any IEEE754 flags.

Assuming $x$ and $y$ are non-zero numbers, $1 \le |x'| < 2$ and $1 \le |y'| < 2$, which means $\frac{1}{2} < | \frac{x'}{y'} | < 2$.

The floating-point seed for division, $fsdiv(x, y)$, is a specific instruction which raises the special-case IEEE flags as if it computed $\circ(\frac{x}{y})$ but does not compute the actual result (nor detect and raise flags for numerical overflow or underflow). It can be implemented easily and cheaply (and already exists in some architectures).
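As an illustration, here is a minimal C sketch of normalize for binary64, built on the standard frexp/ldexp functions. It tracks the scaling factor as an integer exponent, which sidesteps the fact that $s$ itself can fall outside the target format's range (e.g. $2^{149}$ for binary32), and it deliberately ignores the silent requirement: a production version would manipulate the binary encoding directly to guarantee that no flags are raised.

#include <math.h>

/* Sketch: returns y' with 1 <= |y'| < 2 and sets *k such that
 * y' = y * 2^k (conceptually s = 2^k). Zero, infinities and NaN
 * pass through with k = 0. */
static double normalize(double y, int *k) {
    if (y == 0.0 || isinf(y) || isnan(y)) {
        *k = 0;
        return y;
    }
    int e;
    double m = frexp(y, &e); /* y = m * 2^e with 0.5 <= |m| < 1, exact */
    *k = 1 - e;              /* s = 2^(1-e) */
    return 2.0 * m;          /* y' = s * y, exact */
}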

Managing overflow and underflow

We must ensure that no loss of precision occurs during the intermediate iteration while the final result is correctly rounded (even if subnormal).
To implement this we rely on the normalized inputs $x'$ and $y'$ and on the normalization factors $s$ and $t$:
$s = 2^p$ and $t = 2^q$, with $e_{min} \le p, q \le e_{max}$ (e.g. $e_{min} = -149$, $e_{max}=127$ for the $fp_{32}$ format).
While $a_1$ remains within $[0.5, 2]$, $s \times t$ may overflow or underflow.
Thus we introduce the extra scaling factors $n$ and $m$, which bring $a_1 \times m$ into $[1, 2)$;
if $f = n \times s \times t$ overflows then $R$ overflows, and reciprocally, if it is less than the smallest normal number and $a_1$ is inexact, then $R$ underflows.


The following code, based on pythonsollya, implements a quick and dirty test to check the method error.


# -*- coding: utf-8 -*-
import sollya
import random

sollya.settings.display = sollya.hexadecimal

def sollya_fma(x, y, z, precision, rndmode=sollya.RN):
    """ implement fused multiply and add using sollya.
        The internal precision must be enough so that x * y
        is computed exactly """
    return sollya.round(x * y + z, precision, rndmode)

def sollya_mul(x, y, precision, rndmode=sollya.RN):
    """ implement multiply using sollya """
    return sollya.round(x * y, precision, rndmode)

def iteration(x, y, approx_div, approx_recp, precision=sollya.binary64):
    """ implementation of refinement iteration """
    e_n = sollya_fma(y, -approx_div, x, precision)
    r_n = sollya_fma(approx_recp, e_n, approx_div, precision)
    return r_n


def my_division(x, y, precision=sollya.binary64, rndmode=sollya.RN):
    """ full division routine """
    approx_recp = sollya.round(1.0 / y, precision, rndmode)
    approx_div = sollya_mul(approx_recp, x, precision)
    NUM_ITER = 1
    for i in range(NUM_ITER):
        approx_div = iteration(x, y, approx_div, approx_recp, precision)
    return approx_div


NUM_TEST = 1000000
PRECISION = sollya.binary64
RNDMODE = sollya.RN
num_error = 0

for i in range(NUM_TEST):
    input_value_x = sollya.round(random.random(), PRECISION, RNDMODE)
    input_value_y = sollya.round(random.random(), PRECISION, RNDMODE)
    expected = sollya.round(input_value_x / input_value_y, PRECISION, RNDMODE)
    result = my_division(input_value_x, input_value_y, PRECISION, RNDMODE)
    error = abs(expected - result)
    rel_error = abs(error / expected)
    if error != 0.0:
        num_error += 1
        print("{} / {} = {}, result is {}, error = {}".format(input_value_x, input_value_y, expected, result, error))


print("{} error(s) encountered".format(num_error))

References:

  • [1] Newton-Raphson Algorithms for Floating-Point Division Using an FMA, JM Muller et al. (pdf)
  • [2] Proving the IEEE Correctness of Iterative Floating-Point Square Root, Divide, and Remainder Algorithms, M Cornea-Hassegan (pdf)
  • [3] pythonsollya's gitlab (Python wrapper for the Sollya library)

Accelerating Erasure Coding with Bit Matrix Multiply


Introduction

In this article we will present a method for fast implementation of Erasure Coding primitives based on Galois Field arithmetic. The Jerasure library manual (pdf) is a good introduction to the erasure coding method considered in this article.

For our purpose it suffices to say that we have $r$ partial stripes. Each stripe consists of $k$ $w$-bit data words, and we want to compute the $m$ $w$-bit parity words that complete each stripe.

Each parity word is computed as a dot-product of the data stripe with a constant vector whose values depend on the index of the parity word we want to compute (those indices range from $0$ to $m-1$). The vector elements and dot-product results are in $GF(2^w)$.

Multiplication by a constant through vector by matrix multiplication


We will consider the primitive operation on a $w$-bit data word $D$, which must be multiplied by a $w$-bit constant $C$ to get a $w$-bit parity word $P$: $D \times C = P$

The method, due to Mastrovito (detailed in [5]), expands a multiplication by a constant, together with the modulo reduction, into a multiplication by a bit matrix as follows:
$$
\begin{pmatrix}
data word
\end{pmatrix}
\times
\begin{pmatrix}
C . X^0 [g] \\
C . X^1 [g] \\
\vdots \\
C . X^{w-1} [g]
\end{pmatrix}
=
\begin{pmatrix}
parity word
\end{pmatrix}
$$

The multiplicand matrix is built by expanding $C . X^i [g]$ into row $i$ of the matrix.
The matrix multiplication accumulates the partial products $d_i . (C \times X^i [g])$ to build the final result:
\begin{align}
 P & = D \times C [g] \\
    & = (\oplus_{i=0}^{w-1} d_i \times X^i) \times C  [g] \\
    & = (\oplus_{i=0}^{w-1} d_i . C \times X^i)   [g] \\
    & = \oplus_{i=0}^{w-1} d_i . (C \times X^i [g] )
\end{align}

Laid out fully, with $c_{i,j}$ denoting bit $j$ of the expanded row $C . X^i [g]$:

\begin{align*}
\begin{pmatrix} d_0 & d_1 & \dots & d_{w-1} \end{pmatrix}
\times
\begin{pmatrix}
c_{0,0} & c_{0,1} & \dots & c_{0,w-1} \\
c_{1,0} & c_{1,1} & \dots & c_{1,w-1} \\
\vdots  &         &       & \vdots \\
c_{w-1,0} & c_{w-1,1} & \dots & c_{w-1,w-1} \\
\end{pmatrix}
=
\begin{pmatrix} p_0 & p_1 & \dots & p_{w-1} \end{pmatrix}
\end{align*}
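To make the construction concrete, here is a small C sketch for $w = 8$: it expands a constant $C$ into its eight-row bit matrix (row $i$ being $C . X^i [g]$) and multiplies a data byte by that matrix with shifts and xors. The reduction polynomial $g = X^8 + X^4 + X^3 + X^2 + 1$ (0x11d) is only an example choice:

#include <stdint.h>

/* Build row i = C * X^i mod g, for i = 0..7, in GF(2^8). */
static void expand_constant(uint8_t C, uint8_t rows[8]) {
    uint8_t v = C;
    for (int i = 0; i < 8; i++) {
        rows[i] = v;
        /* multiply by X, then reduce modulo g when the degree overflows */
        v = (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
    }
}

/* P = D * C in GF(2^8): accumulate (xor) row i whenever bit i of D is set. */
static uint8_t gf_mul_by_matrix(uint8_t D, const uint8_t rows[8]) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        if ((D >> i) & 1)
            p ^= rows[i];
    return p;
}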

Multiplication parallelization

We have seen how to use vector-by-matrix multiplication to compute the product of a data word by a constant, yielding part of a parity word (we still have to accumulate those parts to get the actual parity word). We can extend the left-hand-side vector to a full matrix and implement multiple multiplications by the same expanded constant $C$, realizing a Single Instruction Multiple Data (SIMD) operation, as sketched below.
With a single operation implementing a multiplication between $w \times w$ bit matrices, we implement $w$ parallel multiplications by the constant $C$.
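Continuing the $w = 8$ sketch above, extending the left-hand side to eight data words gives the SIMD flavour of the operation: one expanded constant, eight simultaneous products.

/* Sketch (reusing expand_constant/gf_mul_by_matrix from above): multiply
 * 8 data bytes, seen as an 8x8 bit matrix, by the expanded constant matrix,
 * i.e. 8 parallel GF(2^8) multiplications by C. */
static void gf_mul8_by_matrix(const uint8_t D[8], const uint8_t rows[8],
                              uint8_t P[8]) {
    for (int j = 0; j < 8; j++)
        P[j] = gf_mul_by_matrix(D[j], rows);
}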

Erasure code operation


The macro operation to compute erasure coding parity is a matrix multiplication where the right-hand-side matrix elements are constants in $GF(2^w)$. This operation is decomposed into vector-by-constant-vector multiplications, themselves decomposed into multiplications by constants and $xor$ accumulations.

References:

  • [1] Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions (pdf)
  • [2] Improving the coding speed of erasure codes with polynomial ring transforms (pdf)
  • [3] A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields, C. Paar (pdf)
  • [4] Jerasure library manual (pdf)
  • [5] Fast and Secure Finite Field Multipliers (pdf) by Pamula and Tisserand (contains description of Mastrovito's method)