Coding Illness: risc


RISC-V Vector Register Groups

The RISC-V Vector Extension (RVV) defines and exploits the concept of vector register groups. This post clarifies what a register group is, how it is useful, and the caveats associated with it.

A single vector instruction can operate on operands each consisting of more than one register and can produce similarly extended results.

A group is a bundle of multiple registers. RVV 1.0 supports groups of 1, 2, 4 or 8 registers. The indices of the registers in a group span a contiguous range and the lowest index must be a multiple of the group size. For example, the vector registers [4, 5, 6, 7] form a valid group of size 4, but [5, 6] is not a valid group of size 2. The lowest index is used to specify the vector register group in an instruction opcode: e.g. when LMUL=4, v4 stands for v4v5v6v7. Thus the actual sizes of the operands (and destination) of an instruction depend on the context (the vtype configuration, as we will see later): vadd.vv v4, v0, v12 may operate on register groups of 1, 2 or 4 registers depending on the context (8 is not a legal size as neither v4 nor v12 is aligned on an index multiple of 8).
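The alignment rule above can be sketched in a few lines of C. This is only an illustration: the function names (rvv_group_is_valid, rvv_max_common_lmul) are hypothetical helpers, not part of any RVV toolchain, and only the integral group sizes 1, 2, 4 and 8 are considered.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when `lmul` is one of the integral group sizes of RVV 1.0. */
static bool rvv_lmul_is_integral(unsigned lmul) {
    return lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8;
}

/* Returns true when register v<base> can name a group of `lmul` registers:
 * the base index must be a multiple of the group size and lie in v0..v31. */
bool rvv_group_is_valid(unsigned base, unsigned lmul) {
    assert(rvv_lmul_is_integral(lmul));
    return base < 32 && (base % lmul) == 0;
}

/* Largest integral LMUL for which all three operands name valid groups. */
unsigned rvv_max_common_lmul(unsigned vd, unsigned vs2, unsigned vs1) {
    for (unsigned lmul = 8; lmul > 1; lmul /= 2) {
        if (rvv_group_is_valid(vd, lmul) && rvv_group_is_valid(vs2, lmul) &&
            rvv_group_is_valid(vs1, lmul)) {
            return lmul;
        }
    }
    return 1;
}
```

With these helpers, rvv_max_common_lmul(4, 0, 12) evaluates to 4, which matches the vadd.vv v4, v0, v12 example: groups of 1, 2 or 4 registers are possible, but not 8.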

The size of a vector register group is called the vector length multiplier (a.k.a. LMUL) in the RVV specification: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#342-vector-register-grouping-vlmul20.

For an operation, LMUL is set by the vlmul field in vtype (so it is usually modified through the vsetvli, vsetivli or vsetvl instructions). The following RISC-V assembly sequence first sets LMUL to 8 and then executes a double precision vector addition.

	vsetvli	x1, zero, e64, m8, ta, mu
	vfadd.vv	v16, v8, v24

Since the LMUL value is 8, the actual sequence of operation is equivalent to the following (assuming each instruction operates on a single-register wide group):

	vfadd.vv	v16, v8,  v24
	vfadd.vv	v17, v9,  v25
	vfadd.vv	v18, v10, v26
	vfadd.vv	v19, v11, v27
	vfadd.vv	v20, v12, v28
	vfadd.vv	v21, v13, v29
	vfadd.vv	v22, v14, v30
	vfadd.vv	v23, v15, v31
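The expansion above is mechanical, and a small C sketch can reproduce it. The type rvv_triple_t and the two functions are hypothetical names introduced for illustration; the sketch assumes an element-wise instruction whose three operands all have EMUL equal to LMUL.

```c
#include <assert.h>
#include <stdio.h>

/* One single-register operation: destination and two sources. */
typedef struct {
    unsigned vd, vs2, vs1;
} rvv_triple_t;

/* Fills `out` with the `lmul` single-register operations equivalent to one
 * grouped "op vd, vs2, vs1". Returns the number of entries written. */
unsigned rvv_expand_group(unsigned vd, unsigned vs2, unsigned vs1,
                          unsigned lmul, rvv_triple_t out[8]) {
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    /* Every operand must name a properly aligned register group. */
    assert(vd % lmul == 0 && vs2 % lmul == 0 && vs1 % lmul == 0);
    for (unsigned i = 0; i < lmul; ++i) {
        out[i].vd = vd + i;
        out[i].vs2 = vs2 + i;
        out[i].vs1 = vs1 + i;
    }
    return lmul;
}

/* Prints the expansion, one single-register instruction per line. */
void rvv_print_expansion(const char *mnemonic, unsigned vd, unsigned vs2,
                         unsigned vs1, unsigned lmul) {
    rvv_triple_t ops[8];
    unsigned n = rvv_expand_group(vd, vs2, vs1, lmul, ops);
    for (unsigned i = 0; i < n; ++i) {
        printf("\t%s\tv%u, v%u, v%u\n", mnemonic, ops[i].vd, ops[i].vs2,
               ops[i].vs1);
    }
}
```

Calling rvv_print_expansion("vfadd.vv", 16, 8, 24, 8) lists the eight single-register additions from v16, v8, v24 up to v23, v15, v31.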

LMUL encoding in vtype

In RVV 1.0, vlmul can take 4 values for integral LMUL: 0b000 (LMUL=1), 0b001 (LMUL=2), 0b010 (LMUL=4), 0b011 (LMUL=8). It can also take 3 fractional values: 0b111 (LMUL=1/2), 0b110 (LMUL=1/4), 0b101 (LMUL=1/8), but more on that in a future post.
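As a sketch, the encoding can be decoded with a small C function. To keep fractional values integral, LMUL is expressed here in eighths of a register, which is a representation chosen for this example only. The second function packs the low bits of vtype following the field layout of the RVV 1.0 specification (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7); both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* LMUL expressed in eighths of a register so that fractions stay integral:
 * 1/8 -> 1, 1/4 -> 2, 1/2 -> 4, 1 -> 8, 2 -> 16, 4 -> 32, 8 -> 64.
 * Returns 0 for the reserved encoding 0b100. */
unsigned rvv_vlmul_to_eighths(unsigned vlmul) {
    assert(vlmul < 8);
    switch (vlmul) {
    case 0: return 8;   /* LMUL=1   */
    case 1: return 16;  /* LMUL=2   */
    case 2: return 32;  /* LMUL=4   */
    case 3: return 64;  /* LMUL=8   */
    case 5: return 1;   /* LMUL=1/8 */
    case 6: return 2;   /* LMUL=1/4 */
    case 7: return 4;   /* LMUL=1/2 */
    default: return 0;  /* 0b100 is reserved */
    }
}

/* Packs the low 8 bits of vtype. sew_bits must be 8, 16, 32 or 64. */
unsigned rvv_encode_vtype(unsigned sew_bits, unsigned vlmul, bool ta, bool ma) {
    unsigned vsew = 0;
    assert(sew_bits == 8 || sew_bits == 16 || sew_bits == 32 || sew_bits == 64);
    assert(vlmul < 8 && vlmul != 4);
    /* vsew encodes log2(SEW / 8). */
    while ((8u << vsew) != sew_bits) {
        ++vsew;
    }
    return (ma ? 0x80u : 0u) | (ta ? 0x40u : 0u) | (vsew << 3) | vlmul;
}
```

For the configuration used earlier (e64, m8, ta, mu), rvv_encode_vtype(64, 3, true, false) yields 0x5B.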

LMUL vs EMUL

The actual length multiplier used for an operand or destination of an operation is called EMUL, for effective LMUL. EMUL depends on vlmul, but also on the operation. For example, EMUL=LMUL for every source and destination of vadd.vv, but EMUL=2*LMUL for both vd and vs2 of vwadd.wv (because of its widening characteristic). EMUL can never be larger than 8 (a vector register group cannot exceed 8 registers), so the LMUL=8 configuration is reserved for widening instructions, which would require EMUL=16: it is not supported in RVV 1.0.
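The widening constraint can be written down as a short check. This is a sketch with hypothetical function names; LMUL and EMUL are expressed in eighths of a register (LMUL=1 is 8, LMUL=8 is 64) so that fractional values remain integers. It only checks the EMUL limit and ignores other constraints such as the maximum element width.

```c
#include <assert.h>
#include <stdbool.h>

/* EMUL of the wide operands of a widening instruction (vd, and vs2 for the
 * .wv / .wx forms), in eighths of a register. */
unsigned rvv_widening_emul_eighths(unsigned lmul_eighths) {
    assert(lmul_eighths >= 1 && lmul_eighths <= 64);
    /* LMUL is always a power of two. */
    assert((lmul_eighths & (lmul_eighths - 1)) == 0);
    return 2 * lmul_eighths;
}

/* A widening instruction is only usable if its wide EMUL does not exceed 8. */
bool rvv_widening_is_legal(unsigned lmul_eighths) {
    return rvv_widening_emul_eighths(lmul_eighths) <= 64;
}
```

Under this model a widening instruction is accepted for LMUL=4 (wide EMUL=8) and rejected for LMUL=8 (wide EMUL would be 16).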

Some operations which embed the effective element width (EEW) in their opcode define EMUL so as to preserve the ratio EEW/EMUL=SEW/LMUL. For example, if LMUL=1 and SEW=8-bit, then vle32.v has EEW=32-bit and EMUL=EEW*LMUL/SEW=4; this also means that the destination register group must be designated by a register whose index is a multiple of 4 (EMUL).
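The ratio rule translates directly into integer arithmetic. The sketch below uses a hypothetical helper, rvv_emul_eighths, and again expresses LMUL and EMUL in eighths of a register; it assumes that EEW, SEW and LMUL are all powers of two, as they are in RVV.

```c
#include <assert.h>

/* Returns EMUL, in eighths of a register, for an instruction with an explicit
 * EEW, such that EEW/EMUL == SEW/LMUL. Returns 0 when the result falls
 * outside the supported range [1/8, 8]. */
unsigned rvv_emul_eighths(unsigned eew_bits, unsigned sew_bits,
                          unsigned lmul_eighths) {
    assert(eew_bits != 0 && sew_bits != 0 && lmul_eighths != 0);
    unsigned numerator = eew_bits * lmul_eighths;
    /* With power-of-two inputs, a non-zero remainder means EMUL < 1/8. */
    if (numerator % sew_bits != 0) {
        return 0;
    }
    unsigned emul = numerator / sew_bits;
    return (emul <= 64) ? emul : 0;
}
```

For the example in the text (EEW=32, SEW=8, LMUL=1), the function returns 32 eighths, i.e. EMUL=4.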

Since EMUL can differ between an instruction's operands and its result, partial or complete overlaps between register groups are possible. RVV 1.0 authorizes 3 types of overlap between source and destination register groups. Those overlaps are specified here https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#52-vector-operands and illustrated by the figure below (example of the v0v1v2v3 register group in various source/destination EEW configurations).
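As we read section 5.2 of the specification, an overlap is allowed when the destination EEW equals the source EEW, when the destination EEW is smaller and the overlap is in the lowest-numbered part of the source group, or when the destination EEW is greater, the source EMUL is at least 1 and the overlap is in the highest-numbered part of the destination group. The C sketch below models these three cases. It is a simplification: the type and function names are hypothetical, and only integral EMUL values are handled (so the "source EMUL at least 1" condition always holds).

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned base; /* index of the first register of the group */
    unsigned emul; /* number of registers in the group: 1, 2, 4 or 8 */
    unsigned eew;  /* effective element width in bits */
} rvv_operand_t;

/* Convenience constructor for an operand description. */
rvv_operand_t rvv_operand(unsigned base, unsigned emul, unsigned eew) {
    rvv_operand_t op = { base, emul, eew };
    assert(emul == 1 || emul == 2 || emul == 4 || emul == 8);
    assert(base % emul == 0 && base + emul <= 32);
    return op;
}

/* True when the two register ranges share at least one register. */
static bool rvv_groups_overlap(rvv_operand_t a, rvv_operand_t b) {
    return a.base < b.base + b.emul && b.base < a.base + a.emul;
}

/* True when destination vd may be used together with source vs. */
bool rvv_dest_source_overlap_ok(rvv_operand_t vd, rvv_operand_t vs) {
    if (!rvv_groups_overlap(vd, vs)) {
        return true; /* disjoint groups are always fine */
    }
    if (vd.eew == vs.eew) {
        return true; /* same EEW: overlap allowed */
    }
    if (vd.eew < vs.eew) {
        /* narrowing: destination must sit in the lowest part of the source */
        return vd.base == vs.base;
    }
    /* widening: source must sit in the highest part of the destination */
    return vs.base + vs.emul == vd.base + vd.emul;
}
```

With LMUL=1, the narrowing case vnsrl.wi v0, v0, 3 is accepted by this model while a destination of v1 is rejected, which matches the example given in the specification.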

Benefits of register grouping

There are several benefits associated with the availability of register groups:

- even if architectural registers are "only" VLEN-bit wide, single instructions can operate on larger operands and produce larger results:
  - for widening operation results or narrowing operation sources, the possibility of building a register group enables the use of a full register source or the population of a full register destination
  - providing large input sets: for example, for a vrgather the input data set contains VLMAX elements, where VLMAX=VLEN/SEW*LMUL increases with LMUL: one can build a larger lookup table by using larger register groups
  - allowing reuse of operands between extended registers (e.g. immediate or scalar operand)
  - reducing code size: for data parallel sections with vector lengths larger than what can fit in a single vector register, fewer instructions are required to operate on the same amount of data
  - reducing the number of loop iterations: since VLMAX increases with LMUL, fewer iterations are required during loop stripmining when LMUL increases
- exposing more opportunities to the micro-architecture:
  - even if a micro-architecture cannot operate directly on register groups larger than VLEN bits, it can still exploit the micro-op parallelism and chaining opportunities exposed by larger register groups
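The effect of LMUL on VLMAX and on the number of stripmining iterations can be estimated with two small C helpers. The function names are hypothetical, only integral LMUL is considered, and the iteration count assumes that each iteration processes VLMAX elements except possibly the last one.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX = VLEN / SEW * LMUL, for an integral LMUL. */
size_t rvv_vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul) {
    assert(sew_bits != 0 && vlen_bits % sew_bits == 0);
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return vlen_bits / sew_bits * lmul;
}

/* Number of iterations of a stripmined loop over n elements when each
 * iteration handles up to vlmax elements. */
size_t rvv_stripmine_iterations(size_t n, size_t vlmax) {
    assert(vlmax != 0);
    return (n + vlmax - 1) / vlmax;
}
```

For instance, with VLEN=128 and SEW=32, a 100-element loop needs 25 iterations at LMUL=1 (VLMAX=4) and 4 iterations at LMUL=8 (VLMAX=32).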

Constraints induced by register grouping

One of the caveats of register grouping is that the number of available register groups decreases as the length multiplier increases. For example, there are only four 8-wide register groups: v0, v8, v16, v24. This means that register allocation pressure can be a limiting factor when using larger register groups: it becomes more difficult to keep in registers all the variables that are simultaneously alive in a sequence, and spilling to memory may be required.

This pressure is further increased by the fact that v0 is used as the mask operand in RVV 1.0, which may also limit the number of register groups: when using masked operations with large register groups, v0 often prevents a program from using the first register group. For example, when LMUL=4, if one wants to use v0 as the mask input, then one cannot use the v0,v1,v2,v3 group. This group can still be partially exploited, for example with LMUL=1 to build the v0 mask or with LMUL=2 (v2,v3).
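The register pressure described above is easy to quantify. The two helpers below are hypothetical and follow the simplification made in the text: when v0 holds a mask, the whole group containing v0 is considered unavailable for data.

```c
#include <assert.h>

/* Number of architectural register groups available for an integral LMUL. */
unsigned rvv_group_count(unsigned lmul) {
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return 32 / lmul;
}

/* Number of groups left for data when v0 is reserved for the mask:
 * the group that contains v0 is no longer usable as a whole. */
unsigned rvv_group_count_masked(unsigned lmul) {
    return rvv_group_count(lmul) - 1;
}
```

At LMUL=8 this leaves only 3 data groups (v8, v16, v24) for masked code, compared with 31 registers at LMUL=1.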

Larger inputs may also not be handled efficiently by the implementation: for example, an LMUL=8 input data set for a vrgather may not be supported with the same latency and/or throughput as a vrgather with LMUL=1, since each result element may originate from a larger pool. Thus, even if the instruction configuration is legal, it may not be efficient, but the implementation still needs to support it one way or another.

Conclusion

Vector register groups and the associated length multiplier are an interesting concept, which can be leveraged thanks to the vector length agnostic characteristic of RVV: software built around the vector length parameter can be agnostic of VLEN and is thus easily extended to VLEN*LMUL register groups. If you want to learn more about the RISC-V Vector Extension, you may consider reading our RVV in a Nutshell blog series, starting with part 1.

Resources

"Vector Operands" section of the specification detailing EMUL https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#sec-vec-operands
Mapping of vector elements in vector register groups when LMUL > 1: https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#sec-vec-operands

How to read a RISC-V Vector assembly instruction

In our five-and-a-half-part blog series, RVV in a Nutshell, we presented the basics of the RISC-V Vector extension (RVV 1.0), but even after this overview some aspects of this large extension can still seem difficult to grasp. In this sub-series, we will review some of those aspects in more detail. For example, it can be difficult to interpret an RVV assembly instruction. Let's review the main components of such instructions and study various examples.

Overview

We will draw some generalities from a practical example: a masked integer vector-vector addition vadd.vv v12, v3, v4, v0.t.

The following diagram illustrates the 6 main components of any RVV assembly instruction. Most of those components have numerous variants and some of them are optional.

RVV assembly instruction field description

Mnemonic

The first component is the mnemonic, which describes the operation performed by the instruction (e.g. in our case vadd performs a vector add, while vcompress packs the elements of a vector selected by a mask).

The mnemonic often describes the destination type: for example vadd is a vector add with a single-width destination, while vwadd is a widening vector add and vmadc is a vector addition with carry returning a mask of output carries.

Operand type(s)

The second component is the operand type(s). It describes on what type of operand(s) the operation is performed. The most common is .vv (vector-vector) which often means that the instruction admits at least two inputs and both are single-width vectors.

The list of various possibilities includes:

.vv: operation between two (or three) single-width vectors, e.g. vmul.vv
.wv: operation between two vectors, vs1 is single-width while vs2 and vd are wide operands/destinations (EEW=2*SEW), e.g. vwadd.wv
.vx / .vf: operation between one or multiple vectors and a scalar (general purpose register: x, or floating-point register: f), e.g. vfadd.vf; the scalar operand is splat to build a vector operand
.wx / .wf: operation between a scalar and a wide second vector operand, e.g. vfwadd.wf
.vi: operation between one or multiple vectors and an immediate. The immediate is often 5-bit wide and encoded in the rs1/vs1 field, e.g. vadd.vi
.vs: operation between a vector and a single element (scalar) contained in a vector register, e.g. vredsum.vs (the single scalar is used to carry the reduction accumulator)
.vm: operation between a vector and a mask, e.g. vcompress.vm
.v: operation with a single vector input, e.g. vmv2r.v; may also be used for vector loads (which have scalar operands for the address on top of a single data vector operand), e.g. vle16.v
.vvm / .vxm / .vim: vector-vector / vector-scalar / vector-immediate operation with a mask operand (e.g. vadc.vvm, an addition between two vectors with an extra mask operand constituting an input carry vector)
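The list above can be turned into a small lookup helper. This is only a sketch: rvv_operand_type is a hypothetical function that inspects the last dot-separated suffix of a mnemonic, and the returned strings simply paraphrase the list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a short description of the operand type encoded by the last
 * suffix of an RVV mnemonic, or "unknown" if it is not in the table. */
const char *rvv_operand_type(const char *mnemonic) {
    static const struct {
        const char *suffix;
        const char *meaning;
    } table[] = {
        { ".vv", "vector-vector" },
        { ".wv", "wide vector-vector" },
        { ".vx", "vector-scalar (x register)" },
        { ".vf", "vector-scalar (f register)" },
        { ".wx", "wide vector-scalar (x register)" },
        { ".wf", "wide vector-scalar (f register)" },
        { ".vi", "vector-immediate" },
        { ".vs", "vector and single element" },
        { ".vm", "vector-mask" },
        { ".v", "single vector" },
        { ".vvm", "vector-vector with mask operand" },
        { ".vxm", "vector-scalar with mask operand" },
        { ".vim", "vector-immediate with mask operand" },
    };
    assert(mnemonic != NULL);
    /* The operand type is the part after the last dot. */
    const char *dot = strrchr(mnemonic, '.');
    if (dot == NULL) {
        return "unknown";
    }
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if (strcmp(dot, table[i].suffix) == 0) {
            return table[i].meaning;
        }
    }
    return "unknown";
}
```

For example, rvv_operand_type("vadc.vvm") returns "vector-vector with mask operand". Only the last suffix is examined, so a conversion such as vfcvt.x.f.v is reported as "single vector".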

Conversions constitute a category of their own for the operand types, because the mnemonic suffix describes the destination format, the source format, and the type of operand. For example, vfcvt.x.f.v is a vector (.v) conversion from floating-point elements (.f) to signed integer (.x) result elements. .xu is used to indicate unsigned integers, and .rtz is used to indicate a static round-towards-zero rounding mode.

Destination and source(s)

In the assembly instruction, the destination and sources follow the mnemonic. The destination is the first register to appear, followed by one or multiple sources.

Each of those elements encodes a register group. The destination and source register groups are represented by the first register in the group (for example, if LMUL=4, then v12 represents the 4-wide register group v12v13v14v15). Thus the actual register group depends on the assembly opcode but also on the value of vtype: it is context sensitive. Most RVV operations have a vector destination, denoted by vd; some may have a scalar destination (e.g. vmv.x.s with an x register destination or vfmv.f.s with an f register destination) and others have a memory destination, such as the vector stores, e.g. vse32.v.

There can be one or two sources: vs2 and vs1 for vector-vector instructions. If the operation admits a scalar operand or an immediate operand, then vs1 is replaced by rs1 (respectively imm), e.g. vfadd.vf v4, v2, ft3. Vector loads have a memory source, e.g. vloxei8.v vd, (rs1), vs2 [, vm], which has a scalar register as base address source and a vector register as index source.

RVV also defines 3-operand instructions, e.g. vmacc.vv. For those operations the destination register vd is both a source and a destination. The operation is destructive: one of the source operands is overwritten by the result.
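A scalar C model makes the destructive behaviour explicit. The function below is a hypothetical sketch of vmacc.vv vd, vs1, vs2 on 32-bit elements, where vd[i] = (vs1[i] * vs2[i]) + vd[i]; masking and tail handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vmacc.vv vd, vs1, vs2 for SEW=32:
 * vd is read as the accumulator and then overwritten with the result. */
void model_vmacc_vv(int32_t *vd, const int32_t *vs1, const int32_t *vs2,
                    size_t vl) {
    assert(vd != NULL && vs1 != NULL && vs2 != NULL);
    for (size_t i = 0; i < vl; ++i) {
        /* Unsigned arithmetic models the wrap-around of the hardware
         * without relying on signed overflow. */
        uint32_t product = (uint32_t)vs1[i] * (uint32_t)vs2[i];
        vd[i] = (int32_t)(product + (uint32_t)vd[i]);
    }
}
```

After the call, the previous content of vd is lost: a program that still needs the old accumulator must copy it to another register (here, another array) first.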

Mask operand

Most RVV operations can be masked: in that case the v0 register is used as a mask to determine which elements are active, and which elements of the result will be copied from the old destination value or filled with a pre-determined pattern. RVV 1.0 only supports true bits as active masks: the element is considered active if the bit at the corresponding index is set to 1 in v0, and inactive if it is 0. This is what is encoded by the last operand of our example: v0.t (for v0 "true"). If this last operand is missing, then the operation is unmasked (all body elements are considered active).
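The masking rule can be illustrated by a scalar C model of a masked vector-vector add. The function is a hypothetical sketch for 32-bit elements: the mask register is modelled as a byte array in which bit i belongs to element i, and inactive elements keep their old destination value (the mask-undisturbed policy is assumed here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vadd.vv vd, vs2, vs1, v0.t for SEW=32. */
void model_vadd_vv_masked(int32_t *vd, const int32_t *vs2, const int32_t *vs1,
                          const uint8_t *v0, size_t vl) {
    assert(vd != NULL && vs2 != NULL && vs1 != NULL && v0 != NULL);
    for (size_t i = 0; i < vl; ++i) {
        /* Element i is active when bit i of the mask register is 1. */
        bool active = ((v0[i / 8] >> (i % 8)) & 1u) != 0;
        if (active) {
            vd[i] = (int32_t)((uint32_t)vs2[i] + (uint32_t)vs1[i]);
        }
        /* Inactive elements are left untouched (mask undisturbed). */
    }
}
```

With a mask byte of 0x05 and vl=4, only elements 0 and 2 are updated; elements 1 and 3 keep their previous content.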

More information can be found in this post of the original series: RVV in a nutshell (part 3): operations with and on masks.

Conclusion

We hope this post has shed some light on the syntax of RISC-V Vector assembly instructions. We will review other concepts related to the vector extension in future posts.

Reference:

https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#101-vector-arithmetic-instruction-encoding

Programming with RISC-V Vector extension: how to build and execute a basic RVV test program (emulation/simulation)

Update(s):

- Jan 16th 2022 adding section on Objdump

The RISC-V Vector Extension (RVV) has recently been ratified in its version 1.0 (announcement and specification pdf). The 1.0 milestone is key: it means that RVV has reached a stable state. Numerous commercial and free implementations of the standard are appearing, and software developers can now dedicate significant effort to porting and developing libraries on top of RVV without fear of seeing the specification rug pulled from under their feet. In this article we will review how to build a version of the clang compiler compatible with RVV (v0.10) and how to develop, build and execute our first RVV program.

Building the compiler

Before building a compiler for RVV, we need a basic riscv toolchain. This toolchain will provide the standard library and some basic tools required to build a functioning binary. The toolchain will be installed under ~/RISCV-TOOLS/ (feel free to adapt this directory to your setup).

# update to your intended install directory
export RISCV=~/RISCV-TOOLS/

# downloading basic riscv gnu toolchain, providing:
# - runtime environment for riscv64-unknown-elf (libc, ...)
# - spike simulator
git clone https://github.com/riscv-collab/riscv-gnu-toolchain
cd riscv-gnu-toolchain
./configure --prefix=$RISCV
make -j3

Compiling for RVV requires a recent version of clang (this was tested with clang 14).

# downloading llvm-project source from github
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
# configuring build to build llvm and clang in Release mode
# using ninja
# and to use gold as the linker (less RAM required)
# limiting targets to RISCV, and using riscv-gnu-toolchain
# as basis for sysroot
cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;" \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_USE_LINKER=gold \
      -DDEFAULT_SYSROOT="$RISCV/riscv64-unknown-elf/" \
      -DGCC_INSTALL_PREFIX="$RISCV" \
      -S llvm -B build/ -DLLVM_TARGETS_TO_BUILD="RISCV"
# building clang/llvm using 4 jobs (can be tuned to your machine)
cmake --build build/ -j4

Building clang/llvm requires a large amount of RAM (8GB seems to be the bare minimum, 16GB is best) and will consume a lot of disk space. Those requirements can be reduced by selecting the Release build type (rather than the default Debug) and by using the gold linker.

This process will generate clang binary in llvm-project/build/bin/clang .

More information on how to download and build clang/llvm can be found on the project github page.

Development

The easiest way to develop software for RVV is to rely on the rvv intrinsics. This project offers intrinsics for most of the instructions of the extension. The documentation is accessible on github and support is appearing in standard compilers (most notably clang/llvm).

As a first exercise, we will use the SAXPY example from rvv-intrinsic-doc rvv_saxpy.c.
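For readers who do not have rvv_saxpy.c at hand, the computation can be summarized by the plain C sketch below. It is not the code from rvv-intrinsic-doc and uses no RVV intrinsics: saxpy_reference is a scalar reference, and saxpy_stripmined emulates the loop structure of a vector implementation, with the vlmax parameter standing in (as an assumption) for the vector length returned by the hardware at each iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: y[i] = a * x[i] + y[i]. */
void saxpy_reference(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Stripmined emulation: processes at most vlmax elements per iteration,
 * mimicking the structure of a vector loop. */
void saxpy_stripmined(size_t n, float a, const float *x, float *y,
                      size_t vlmax) {
    assert(vlmax != 0);
    while (n > 0) {
        /* Plays the role of the vector length selection of each iteration. */
        size_t vl = (n < vlmax) ? n : vlmax;
        for (size_t i = 0; i < vl; ++i) {
            y[i] = a * x[i] + y[i];
        }
        x += vl;
        y += vl;
        n -= vl;
    }
}
```

Both functions are expected to produce the same results for any positive vlmax, which makes the scalar version a convenient reference when checking the output of a vectorized build.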

Building the simulator

Let's first build an up-to-date proxy kernel pk:

# downloading and installing proxy-kernel for riscv64-unknown-elf
git clone https://github.com/riscv-software-src/riscv-pk.git
cd riscv-pk
mkdir build && cd build
../configure --prefix=$RISCV --host=riscv64-unknown-elf
make -j4 && make install

Let's now build the simulator (directly from the top of the master branch, why not !).

git clone https://github.com/riscv-software-src/riscv-isa-sim
cd riscv-isa-sim
mkdir build && cd build
../configure --prefix=$RISCV
make -j4 && make install

Building the program

RVV is supported as part of the experimental extensions of clang. It must therefore be enabled explicitly when invoking clang, and it must be associated with a version number: the current master of clang only supports v0.10 of the RVV specification.

clang -L $RISCV/riscv64-unknown-elf/lib/ --gcc-toolchain=$RISCV/ \
       rvv_saxpy.c -menable-experimental-extensions -march=rv64gcv0p10 \
       -target riscv64 -O3 -mllvm --riscv-v-vector-bits-min=256 \
       -o test-riscv-clang

Executing

To execute the program we are going to use the spike simulator and the riscv-pk proxy kernel.

Spike is part of the riscv-gnu-toolchain available at https://github.com/riscv-collab/riscv-gnu-toolchain ; riscv-pk is also available on github: https://github.com/riscv-software-src/riscv-pk

The binary image of pk must be the first unnamed argument to spike, before the main elf.

$RISCV/bin/spike --isa=rv64gcv $RISCV/riscv64-unknown-elf/bin/pk \
                  test-riscv-clang

NOTES: I tried to use riscv-tools (https://github.com/riscv-software-src/riscv-tools), but it does not seem to be actively maintained and several issues popped up when I tried building it.

Objdump

Not all objdump builds support the RISC-V vector extension. If you have built llvm as indicated above, you should be able to use the llvm-objdump program built alongside it to disassemble a program with vector instructions.

llvm-objdump -d --mattr=+experimental-v <binary_file>

References

IREE (MLIR dialect) page of riscv-v cross compilation https://google.github.io/iree/building-from-source/riscv/
Official llvm github project https://github.com/llvm/llvm-project
RVV intrinsic documentation on github https://github.com/riscv-non-isa/rvv-intrinsic-doc
https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/GiTkPw-9r8A?pli=1
Properly configuring clang/llvm build https://stackoverflow.com/questions/68580399/using-clang-to-compile-for-risc-v

Assisted assembly development for RISC-V RV32

In this post we will present how the assembly development environment tool (asmde) can ease assembly program development for the RISC-V ISA.

We will develop a basic floating-point vector add routine.

Introducing ASMDE

The ASseMbly Development Environment (asmde, https://github.com/nibrunie/asmde) is an open-source set of Python utilities to help the assembly developer. The main, eponymous utility, asmde, is a register assignment script. It consumes a templatized assembly source file and fills in variable names with legal registers, removing the burden of register allocation from the developer.

Recently, alpha support for RV32 (32-bit version of RISC-V) was added to asmde. We are going to demonstrate how to use it in this post.

Vector-Add testbench

The example we chose to implement is a basic vector add.

/** Basic single-precision vector add
 *  @param dst destination array
 *  @param lhs left-hand side operand array
 *  @param rhs right-hand side operand array
 *  @param n vector sizes
 */
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n);

The program is split in two files:

- a test bench main.c

- an asmde template file vec_add.template.S

Review of the assembly template

The listing below presents the input template. It consists of a basic assembly source file extended with some asmde-specific constructs.

// testing for basic RISC-V RV32I program
// void my_vadd(float* dst, float* src0, float* src1, unsigned n)
//#PREDEFINED(a0, a1, a2, a3)
        .option nopic
        .attribute arch, "rv32i2p0_m2p0_a2p0_f2p0_d2p0"
        .attribute unaligned_access, 0
        .attribute stack_align, 16
        .text
        .align  1
        .globl  my_vadd
        .type   my_vadd, @function
my_vadd:
        // check for early exit condition n == 0
        beq a3, x0, end
loop:
        // load inputs
        flw F(LHS), 0(a1)
        flw F(RHS), 0(a2)
        // operation
        fadd.s F(ACC), F(LHS), F(RHS)
        // store result
        fsw F(ACC), 0(a0)
        // update addresses
        addi a1, a1, 4
        addi a2, a2, 4
        addi a0, a0, 4
        // update loop count
        addi a3, a3, -1
        // branch if not finished
        bne x0, a3, loop
end:
        ret
        .size   my_vadd, .-my_vadd
        .section        .rodata.str1.8,"aMS",@progbits,1

ASMDE Macro

The mandatory comments are followed by an asmde macro, PREDEFINED.

This macro indicates to asmde assignator that the argument list of registers should be considered alive when entering the function. It is often used to list function arguments.

ASMDE Variable

The second construct provided by asmde are the assembly variables.

                flw F(LHS), 0(a1)
                flw F(RHS), 0(a2)
                // operation
                fadd.s F(ACC), F(LHS), F(RHS)
                // store result
                fsw F(ACC), 0(a0)

Those variables are of the form <specifier>(<varname>). In this example we use the specifier F for floating-point register variables. The specifiers X or I can be used for integer registers. These variables are used to manipulate (write to / read from) virtual registers. asmde will perform the register assignment, taking into account the instruction semantics and the program structure.

Here for example, we used F(LHS) variable to load an element of the left-hand side vector, F(RHS) to load elements from the right-hand side vector and F(ACC) contains the sum of those two variables which is later stored back into the destination array.

Assembly template translation

asmde can be invoked as follows to generate an assembly file with assigned registers:

python3 asmde.py -S --arch rv32 \
                 examples/riscv/test_rv32_vadd.S \
                 --output vadd.S

Building and executing the test program

We can build our toy example alongside a small testbench:

#include <stdio.h>

#ifdef LOCAL_IMPLEMENTATION
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n){
    unsigned i;
    for (i = 0; i < n; ++i)
        dst[i] = lhs[i] + rhs[i];
}
#else
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n);
#endif


int main() {
    float dst[4];
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {4.0f, 3.0f, 2.0f, 1.0f};
    my_vadd(dst, a, b, 4);

    int i;
    for (i = 0; i < 4; ++i) {
        if (dst[i] != 5.0f) {
            printf("failure\n");
            return -1;
        }
    }

    printf("success\n");
    return 0;
}

And finally execute it.

(requires rv32 gnu toolchain and a 32-bit proxy kernel pk)

# building test program
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -o test_vadd vadd.S test_vadd.c
# executing binary
$ spike --isa=RV32gc riscv32-unknown-elf/bin/pk  ./test_vadd

Conclusion

I hope this small example was useful to you and that you will be able to use asmde in your own project.

If you find issues (there are many), you can report them on github https://github.com/nibrunie/asmde/issues/new/choose . If you have some feedback do not hesitate to write a comment here.

Happy hacking with RISC-V.

References:

- asmde github page: https://github.com/nibrunie/asmde

- RISC-V unprivileged ISA specification

- GNU Toolchain for RISC-V

- Programming with RISC-V vector instructions