
RISC-V Compressed Instructions (part 1): C extension

 RISC-V base ISAs (RV32I and RV64I) define 32-bit wide instructions. These instructions follow the standard RISC instruction set architecture pattern: they fall into a small set of possible encodings with opcode, register indices and immediate fields always located in the same bit positions, which simplifies decoding. RISC-V instructions can have up to 2 source registers and 1 destination register (the floating-point fused multiply-add instructions such as fmadd add a third source register field). Each register is encoded as a 5-bit index: there are a total of 32 general purpose registers which can be read or written by an instruction (you can find more information on the RISC-V register files in this post). In the common formats this means up to 15 bits of the instruction word are dedicated to source/destination encoding, leaving the other 17 bits for immediate and function encodings.

Most programs rely heavily on a small subset of instructions which have a very large static and dynamic frequency compared to all other instructions: static frequency counts occurrences in the program binary (e.g. from an objdump) while dynamic frequency counts occurrences during execution (e.g. from an execution trace). Code structure (loops, functions, ...) and input values affect static and dynamic frequencies differently.
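
The distinction can be sketched with a toy example (hypothetical instruction names and loop counts, not real RISC-V tooling): the static count looks at the listing once, the dynamic count follows the execution trace.

```python
from collections import Counter

# Hypothetical 5-instruction listing containing a loop body executed 100 times.
program = ["addi", "lw", "add", "bne", "ret"]              # static view (binary)
trace = ["addi"] + ["lw", "add", "bne"] * 100 + ["ret"]    # dynamic view (trace)

static_freq = Counter(program)
dynamic_freq = Counter(trace)

print(static_freq["lw"])   # 1: lw appears once in the binary
print(dynamic_freq["lw"])  # 100: lw executes once per loop iteration
```

The same instruction can thus rank very differently in the two metrics, which is why ISA designers look at both when choosing what to compress.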

ISA design can exploit the fact that instructions appear with different frequencies to reduce code size, by encoding the most used instructions on fewer bits than a general instruction. This benefits both low power and high performance targets: less memory is required to store instructions, more instructions fit in caches, and fetching N bytes provides more uops. The costs are a more complex decoder and the caveat that instruction addresses are no longer aligned to a 32-bit boundary in memory.

Early on, RISC-V was extended with compressed instructions, and multiple standard extensions now offer them (with encodings shorter than the initial 32 bits and/or fusing operations from multiple standard instructions). We have divided this survey of RISC-V compressed instructions into two posts: this post reviews some aspects of the first compressed extension, the C extension, while the second one will review the newer Zc* extensions.

RISC-V C extension

To allow code size improvements, the RISC-V base ISA was rapidly complemented with the Compressed extension, a.k.a. the C extension. This extension is defined in Chapter 16 of the unprivileged specification (see pdf).

The C extension defines 9 new instruction formats, each fitting in 16 bits. It is not a standalone extension but is built on top of RV32I or RV64I. It defines a little over 40 new instructions, some of them only available for larger XLEN (RV64 or RV128), some limited to smaller XLEN (e.g. RV32FC), and some requiring the floating-point extension (F) or the double precision extension (D). Each instruction in the C extension has an uncompressed counterpart (and can be expanded to this counterpart by the micro-architecture to simplify instruction decoding and execution).

In an assembly program the compressed instructions are easily distinguishable: they start with the prefix "c.", e.g. c.addi is the compressed add with a non-zero 6-bit immediate. We will review some of the compressed instruction formats and how they compare with their expanded counterparts: arithmetic operations with immediate, arithmetic operations with registers, and loads/stores with a focus on loads from the stack.

CI-type: compressed operations with immediate

The following diagram illustrates the differences between the encoding of addi in the base ISA (top, 32-bit wide) and in the C extension (bottom, 16-bit wide): the immediate has been reduced from 12 to 6 bits, the same register index is used for both source and destination, and the operation is encoded on 5 bits rather than 10. One thing to note is that the register index is still 5-bit wide: all the registers can be addressed in this format. For c.addi, as for other instances of the CI-type (e.g. the compressed load word from stack pointer c.lwsp), the destination must not be x0 (that encoding is reserved). For c.addi, the sign-extended immediate must also be non-zero.
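
The CI-type layout described above (funct3, then imm[5], then the 5-bit rd field, then imm[4:0], then the 2-bit opcode) can be sketched as a small encoder/decoder. This is an illustrative sketch of the bit positions, not production tooling.

```python
# c.addi bit layout (CI-type): bits 15:13 funct3=000, bit 12 imm[5],
# bits 11:7 rd, bits 6:2 imm[4:0], bits 1:0 opcode=01.

def encode_c_addi(rd: int, imm: int) -> int:
    assert 1 <= rd <= 31, "rd must be non-zero (rd=0 encodings are reserved)"
    assert imm != 0 and -32 <= imm <= 31, "non-zero 6-bit signed immediate"
    imm6 = imm & 0x3F  # 6-bit two's complement
    return (0b000 << 13) | ((imm6 >> 5) << 12) | (rd << 7) | ((imm6 & 0x1F) << 2) | 0b01

def decode_c_addi(word: int):
    rd = (word >> 7) & 0x1F
    imm6 = (((word >> 12) & 1) << 5) | ((word >> 2) & 0x1F)
    imm = imm6 - 64 if imm6 & 0x20 else imm6  # sign-extend the 6-bit field
    return rd, imm

word = encode_c_addi(10, -4)  # c.addi a0, -4
assert decode_c_addi(word) == (10, -4)
```

The two assertions in the encoder capture the reserved encodings mentioned above: rd=0 and a zero immediate are not legal c.addi instances.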


Obviously c.addi is less expressive than addi: smaller immediate range, and source and destination have to be the same register. This affects all compressed instructions: they constitute a subset of their expanded counterparts, ideally the most useful one.

CA-type: compressed arithmetic instructions

Another compressed format is the CA-type (compressed arithmetic) illustrated by the following diagram:


In the CA-type, one of the source register indices (rs1) is fused with the destination (rd), and the register indices are now 3-bit wide. Furthermore they do not directly encode a register index: this is why they are labelled rd', rs1', and rs2' (rather than rd, rs1, and rs2). The 8 encodable registers were selected because they are the most frequently used ones (the specification notes that the RISC-V ABI was even changed to ensure the 8 registers mapped in RVC were among the most used). The following table (copied from RISC-V unprivileged specification chapter 16, section 16.2 on the C extension instruction formats, page 100) provides the mapping between the 8 possible encoded register indices and the actual general purpose or floating-point registers, alongside the ABI register names: for example 0b100 encodes x12 in compressed instructions working on general purpose / integer registers.
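
The mapping itself is a simple offset: the 3-bit field selects among x8-x15 (or f8-f15 for floating-point operands). A minimal sketch, with the integer ABI names listed for reference:

```python
# 3-bit RVC register field -> architectural register index: x(8 + encoded).
# ABI names of x8..x15 for the integer register file.
RVC_ABI_NAMES = ["s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5"]

def rvc_reg(encoded: int) -> int:
    """Expand a 3-bit compressed register field to the x-register index."""
    assert 0 <= encoded <= 0b111
    return 8 + encoded

assert rvc_reg(0b100) == 12     # 0b100 encodes x12, as noted above
print(RVC_ABI_NAMES[0b100])     # a2, the ABI name of x12
```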


Both previous examples of RVC encodings illustrate that compressed instructions are less expressive than their base counterparts and consume a larger share of the opcode space. This is the price to pay for shorter code, and it is balanced by the fact that they are more heavily used than the expanded versions, providing an overall code size reduction.

Compressed load and stores

The C extension defines 2 types of compressed loads and stores.
The first type uses a memory address based on the stack pointer (x2): c.[f]lwsp, c.[f]ldsp, c.lqsp. These instructions use a 6-bit offset which is scaled by the data size and added to the base address read from the stack pointer register. The destination register index is encoded on 5 bits; the encoding with rd=0 is reserved. Since the base address register is always x2, there is no need to encode it in the instruction word. There is no floating-point version of the quad-word load. The store counterparts exist: c.[f]swsp, c.[f]sdsp, c.sqsp.

The second type of compressed load/store uses a register-based address: 2 registers are encoded, the base address rs1' and the load destination rd' (respectively, the store source rs2'). Each of those 2 registers is encoded on 3 bits and uses the reduced RVC register number encoding listed previously.

Both types of compressed loads and stores use a specific immediate encoding where the bit order might look a bit scrambled. According to the specification, the split of the immediate into various bit fields and their ordering were chosen so that most immediate bits sit in the same location across various families of instructions.

The following diagram illustrates the difference between the encodings of lw rd, offset(x2) (I-type, top part) and c.lwsp rd, offset(sp) (CI-type, bottom part). In the compressed encoding, a 6-bit offset encodes a word offset: it is implicitly multiplied by 4 before being added to the stack pointer sp/x2. Offset bit 5 is stored in instruction bit 12, as for other CI-type instructions, while the lower 5-bit immediate field stores offset[4:2] as its 3 most significant bits and offset[7:6] as its 2 least significant bits. rd cannot be x0 in the compressed instruction encoding.
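
The scrambled immediate can be made concrete with a small round-trip sketch of the c.lwsp offset bits as described above (instruction bit 12 holds offset[5], bits 6:4 hold offset[4:2], bits 3:2 hold offset[7:6]); this is illustrative only and leaves all non-immediate fields at zero.

```python
def c_lwsp_offset(word: int) -> int:
    """Recover the byte offset (already scaled by 4) from a c.lwsp word."""
    off = ((word >> 12) & 0x1) << 5   # offset[5] from instruction bit 12
    off |= ((word >> 4) & 0x7) << 2   # offset[4:2] from instruction bits 6:4
    off |= ((word >> 2) & 0x3) << 6   # offset[7:6] from instruction bits 3:2
    return off

def place_offset(offset: int) -> int:
    """Place an offset's bits at the c.lwsp immediate positions (other fields 0)."""
    assert offset % 4 == 0 and 0 <= offset < 256
    return (((offset >> 5) & 0x1) << 12) \
         | (((offset >> 2) & 0x7) << 4) \
         | (((offset >> 6) & 0x3) << 2)

for offset in (0, 4, 60, 252):
    assert c_lwsp_offset(place_offset(offset)) == offset
```

Despite the scrambling, the reachable offsets are simply all multiples of 4 from 0 to 252.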



Control flow instructions: compressed jump and branch

The compressed extension contains 6 control transfer instructions: 2 unconditional jumps with an immediate offset (added to pc), c.j and c.jal (the latter RV32 only); 2 unconditional jumps with register targets, c.jr and c.jalr; and 2 branches, c.beqz and c.bnez. Unconditional jumps only exist with an immediate offset or a direct register target, and the branches only exist with an immediate branch offset and a comparison to zero.

Example of compressed assembly snippet

Let us now consider an illustration of the impact of the C extension on a toy program: a vector add function.

Using godbolt.org compiler explorer (a wonderful tool) we can compile a RV64 program assuming no support for C extension, or assuming it is supported: compiler explorer result.

The example program is simply:

int vector_add(float* r, float* a, float* b, unsigned n) {
    unsigned i = 0;
    for (;i < n; ++i) r[i] = a[i] + b[i];
    return n;
}

The first assembly was compiled by clang 15.0.0 for RV64 with options: "-O2 -march=rv64imfd".

(it also contains a 2-instruction main function)

vector_add:
 beqz	a3,6b4 <vector_add+0x30>
 slli	a4,a3,0x20
 srli	a4,a4,0x20
 flw	ft0,0(a1)
 flw	ft1,0(a2)
 fadd.s	ft0,ft0,ft1
 fsw	ft0,0(a0)
 addi	a1,a1,4
 addi	a2,a2,4
 addi	a4,a4,-1
 addi	a0,a0,4
 bnez	a4,690 <vector_add+0xc>
 mv	a0,a3
 ret

The second assembly was compiled with the same compiler and options: "-O2 -march=rv64imfdc" (the only difference is the addition of the C extension).

vector_add:
 beqz	a3,6a6 <vector_add+0x22>
 slli	a4,a3,0x20
 srli	a4,a4,0x20
 flw	ft0,0(a1)
 flw	ft1,0(a2)
 fadd.s	ft0,ft0,ft1
 fsw	ft0,0(a0)
 addi	a1,a1,4
 addi	a2,a2,4
 addi	a4,a4,-1
 addi	a0,a0,4
 bnez	a4,68c <vector_add+0x8>
 mv	a0,a3
 ret

The two assembly listings look very similar (and they mostly are) but one can notice a few differences. Let's start with the similarities: the instruction mnemonics and the registers used are identical: there are as many instructions in both listings (14) and they use the same registers (same sources and destinations).

On the differences front: the immediate offsets differ (on beqz and bnez) because the instruction encoding lengths differ. Let's look at the second compilation result annotated with the program addresses (left, in hexadecimal) and the opcodes (in grey, above each instruction):


Out of the 14 instructions, 9 are encoded on 2 bytes. Without the C extension the code occupies 56 bytes, and only 38 with the C extension enabled, a 32% saving. The main loop (from address 0x68c to 0x6a4) contains 9 instructions, including 6 compressed ones. The asymptotic dynamic saving with compressed instructions is similar to the static code size saving.
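
The code-size arithmetic above, spelled out; it assumes only the counts quoted in the post (14 instructions, 9 of which compress from 4 to 2 bytes).

```python
total_insns = 14
compressed = 9

size_without_c = total_insns * 4                              # every insn is 4 bytes
size_with_c = (total_insns - compressed) * 4 + compressed * 2 # 5 x 4B + 9 x 2B
saving = 1 - size_with_c / size_without_c

assert size_without_c == 56
assert size_with_c == 38
print(f"{saving:.0%}")  # 32% static code size saving
```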

One may ask why the compressed version of flw (c.flw) has not been used by the compiler; the reason is simple: this instruction only exists for RV32FC, not for RV64FC.

Note also that the 2-byte alignment of branch targets can be witnessed here: the first branch beqz, when taken, branches to address 0x6a6, a 2-byte aligned but not 4-byte aligned address.

Conclusion

The C extension provides a reduced encoding for a subset of the base instructions. The subset was selected to cover the most used instructions in their most used form. The reduced encoding allows for code size improvements which benefit both low power and high performance implementations. 

For a more in-depth overview of the C extension we invite you to read Andrew Waterman's master's thesis, which contains many insights on the impact of compressed instructions on RISC-V programs (link to pdf).


RISC-V Vector Register Groups

The RISC-V Vector extension (RVV) defines and exploits the concept of vector register groups. This post clarifies what a register group is, how it is useful and the caveats associated with it.

A single vector instruction can operate on operands each consisting of more than one register, and can produce similarly extended results.

A group is a bundle of multiple registers. RVV 1.0 supports groups of 1, 2, 4 or 8 registers. The indices of registers in a group span a contiguous range and the lowest index must be a multiple of the group size. For example the vector registers [4, 5, 6, 7] form a valid group of size 4, but [5, 6] is not a valid group of size 2. The lowest index is used to specify the vector register group in an instruction opcode: e.g. when LMUL=4, v4 stands for v4v5v6v7. Thus the actual sizes of an instruction's operands (and destination) depend on the context (the vtype configuration, as we will see later): vadd.vv v4, v0, v12 may operate on 1-, 2- or 4-register wide groups depending on the context (8 is not a legal size here as neither v4 nor v12 is aligned on an index multiple of 8).
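
The alignment rule above reduces to a single modulo check, sketched here with the examples from the paragraph:

```python
def valid_group(base: int, lmul: int) -> bool:
    """A group of size lmul must start at an index that is a multiple of lmul."""
    assert lmul in (1, 2, 4, 8)
    return 0 <= base <= 31 and base % lmul == 0

assert valid_group(4, 4)       # v4v5v6v7 is a valid 4-register group
assert not valid_group(5, 2)   # [5, 6] is not a valid 2-register group
# In vadd.vv v4, v0, v12, LMUL=8 would be illegal: neither v4 nor v12
# is aligned on a multiple of 8.
assert not valid_group(4, 8) and not valid_group(12, 8)
```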

The size of a vector register group is called the vector length multiplier (a.k.a LMUL) in RVV specification: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#342-vector-register-grouping-vlmul20

For an operation, LMUL is set by the vlmul field of vtype (so it is usually modified through vset* instructions). The following RISC-V assembly sequence first sets LMUL to 8 and then executes a double precision vector addition.

	vsetvli	x1, zero, e64, m8, ta, mu
	vfadd.vv	v16, v8, v24

Since LMUL is 8, the sequence is equivalent to the following (assuming each instruction operates on a single-register wide group):

	vfadd.vv	v16, v8,  v24
	vfadd.vv	v17, v9,  v25
	vfadd.vv	v18, v10, v26
	vfadd.vv	v19, v11, v27
	vfadd.vv	v20, v12, v28
	vfadd.vv	v21, v13, v29
	vfadd.vv	v22, v14, v30
	vfadd.vv	v23, v15, v31
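
The expansion above follows a simple pattern, which can be generated mechanically (an illustrative sketch, not how hardware actually sequences the operation):

```python
def expand(mnemonic: str, vd: int, vs2: int, vs1: int, lmul: int):
    """Expand one group-wide op into lmul single-register ops on consecutive registers."""
    return [f"{mnemonic} v{vd + i}, v{vs2 + i}, v{vs1 + i}" for i in range(lmul)]

ops = expand("vfadd.vv", 16, 8, 24, 8)
assert ops[0] == "vfadd.vv v16, v8, v24"
assert ops[-1] == "vfadd.vv v23, v15, v31"
assert len(ops) == 8
```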

LMUL encoding in vtype

In RVV 1.0, vlmul can take 4 integral values: 0b000 (LMUL=1), 0b001 (LMUL=2), 0b010 (LMUL=4), 0b011 (LMUL=8); it can also take 3 fractional values: 0b111 (LMUL=1/2), 0b110 (LMUL=1/4), 0b101 (LMUL=1/8), but more on those in a future post.
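
The encodings just listed can be captured as a lookup table (note that the remaining value, 0b100, is reserved in RVV 1.0):

```python
from fractions import Fraction

# vlmul field value -> LMUL, per the list above; 0b100 is reserved.
VLMUL = {
    0b000: Fraction(1),
    0b001: Fraction(2),
    0b010: Fraction(4),
    0b011: Fraction(8),
    0b101: Fraction(1, 8),
    0b110: Fraction(1, 4),
    0b111: Fraction(1, 2),
}

assert VLMUL[0b011] == 8
assert VLMUL[0b111] == Fraction(1, 2)
assert 0b100 not in VLMUL  # reserved encoding
```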

LMUL vs EMUL

The actual length multiplier used for an operation's operand or destination is called EMUL, for effective LMUL. EMUL depends on vlmul, but also on the operation. For example EMUL=LMUL for any source or destination of vadd.vv, but EMUL=2*LMUL for both vd and vs2 of vwadd.wv (because of its widening characteristic). EMUL can never be larger than 8 (a vector register group cannot exceed 8 registers), so for widening instructions LMUL=8 is reserved: it is not supported in RVV 1.0.

Some operations which embed the effective element width (EEW) in their opcode define EMUL to keep the ratio equality EEW/EMUL=SEW/LMUL. For example if LMUL=1 and SEW=8-bit, then vle32 has EEW=32-bit and EMUL=EEW*LMUL/SEW=4; this also means the destination register group must be designated by a register whose index is a multiple of 4 (EMUL).
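
The ratio can be checked numerically; this sketch sticks to integral EMUL values (fractional LMUL/EMUL is left aside, as in the example above) and is an illustrative helper, not an official API.

```python
def emul(eew: int, sew: int, lmul: int) -> int:
    """EMUL from EEW/EMUL = SEW/LMUL, restricted to integral results."""
    assert (eew * lmul) % sew == 0, "fractional EMUL not handled in this sketch"
    value = eew * lmul // sew
    assert 1 <= value <= 8, "EMUL must be a legal register group size"
    return value

e = emul(eew=32, sew=8, lmul=1)   # the vle32 example from the text
assert e == 4
# The destination register index must then be a multiple of EMUL:
assert all(v % e == 0 for v in (0, 4, 8, 28))
```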

Since EMUL can differ between an instruction's operands and its results, partial or complete overlap between register groups is a possibility. RVV 1.0 authorizes 3 types of overlap between source and destination register groups. Those overlaps are specified at https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#52-vector-operands and illustrated by the figure below (e.g. the v0v1v2v3 register group in various source/destination EEW configurations).


Benefits of register grouping

There are several benefits associated with the availability of register groups:
  • even if architectural registers are "only" VLEN-bit wide, single instructions can perform operations on larger operands and produce larger results:
    • for widening operation results or narrowing operation sources, register groups enable the use of a full register source or the population of a full register destination
    • providing large input sets: for example for a vrgather the input data set contains VLMAX elements, where VLMAX=VLEN/SEW*LMUL increases with LMUL: one can build larger lookup tables by using larger register groups
    • allowing reuse of operands between extended registers (e.g. an immediate or scalar operand)
    • reducing code size: for data parallel sections with vector lengths larger than what fits in a single vector register, fewer instructions are required to operate on the same amount of data
    • reducing the number of loop iterations: since VLMAX increases with LMUL, fewer stripmining iterations are required when LMUL increases
  • exposing more opportunities to the micro-architecture:
    • even if a micro-architecture cannot operate directly on register groups larger than VLEN bits, it can still exploit the micro-op parallelism and chaining opportunities exposed by larger register groups.

Constraints induced by register grouping

One of the caveats of register grouping is that the number of architectural register groups decreases as the length multiplier grows. For example there are only 4 groups of size 8: v0, v8, v16, v24. This means that register allocation pressure can be a limiting factor when using larger register groups: it is harder to keep all the variables simultaneously alive in a sequence in registers, and memory spilling may be required.

This pressure is increased by the fact that v0 is the mask operand in RVV 1.0: when using masked operations with large register groups, v0 often prevents a program from using the first register group. For example, when LMUL=4, if one wants to use v0 as the mask input, then one cannot use the v0,v1,v2,v3 group. This group can still be exploited partially, with LMUL=1 (for example to build the v0 mask) or with LMUL=2 (v2,v3).

The availability of larger inputs may not be well supported by the implementation: for example a LMUL=8 input data set for a vrgather may not be supported with the same latency and/or throughput as a vrgather with LMUL=1, since each result element may originate from a larger pool. Thus, even if legal, the instruction configuration may not be efficient, but the implementation still needs to support it one way or another.

Conclusion

Vector register groups and the associated length multiplier are an interesting concept which can be leveraged thanks to the vector-length-agnostic character of RVV: software built around the vector length parameter can be agnostic of VLEN and thus easily extended to VLEN*LMUL-wide register groups. If you want to learn more about the RISC-V Vector extension you may consider reading our RVV in a Nutshell blog series, starting with part 1.


How to read a RISC-V Vector assembly instruction

In our five-and-a-half-post blog series, RVV in a Nutshell, we presented the basics of the RISC-V Vector extension (RVV 1.0), but even after this overview some aspects of this large extension can still seem difficult to apprehend. In this sub-series, we will review some of those aspects in more detail. For example, it can be difficult to interpret an RVV assembly instruction. Let's review the main components of such instructions and study various examples.

Overview

We will draw some generalities from a practical example: a masked integer vector-scalar addition, vadd.vx v12, v3, x4, v0.t.
The following diagram illustrates the 6 main components of any RVV assembly instruction. Most of those components have numerous variants and some of them are optional.



RVV assembly instruction field description

Mnemonic

The first component is the mnemonic, which describes the operation performed by the instruction (e.g. in our case vadd performs a vector add, while vcompress packs the elements selected by a mask into contiguous elements of the destination).

The mnemonic often describes the destination type: for example vadd is a vector add with a single-width destination, while vwadd is a widening vector add and vmadc is a vector addition with carry returning a mask of output carries.

Operand type(s)

The second component is the operand type(s). It describes on what type of operand(s) the operation is performed. The most common is .vv (vector-vector) which often means that the instruction admits at least two inputs and both are single-width vectors.

The list of various possibilities includes:

  • .vv operation between two (or three) single-width vectors, e.g. vmul.vv
  • .wv operation between two (or three) vectors, where vs1 is single-width while vs2 and vd are wide operands/destinations (EEW=2*SEW), e.g. vwadd.wv
  • .vx / .vf operation between one or multiple vector and a scalar (general purpose register: x or floating-point register f), e.g. vfadd.vf, the scalar operand is splat to build a vector operand
  • .wx / .wf: operation between a scalar and a wide second vector operand, e.g. vfwadd.wf
  • .vi operation between one or multiple vectors and an immediate. The immediate is often 5-bit wide, encoded in the rs1/vs1 field, e.g. vadd.vi
  • .vs: operation between a vector and a single element (scalar) contained in a vector register, e.g. vredsum.vs (the single scalar is used to carry the reduction accumulator)
  • .vm: operation between a vector and a mask, e.g. vcompress.vm
  •  .v: operation with a single vector input, e.g. vmv2r.v; may also be used for vector loads (which have scalar operands for the address on top of a single data vector operand: vle16.v)
  • .vvm / .vxm / .vim operation vector-vector / vector-scalar / vector-immediate with a mask operand (e.g. vadc.vvm, addition between two vectors with an extra mask operand constituting an input carry vector)

Conversions constitute a category of their own for operand types, because the mnemonic suffix describes the destination format, the source format, and the type of operand. For example vfcvt.x.f.v is a vector (.v) conversion from floating-point elements (.f) to signed integer (.x) result elements. .xu indicates unsigned integers, and .rtz indicates a static round-towards-zero rounding mode.
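
Reading such a mnemonic is just a matter of splitting on the dots and interpreting the suffixes per instruction family. A hypothetical helper (not an official API) sketching that:

```python
def split_mnemonic(mnemonic: str):
    """Split an RVV mnemonic into its base name and dot-separated suffixes."""
    name, *suffixes = mnemonic.split(".")
    return name, suffixes

assert split_mnemonic("vadd.vx") == ("vadd", ["vx"])
# For conversions the suffixes read: destination format, source format,
# operand type.
assert split_mnemonic("vfcvt.x.f.v") == ("vfcvt", ["x", "f", "v"])
```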


Destination and source(s)

In the assembly instruction, the destination and sources follow the mnemonic. The destination is the first register to appear, followed by one or multiple sources.
Each of those elements encodes a register group. The destination and source register groups are represented by the first register in the group (for example if LMUL=4, then v12 represents the 4-wide register group v12v13v14v15). Thus the actual register group depends on the assembly opcode but also on the value of vtype: it is context sensitive. Most RVV operations have a vector destination, denoted by vd; some have a scalar destination (e.g. vmv.x.s with an x register destination, or vfmv.f.s with an f register destination) and others have a memory destination, such as the vector stores, e.g. vse32.v.

There can be one or two sources: vs2 and vs1 for vector-vector instructions. If the operation admits a scalar or immediate operand, then vs1 is replaced by rs1 (respectively imm), e.g. vfadd.vf v4, v2, ft3. Vector loads have a memory source, e.g. the indexed load vloxei8.v vd, (rs1), vs2 [, vm], which has a scalar register as the address base and a vector register vs2 as the index source.

RVV also defines 3-operand instructions, e.g. vmacc.vv. For those operations the register vd is both a source and a destination: the operation is destructive, one of the source operands being overwritten by the result.

Mask operand

Most RVV operations can be masked: in that case the v0 register is used as a mask to determine which elements are active; inactive elements of the result either keep the old destination value or are filled with a pre-determined pattern (depending on the mask policy). RVV 1.0 only supports true bits as active masks: an element is considered active if the bit at the corresponding index is set to 1 in v0, and inactive if it is 0. This is what is encoded by the last operand of our example: v0.t (for v0 "true"). If this last operand is missing, then the operation is unmasked (all body elements are considered active).
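
The masking behaviour can be sketched element-wise; this toy model assumes the mask-undisturbed policy, where inactive elements keep the old destination value:

```python
def masked_add(vd_old, vs2, vs1, mask):
    """Toy model: active elements (mask bit 1) get a + b, inactive keep old vd."""
    return [a + b if m else old
            for old, a, b, m in zip(vd_old, vs2, vs1, mask)]

old = [0, 0, 0, 0]
result = masked_add(old, [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0])
assert result == [11, 0, 33, 0]  # elements 1 and 3 are inactive
```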

More information can be found in this post of the original series: RVV in a nutshell (part 3): operations with and on masks.

Conclusion

We hope this post has shed some light on the syntax of RISC-V Vector assembly instructions. We will review other concepts related to the vector extension in future posts.


RISC-V Register Files

The RISC-V ISA defines several register files. There are at least 3 in the main set of extensions: the general purpose register file (XRF) introduced in the base integer extensions, the floating-point register file (FRF) introduced in the floating-point extensions, and the vector register file (you guessed it, introduced in the vector extension, a.k.a. RVV). We will not consider the control and status registers (CSR file) which have their own specificities (it is more common to split system registers from general purpose registers, although this commonality could be debated).

Diagram of RISC-V register files and the operations between them


Register files characteristics

Each register file contains 32 architectural registers. The first register of the general purpose register file (x0) is a bit specific since its value is a hardwired constant: 0. The first register of the vector register file (v0) is the only one which can be used as the mask operand in RVV 1.0 (more information on RVV masked operation can be found in RISC-V Vector Extension in a Nutshell: part 3). 

The size of the registers in each file is an architecture parameter: XLEN for the general purpose register file, FLEN for the floating-point register file and VLEN for the vector register file. 

The general purpose register file is sized to hold a virtual address. For example in the base 32-bit RV32I ISA the general purpose registers are 32-bit wide, while they are 64-bit wide in RV64I (and 128-bit wide in RV128I, although this architecture is seldom used). For the vector registers, VLEN must be a power of 2, greater than or equal to ELEN (the maximum element width supported by the implementation), and must not exceed 65536 (2^16).
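
The VLEN constraints just listed can be expressed as a quick validity check (an illustrative sketch, not part of any official tooling):

```python
def valid_vlen(vlen: int, elen: int) -> bool:
    """VLEN must be a power of two, >= ELEN, and <= 2**16."""
    is_pow2 = vlen > 0 and (vlen & (vlen - 1)) == 0
    return is_pow2 and elen <= vlen <= 2**16

assert valid_vlen(128, elen=64)
assert not valid_vlen(96, elen=64)      # not a power of two
assert not valid_vlen(2**17, elen=64)   # exceeds 65536
```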

Moving data between register files

The diagram at the beginning of this post illustrates the base characteristics of each register file and the basic move operations between them and from/to memory. For the XRF and FRF, only operations on 32-bit values are drawn. Similar operations exist for double precision, e.g. in RV64D where XLEN=FLEN=64 bits, which adds fld, fsd, fmv.d.x, fmv.x.d and numerous conversions from integer formats to double precision and back (e.g. in RV64IFD, fcvt.d.wu f3, x2 converts the unsigned 32-bit integer in the bottom 32 bits of the x2 general purpose register to a double precision number stored in the f3 floating-point register).

For the vector register file, the scalar data size and the vector element size are not encoded as part of the opcode but are configured in the vsew field of the vtype configuration register, so there is no need for type-specific vector moves. The diagram only represents explicit data moves between FRF/XRF and VRF, but most vector instructions admit a vector-scalar variant which reads one of its operands directly from the XRF or FRF (e.g. vfmadd.vf splats a scalar floating-point register as the multiplier). For more details on the vector registers and the vector extension in general you can refer to the series RISC-V Vector extension in a Nutshell published on this blog.

Why multiple register files ?

This discussion focuses on the general purpose and floating-point register files. Having a separate vector register file is easier to justify (vector registers tend to be larger than the other types of registers). Some ISAs (e.g. x86 SSE and AVX extensions) reuse small registers as the low parts of larger registers; this is not the case in RISC-V, where general purpose, floating-point and vector registers do not overlap.

NOTE: The option of overlapping FRF and VRF was considered and dropped during the specification process of RVV. See https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#51-scalar-operands

Having multiple register files has some advantages and some drawbacks. Let's start with a benefit.

The first benefit is that the architecture can expose more architectural registers without extending the size of a register index in the instruction encoding. The type of instruction (e.g. floating-point addition) is encoded in the opcode, and all (well, most) floating-point instructions implicitly operate on floating-point registers, so there is no need to distinguish floating-point from general purpose registers in the encoding of the register index itself. The number of available architectural registers impacts the register allocation pressure when writing (in assembly) or compiling a program: more registers often mean a more flexible ABI and less spilling.

In general, RISC-V uses 5 bits to encode a register index (for operands and results), which provides 32 registers. Since the type of register is part of the opcode rather than of the register index, the RISC-V architecture can in fact have 96 registers: 32 general purpose registers, 32 floating-point registers and 32 vector registers (you are authorized to say only 95 and exclude the specific x0, although having a hardwired 0-value operand is quite useful and certainly outweighs the benefit of an extra general purpose register in most cases).

This benefit can also be considered a drawback, since more registers also mean higher hardware cost for implementations and more state to save on a context switch. RISC-V addresses this with the Zfinx/Zdinx/Zhinx extensions, which define floating-point operations working on general purpose registers, thus limiting the cost of floating-point support for the most constrained implementations (high performance out-of-order implementations with very large physical register files are generally less concerned by limiting the size of the architectural register files, although it can impact mapping tables, context sizes, ...).

Another benefit of separate register files is that each register can use an encoding optimized for the data it stores: for example the RISC-V open-source processor rocket-chip uses hardfloat's specific recoded format for floating-point numbers (see the Recoded Format section here) to make floating-point operations more efficient (simplifying the detection of special values and reducing the encoding difference between normal and subnormal numbers). The use of a format-specific encoding is facilitated by the fact that data moves are explicit (you need to execute a fmv.w.x to move the content of a general purpose register to a floating-point register before performing any operation on it): the recoding can be performed during the explicit data move (including from/to memory) and can exploit the fact that the value type is determined by the operation acting on it. Such a recoding can only live within an internal register: values must be converted back to canonical formats when moved to another register file or to memory.

Once again this can also be considered a drawback, since you have to explicitly move data from one register file to another (which may have a non-zero latency and consumes encoding space) and you need to define separate memory operations for each register file: loading a 32-bit single precision number is not the same operation as loading a 32-bit integer value, since the destination registers differ (see the diagram and the section Moving data between register files). This drawback is more easily alleviated by wide out-of-order implementations, which can extract more ILP and cover the cost of those explicit moves (although this cost still impacts latency-dominated chains of instructions). It also disappears in the Z(f/d/h)inx extensions.

Another advantage of multiple register files is that each file's characteristics can be tuned to the domain it supports. RISC-V does not expose a configurable number of registers, but XLEN, FLEN and VLEN are defined separately (the first two depending on which extensions are enabled): you can have a 32-bit ISA with 64-bit double precision registers. The architecture does not have to extend its integer registers to 64 bits, while still offering 64 registers with adapted sizes: general purpose/integer instructions can be efficient and low power, while the core only activates the FRF for workloads which require the extra activity.

Finally, an implementation advantage, which was pointed out by a colleague (Alex S.): having several register files reduces the number of read/write ports per register file while still serving as many execution pipelines, and more efficiently. High performance cores often have a lot of ports on each register file (more than ten read ports may not be uncommon) to serve many different execution pipelines in parallel. For the same total number of execution pipelines, multiplying the number of architectural register files helps to split the physical register files accordingly, decreasing the number of ports per file. This is a real benefit, as the complexity of a register file increases rapidly with the number of ports (in particular read ports). Even for low performance implementations, limiting the number of ports per file provides more efficient register files.

Conclusion

In this post we reviewed the main register files specified by the RISC-V ISA, their basic characteristics and how they interact with each other. We listed some of the reasons for this design choice alongside some of the drawbacks of specializing register files.

Initially published Oct 17th 2022, updated Oct 19th 2022.

Thanks

Thank you to Alex S. for pointing out a key advantage (decreasing the number of ports per register file) I missed in the first version of this blog post.



New option for RISC-V Vector performance simulation

There has been some news since I published Performance simulation of RISC-V Vector:

 Rivos Inc (a recent contender in the RISC-V race) has been working on extending GEM5 to support RVV.

More info is available in this post on the gem5-dev mailing list, and the source code is accessible on Rivos's github page: https://github.com/rivosinc/gem5/commits/rivos/dev/joy/initial_RVV_support.

It seems their port is not directly related to the Cristobal Ramirez / PLCT effort, but you should still be able to follow the directions given in my initial article.



Performance simulation of RISC-V vector extension using GEM5

We will be using a fork of Cristobal Ramirez Lazo's gem5 fork, namely https://github.com/plctlab/plct-gem5; this fork is quite active and has been updated to RVV 1.0. The initial fork extends GEM5 with RVV support and a vector processing unit extension (plus its GEM5 configuration).

Building gem5 for RISC-V

gem5 has some external dependencies which must be installed before the build:

# on ubuntu 20.04
sudo apt install build-essential git m4 scons zlib1g zlib1g-dev \
    libprotobuf-dev protobuf-compiler libprotoc-dev libgoogle-perftools-dev \
    python3-dev python3-six python-is-python3 libboost-all-dev pkg-config

More Information on building GEM5 can be found here: https://www.gem5.org/documentation/general_docs/building .
Then the actual build can be executed:

git clone https://github.com/plctlab/plct-gem5.git
cd plct-gem5
# the -j option sets the number of parallel jobs for the build;
# it can be tuned to your machine.
scons build/RISCV/gem5.opt -j3  --gold-linker

Execution

As described in the project README.md, executing a program is straightforward:
${GEM5_DIR}build/RISCV/gem5.opt ${GEM5_DIR}configs/example/riscv_vector_engine.py \
                                --cmd="$program $program_args"

Note

This fork is still under development and some instructions are not supported yet.


Programming with RISC-V Vector extension: how to build and execute a basic RVV test program (emulation/simulation)

Update(s):

- Jan 16th 2022 adding section on Objdump


RISC-V Vector Extension (RVV) has recently been ratified in its version 1.0 (announcement and specification pdf). The 1.0 milestone is key: it means RVV has reached a stable state. Numerous commercial and free implementations of the standard are appearing, and software developers can now dedicate significant effort to porting and developing libraries on top of RVV without fear of having the specification rug pulled out from under their feet. In this article we will review how to build a version of the clang compiler compatible with RVV (v0.10) and how to develop, build and execute our first RVV program.

Building the compiler

Before building a compiler for RVV we need a basic riscv toolchain. This toolchain will provide the standard library and some basic tools required to build a functioning binary. The toolchain will be installed under $RISCV (feel free to adapt this directory to your setup).

# update to your intended install directory
export RISCV=~/RISCV-TOOLS/

# downloading the basic riscv gnu toolchain, providing:
# - runtime environment for riscv64-unknown-elf (libc, ...)
# - spike simulator
git clone https://github.com/riscv-collab/riscv-gnu-toolchain
cd riscv-gnu-toolchain
./configure --prefix=$RISCV
make -j3

Compiling for RVV requires a recent version of clang (this was tested with clang 14).

# downloading llvm-project source from github
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
# configuring the build to build llvm and clang in Release mode
# using ninja
# and to use gold as the linker (less RAM required)
# limiting targets to RISCV, and using riscv-gnu-toolchain
# as the basis for the sysroot
cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;" \
      -DCMAKE_BUILD_TYPE=Release \
      -DDEFAULT_SYSROOT="$RISCV/riscv64-unknown-elf/" \
      -DGCC_INSTALL_PREFIX="$RISCV" \
      -S llvm -B build-riscv/ -DLLVM_TARGETS_TO_BUILD="RISCV"
# building clang/llvm using 4 jobs (can be tuned to your machine)
cmake --build build-riscv/ -j4


Building clang/llvm requires a large amount of RAM (8GB seems to be the bare minimum, 16GB is better) and will consume a lot of disk space. Those requirements can be reduced by selecting the Release build type (rather than the default Debug) and by using the gold linker.

This process will generate the clang binary in llvm-project/build-riscv/bin/clang .

More information on how to download and build clang/llvm can be found on the project github page.
Development

The easiest way to develop software directly for RVV is to rely on the RVV intrinsics. This project offers intrinsics for most of the instructions of the extension. The documentation is accessible on github and support is appearing in standard compilers (most notably clang/llvm).

As a first exercise, we will use the SAXPY example from rvv-intrinsic-doc rvv_saxpy.c.
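For readers unfamiliar with the kernel: SAXPY computes y = a*x + y element-wise. The following minimal scalar C reference is my own sketch (not the rvv_saxpy.c file itself); the intrinsic version performs the same computation but strip-mines the loop into vl-element chunks selected by the hardware.

```c
#include <stddef.h>

/* Scalar SAXPY reference: y[i] = a * x[i] + y[i].
 * The RVV intrinsic version replaces this element-at-a-time
 * loop with vector loads, a fused multiply-add and vector
 * stores over vl elements per iteration. */
void saxpy_ref(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

A vectorized implementation should produce results bit-identical to this loop for the same rounding mode (up to the use of fused multiply-add, which avoids the intermediate rounding of the separate multiply).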

Building the simulator

Let's first build an up-to-date proxy kernel pk:

# downloading and installing the proxy kernel for riscv64-unknown-elf
git clone https://github.com/riscv-software-src/riscv-pk.git
cd riscv-pk
mkdir build && cd build
../configure --prefix=$RISCV --host=riscv64-unknown-elf
make -j4 && make install

Let's now build the spike simulator (directly from the top of the master branch, why not!):

git clone https://github.com/riscv-software-src/riscv-isa-sim
cd riscv-isa-sim
mkdir build && cd build
../configure --prefix=$RISCV
make -j4 && make install

Building the program

RVV is supported as part of the experimental extensions of clang. Thus it must be enabled explicitly when executing clang, and it must be associated with a version number; the current master of clang only supports v0.10 of the RVV specification.

clang -L $RISCV/riscv64-unknown-elf/lib/ --gcc-toolchain=$RISCV/ \
       rvv_saxpy.c -menable-experimental-extensions -march=rv64gcv0p10 \
      -target riscv64 -O3 -mllvm --riscv-v-vector-bits-min=256 \
       -o test-riscv-clang

Executing

To execute the program we are going to use the spike simulator and the riscv-pk proxy kernel.

Spike is part of the riscv-gnu-toolchain available at https://github.com/riscv-collab/riscv-gnu-toolchain ; riscv-pk is also available on github: https://github.com/riscv-software-src/riscv-pk .

The binary image of pk must be passed to spike as the first unnamed argument, before the main ELF.

$RISCV/bin/spike --isa rv64gcv $RISCV/riscv64-unknown-elf/bin/pk \
                  test-riscv-clang

NOTE: I tried to use riscv-tools (https://github.com/riscv-software-src/riscv-tools) but it does not seem actively maintained and several issues popped up when I tried building it.

Objdump

Not all objdump builds support the RISC-V vector extension. If you have built llvm as indicated above, you should be able to use the llvm-objdump program built alongside it to disassemble a program containing vector instructions.

llvm-objdump -d --mattr=+experimental-v <binary_file>

References


Assisted assembly development for RISC-V RV32

In this post we will present how the assembly development environment tool (asmde) can ease assembly program development for the RISC-V ISA.

You will develop a basic floating-point vector add routine.

Introducing ASMDE

The ASseMbly Development Environment (asmde, https://github.com/nibrunie/asmde) is an open-source set of python utilities to help the assembly developer. The main eponymous utility, asmde, is a register allocation script. It consumes a templatized assembly source file and fills in variable names with legal registers, removing the burden of register allocation from the developer.

Recently, alpha support for RV32 (the 32-bit version of RISC-V) was added to asmde. We are going to demonstrate how to use it in this post.

Vector-Add testbench

The example we chose to implement is a basic vector add.

/** Basic single-precision vector add
 *  @param dst destination array
 *  @param lhs left-hand side operand array
 *  @param rhs right-hand side operand array
 *  @param n vector size
 */
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n);

The program is split into two files:

- a test bench test_vadd.c

- an asmde template file vec_add.template.S

Review of the assembly template

The listing below presents the input template. It consists of a basic assembly source file extended with some asmde-specific constructs.

// testing for basic RISC-V RV32I program
// void vector_add(float* dst, float* src0, float* src1, unsigned n)
//#PREDEFINED(a0, a1, a2, a3)
        .option nopic
        .attribute arch, "rv32i2p0_m2p0_a2p0_f2p0_d2p0"
        .attribute unaligned_access, 0
        .attribute stack_align, 16
        .text
        .align  1
        .globl  my_vadd
        .type   my_vadd, @function
my_vadd:
        // check for early exit condition n == 0
        beq a3, x0, end
loop:
        // load inputs
        flw F(LHS), 0(a1)
        flw F(RHS), 0(a2)
        // operation
        fadd.s F(ACC), F(LHS), F(RHS)
        // store result
        fsw F(ACC), 0(a0)
        // update addresses
        addi a1, a1, 4
        addi a2, a2, 4
        addi a0, a0, 4
        // update loop count
        addi a3, a3, -1
        // branch if not finished
        bne x0, a3, loop
end:
        ret
        .size   my_vadd, .-my_vadd
        .section        .rodata.str1.8,"aMS",@progbits,1


ASMDE Macro

The opening comments are followed by an asmde macro, PREDEFINED.
This macro indicates to the asmde allocator that the registers listed as its arguments should be considered live when entering the function. It is often used to list function arguments.


ASMDE Variable

The second construct provided by asmde is the assembly variable.
                flw F(LHS), 0(a1)
                flw F(RHS), 0(a2)
                // operation
                fadd.s F(ACC), F(LHS), F(RHS)
                // store result
                fsw F(ACC), 0(a0)

Those variables are of the form <specifier>(<varname>). In this example we use the specifier F for floating-point register variables; the specifiers X or I can be used for integer registers. These variables are used to manipulate (write to / read from) virtual registers. asmde will perform the register allocation, taking into account the instruction semantics and the program structure.
Here for example, we use the F(LHS) variable to load an element of the left-hand side vector, F(RHS) to load elements from the right-hand side vector, and F(ACC) to receive the sum of those two variables, which is later stored back into the destination array.


Assembly template translation

asmde can be invoked as follows to generate an assembly file with assigned registers:

python3 asmde.py -S --arch rv32 \
                 examples/riscv/test_rv32_vadd.S \
                --output vadd.S

Building and executing the test program

We can build our toy example alongside a small testbench:
#include <stdio.h>

#ifdef LOCAL_IMPLEMENTATION
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n){
    unsigned i;
    for (i = 0; i < n; ++i)
        dst[i] = lhs[i] + rhs[i];
}
#else
void my_vadd(float* dst, float* lhs, float* rhs, unsigned n);
#endif


int main() {
    float dst[4];
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {4.0f, 3.0f, 2.0f, 1.0f};
    my_vadd(dst, a, b, 4);

    int i;
    for (i = 0; i < 4; ++i) {
        if (dst[i] != 5.0f) {
            printf("failure\n");
            return -1;
        }
    }

    printf("success\n");
    return 0;
}

And finally execute it.

(this requires an rv32 gnu toolchain and a 32-bit proxy kernel pk)

# building test program
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -o test_vadd vadd.S test_vadd.c
# executing binary
$ spike --isa=RV32gc riscv32-unknown-elf/bin/pk  ./test_vadd


Conclusion

I hope this small example was useful to you and that you will be able to use asmde in your own project.
If you find issues (there are many), you can report them on github: https://github.com/nibrunie/asmde/issues/new/choose . If you have some feedback, do not hesitate to write a comment here.

Happy hacking with RISC-V.


References:

- asmde github page: https://github.com/nibrunie/asmde

- RISC-V unprivileged ISA specification

- GNU Toolchain for RISC-V

- Programming with RISC-V vector instructions