RISC-V Vector Register Groups

 The RISC-V Vector Extension (RVV) defines and exploits the concept of vector register groups. This post clarifies what a register group is, how it is useful and the caveat associated with it.

A single vector instruction can operate on operands each consisting of more than one register and can  produce similarly extended results.

 A group is a bundle of multiple registers. RVV 1.0 supports groups of 1, 2, 4 or 8 registers. The indices of registers in a group span a contiguous range and the lowest index must be a multiple of the group size. For example the vector registers [4, 5, 6, 7] form a valid group for size 4, but [5, 6] is not a valid group for size 2. The lowest index is used to specify the vector register group in an instruction opcode: e.g. when LMUL=4, v4 stands for v4v5v6v7. Thus the actual sizes of the operands (and destination) of an instruction depends on the context (the vtype configuration as we will see later): vadd.vv v4, v0, v12 may operate on 1, 2 or 4 wide register groups depending on the context (8 is not a legal size as neither v4 nor v12 are aligned on an index multiple of 8).

The size of a vector register group is called the vector length multiplier (a.k.a LMUL) in RVV specification: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#342-vector-register-grouping-vlmul20

For an operation, LMUL is set by the vlmul field in vtype (so it is usually modified through vset instructions). The following RISC-V assembly sequence first defines LMUL as 8 and then executes a double precision vector addition.

	vsetvli	x1, zero, e64, m8, ta, mu
	vfadd.vv	v16, v8, v24

Since the LMUL value is 8, the actual sequence of operation is equivalent to the following (assuming each instruction operates on a single-register wide group):

	vfadd.vv	v16, v8,  v24
	vfadd.vv	v17, v9,  v25
	vfadd.vv	v18, v10, v26
	vfadd.vv	v19, v11, v27
	vfadd.vv	v20, v12, v28
	vfadd.vv	v21, v13, v29
	vfadd.vv	v22, v14, v30
	vfadd.vv	v23, v15, v31

LMUL encoding in vtype

In RVV 1.0, vlmul can take 4 values for integral LMUL: 0b000 (LMUL=1), 0b001 (LMUL=2), 0b010 (LMUL=4), 0b011 (LMUL=8), it can also take 3 fractional values: 0b111 (LMUL=1/2), 0b110 (LMUL=1/4), 0b101 (LMUL=1/8), but more on that in a future post.

LMUL vs EMUL

The actual length multiplier used for an operation operand or destination is called EMUL, for effetctive LMUL. EMUL depends on vlmul, but also on the operation. For example EMUL=LMUL for any source or destination of a vadd.vv, but EMUL=2*LMUL for both vd and vs2 of vwadd.wv (because of its widening characteristic). EMUL can never be larger than 8 (it is not possible for a vector register group to exceed 8 registers) so LMUL=8 is reserved for widening instructions: it is not supported in RVV 1.0.

Some operations which embed the effective element width (EEW) in their opcode define EMUL to keep the following ratio equality EEW/EMUL=SEW/LMUL. For example if LMUL=1, SEW=8-bit, then vle32 admits EEW=32-bit and EMUL=EEW*LMUL/SEW=4, this also mean the register group destination will have to be described by a register whose index is a multiple of 4 (EMUL).

Since EMUL can differ between an instructions operands and it results, partial or complete overlap between register groups are a possibility. RVV 1.0 authorizes 3 types of overlap between source and destination register groups. Those overlaps are specified here https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#52-vector-operands and illustrated by the Figure below (e.g. of v0v1v2v3 register group in various source/destination EEW configurations).


Benefits of register grouping

There are several benefits associated to the availability of register groups:
  • even if architectural registers are "only" VLEN-bit wide, single instructions can perform operation on larger operands and produce larger results:
    • for widening operation results or narrowing operation source, having the possibility of building a register groups enables the use of a full register source or enables populating a full register destination
    • providing large input sets: for example for a vrgather the input data set contains VLMAX elements, where VLMAX=VLEN/SEW*LMUL increases with LMUL: one can build larger lookup table by using larger register groups.
    • allowing reuse of operands between extended registers (e.g. immediate or scalar operand)
    • reducing code size: for data parallel section with vector lengths larger than what can fit in a single vector register, less instructions are required to operate on the same amount of data. 
    • reducing number of loop iterations: since VLMAX increases with LMUL, during loop stripmining, less iterations are required when LMUL increases.
  • exposing more opportunity to micro-architecture
    • even if a micro-architecture can not operate directly on register groups larger than VLEN-bit wide, it can still exploit micro-op parallelism and chaining opportunities exposed by larger register groups.

Constraints induced by register grouping

One of the caveat of register grouping is that the number of architectural registers decreases with the length multiplier value. For example there are only 4 8-wide register groups: v0, v8, v16, v24. This means that register allocation pressure can be a limiting factor when using larger register groups: it makes it more difficult to maintain in register all variables simultaneously alive in a sequence and memory spilling may be required.

This pressure can even be augmented by the fact that v0 is used as the mask operand in RVV 1.0, this may also limit the number of register groups: when using masked operations with large register groups, v0 often prevents a program from using the first register group. For example, when LMUL=4, if one wants to use v0 as the mask input, then one cannot use the v0,v1,v2,v3 group. This group can still be partially exploited with LMUL=1 for example to build the v0 mask or partially for LMUL=2 (v2,v3).

The availability of larger inputs may not be well supported by the implementation: for example a LMUL=8 input data set for a vrgather may not be supported with the same latency and/or throughput as a vrgather with LMUL=1, since each result element may originate from a larger pool. Thus even if legal the instruction configuration may not be efficient but the implementation still needs to support it one way or the other.

Conclusion

Vector register groups and the associated length multiplier is an interesting concept which can be leveraged thanks to the vector length agnostic characteristic of RVV where the software build around the vector length parameter can be agnostic of VLEN and thus easily extended to VLEN*LMUL register groups. If you want to learn more on the RISC-V Vector Extension you may consider reading our RVV in a Nutshell blog series starting with part 1.

Resources




How to read a RISC-V Vector assembly instruction

  In our 5 and a half blog series, RVV in a Nutshell, we presented the basics of RISC-V Vector extension (RVV 1.0), but even after this overview some aspects of this large extension can still seem difficult to apprehend. In this sub-series, we will review in more details some of those aspects. For example, it is difficult to interpret a RVV assembly instruction. Let's review the main component of such instructions and study various examples.

Overview

We will draw some generalities from a practical example: a masked integer vector-scalar addition vadd.vx v12, v3, v4, v0.t.
The following diagram illustrates the 6 main components of any RVV assembly instruction. Most of those components have numerous variants and some of them are optional.



RVV assembly instruction field description

Mnemonic

The first component is the mnemonic which describes which operation should be performed by the instruction (e.g.  in our case vadd will perform a vector add while vcompress will compress a bitmask).

The mnemonic often describes the destination type: for example vadd is a vector add with a single-width destination, while vwadd is a widening vector add and vmadc is a vector addition with carry returning a mask of output carries.

Operand type(s)

The second component is the operand type(s). It describes on what type of operand(s) the operation is performed. The most common is .vv (vector-vector) which often means that the instruction admits at least two inputs and both are single-width vectors.

The list of various possibilities includes:

  • .vv operation between two (or three) single-width vectors, e.g. vmul.vv
  • .wv operation between two (or three) vectors, vs1 is single-width while vs2 and vd are wide operands/destinations (EEW=2*SEW), e.g. vwmacc.wv
  • .vx / .vf operation between one or multiple vector and a scalar (general purpose register: x or floating-point register f), e.g. vfadd.vf, the scalar operand is splat to build a vector operand
  • .wx / .wf: operation between a scalar and a wide second vector operand, e.g. vfwadd.wf
  • .vi operation between one or multiple vectors and an immediate. The immediate is often 5-bit wide encoded in the rs1/vs1 field, e.g. vsub.vi
  • .vs: operation between a vector and a single element (scalar) contained in a vector register, e.g. vredsum.vs (the single scalar is used to carry the reduction accumulator)
  • .vm: operation between a vector and a mask, e.g. vcompress.vm
  •  .v: operation with a single vector input, e.g. vmv2r.v, may also be used for vector loads (which have scalar operands for the address on top of a single data vector operand: vle16.v.
  • .vvm / .vxm / .vim operation vector-vector / vector-scalar / vector-immediate with a mask operand (e.g. vadc.vvm, addition between two vectors with an extra mask operand constituting an input carry vector)

The conversions constitutes a category of its own for the operand types, because the mnemonic suffix describes: the destination format, the source format, and the type of operand. For example vfcvt.x.f.v is a vector (.v) conversion from floating-point element (.f) to signed integer (.x) result elements. .xu is used to indicate unsigned integers, .rtz is used to indicate a static round-towards-zero rounding mode.


Destination and source(s)

In the assembly instruction, destination and sources follows the mnemonic. The destination is the first register to appears, followed by one or multiple sources.
Each of those element encodes a register group. The destination and source operands register groups are represented by the first register in the group (for example if LMUL=4, then v12 represents the 4-wide register group v12v13v14v15). Thus the actual register group depends on the assembly opcode but also on the value of vtype: it is context sensitive. Most RVV operations have a vector destination, denoted by vd, some may have a scalar destination (e.g. vmv.x.s with a x register destination or vfmv.f.s with a f register destination) and others have a memory destination such as the vector stores, e.g. vse32.v.

There can be one or two sources: vs2 and vs1 for vector-vector instructions. If the operations admits a scalar operand, or an immediate operand then vs1 is replaced by rs1 (respectively imm), e.g. vfadd.vf v4, v2, ft3. Vector loads have a memory source, e.g.  vloxei8.v vd, (rs1), vs2 [, vm] which has a scalar register as address source and a vector register as destination source.

RVV defines 3-operand instruction, e.g. vmacc.vv. For those operations the destination register vd is both a source and a destination: the operation is destructive: one of the source operand is going to be overwritten by the result.

Mask operand

Most RVV operation can be masked: in such case the v0 register is used as a mask to determine which elements are active and whose elements of the result will be copied from the old destination value or filled with a pre-determined pattern. RVV 1.0 only supports true bit as active masks: the element is considered active if the bit at the corresponding index is set to 1 in v0, and inactive if it is 0. This is what is encoded by the last operand of our example: v0.t (for v0 "true"). If this last operand is missing, then the operation is unmasked (all body elements are considered active).

More information can be found in this post of the original series: RVV in a nutshell (part 3): operations with and on masks.

Conclusion

We hope this post has shed some lights on the syntax of RISC-V Vector assembly instructions. We will review other concepts related to the vector extension in future posts.

Reference:

Support of half precision floating-point numbers in RISC-V ISA: Zfh and Zfhmin

Floating-point support in RISC-V ISA

RISC-V does not mandate the support of any floating-point operations as part of the base ISA (RV32I and RV64I) but single precision and double-precision extensions (e.g. RV64F and RV64D) were among the first to be specified. The F extension added 32 floating-point registers, floating-point arithmetic operations, moves and conversions instructions. The D extension (which requires the F extension) extended this set of operation to also support double precision. There also exists a Q extension for quad-precision. 

In reality, there are multiple F and D extensions: RV32F (to extend the 32-bit base ISA) and RV64F (for the 64-bit base ISA), similarly there are RV32D and RV64D. RV32D has the particularity of adding 64-bit wide floating-point registers when the associated general purpose registers are only 32-bit wide. This flexibility of RISC-V ISA is reviewed in the post RISC-V Register Files.

Initially, smaller floaitng-point formats, such as half precision, were not supported. 

Half precision used to be lagging behind in term of ISA and hardware support. It was only specified has a storage format in the 2008 revision of the IEEE-754 standard (the IEEE standard specifying floating-point formats and operations). The momentum of deep learning and convolutional neural networks has kick-started a renewed interest for small number formats and in particular half precision (among many others). 

Note: In the IEEE-754 2019 revision, half precision is still not defined has a basic format but as an interchange format. Although the standard is more permissive in terms of which formats can admit arithmetic operations.

RISC-V International (the association behind the RISC-V specification(s)) recently ratified two extensions specifying sets of instructions for half-precision support: Zfh and Zfhmin. The later being defined as a subset of the former, we will review Zfh first.

Zfh: full half-precision support

Zfh extends floating-point registers (FRF) to support half precision. It requires the F extension, thus no new register file, nor any register width extension is required since the floating-point registers are already wide enough to contain single precision values.
As other single-precision values in RV32D/RV64D, half precision values are stored in larger FLEN registers using RISC-V NaN boxing scheme (more on that in a future post). 

This new extension defines:
  • Instruction to move data (with or without conversions):
    • flh: load from memory into an F register
    • fsh: store to memory from an F register
    • fmv.h.x, fmv.x.h: bit pattern move between X and F register files (unmodified)
    • fcvt.h.(w/l)[u], fcvt.(w/l)[u].h: conversions  between X and F register files (from/to integer)
  • Arithmetic operations: fadd.h, fsub.h, fmul.h, fsqrt.h, fdiv.h, fmin.h, fmax.h, f(n)madd/f(n)msub.h
  • Floating-point comparisons: fcmp.h
  • Conversions between half precision and other floating-point formats
  • Miscellaneous: fclass.hfsgn(/n/x).h

All arithmetic operations operate on uniform I/Os: all operands are half precision values, and the output is an half precision result. They share their opcode with the corresponding F/D instructions: the fmt fields (bits 25 and 26) value 2b'10 (2) encodes half precision (value 2b'00 for single precision, 2b'01 for double, and 2b'11 for quad but who need such a precision !).


The following diagram represents the data move instructions:



The availability of move to/from and conversion with 64-bit operands is conditioned on the availability of RV64 (general purpose move and integer conversions) and D extension (floating-point conversions).
Although certainly anecdotal, the availability of the Q extension (floating-point quad precision, a.k.a 128-bit format): instructions for conversions from/to quad precision are defined: fcvt.h.q and fcvt.q.h.

Zfhmin: reduced half-precision support

The extension Zfhmin can be seen as a subset of Zfh. It only mandates support for half-precision load and store, bit pattern moves from/to integer register file (no conversions with integer formats) and conversions with other floating-point formats. It represents a total of 6 instructions (extended to 8 with conversion from/to double precision format if the D extension is supported).
    Zfhmin constitutes a reduced set of instruction which can be used by platforms where computing with half-precision values directly is not required but which still require the capability to manipulate them, in particular in memory, before converting them to a larger format for an eventual computation.

Vector support for half precision: Zvfh and Zvfhmin

The RISC-V Vector extension (RVV) version 1.0 specified vector support for single and double precision (SEW=32 and 64 bits). A draft of the specification (link to source) introduces Zvfh which is an extension of all the floating-point instruction to half-precision , including conversions with other formats, and Zvfhmin which is a really reduced subset of operations. As is the case for other floating-point format, half precision is supported by a specific value of the vsew field of the vtype configuration register: vsew=1. This encoding corresponding to a Selected Element Width (SEW) of 16 bits and is identical for both integer and floating point formats.

Zvfhmin only mandates the support of conversion between half and single precision: it extends the support of vfwcvt.f.f.v (widening half-to-float) and vfncvt.f.f.w (narrowing float-to-half) to SEW=16-bit.

Zvfh extends to half-precision all the floating-point vector operations, floating-point reductions, floating-point moves., a brief description of those operations can be found on Part 2 of our series RISC-V Vector Extension in a Nutshell.

Both Zvfhmin and Zvfh mandates the support of single precision element in vectors. On top of this support Zvfh mandates at least Zfhmin on the scalar floating-point side.

Half precision in RVA22 profile

The RISC-V consortium defines profiles. These profiles aim at defining a common set of mandatory extensions and a reduced set of optional extensions which can be used by hardware and software providers to build a compatible ecosystem without having to deal with more specialized ISA extensions. Profile descriptions can be found on RISC-V github.

RVA22 is the most recent profile, it is dedicated for 64-bit application processors

Zfhmin is part of the mandatory extensions of the RVA22 profile, while Zfh is an optional extension (which supersede Zfhmin when selected). This means that all application processors targeting compatibility with the RISC-V ecosystem must have a minimal support for half precision, and than extended support is part of the extended profile. Neither of the vector extensions Zvfh nor Zvfhmin are required in the RVA22 profile.

Conclusion

RISC-V Half-precision support in scalar operations has already been ratified (as extensions Zfh and Zfhmin) and his part of the latest application profile (RVA22). There exist draft specifications for the support of half-precision in vector operations: Zvfh and Zvfhmin. Other formats, such as BFloat16 or more esoteric number formats, should follow in the coming years.
RISC-V is an open community so do not hesitate to sign-in to stay up-to-date and participate to the effort: http://riscv.org.

Reference: