RISC-V Vector Register Groups

The RISC-V Vector Extension (RVV) defines and exploits the concept of vector register groups. This post clarifies what a register group is, how it is useful, and the caveats associated with it.

A single vector instruction can operate on operands that each consist of more than one register, and can produce similarly extended results.

A group is a bundle of multiple registers. RVV 1.0 supports groups of 1, 2, 4 or 8 registers. The register indices in a group span a contiguous range, and the lowest index must be a multiple of the group size. For example, the vector registers [4, 5, 6, 7] form a valid group of size 4, but [5, 6] is not a valid group of size 2. The lowest index is used to specify the vector register group in an instruction encoding: e.g. when LMUL=4, v4 stands for v4v5v6v7. Thus the actual sizes of the operands (and destination) of an instruction depend on the context (the vtype configuration, as we will see later): vadd.vv v4, v0, v12 may operate on register groups of 1, 2 or 4 registers depending on the context (8 is not a legal size since neither v4 nor v12 is aligned on an index that is a multiple of 8).
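The alignment rule can be checked mechanically. The following Python snippet is only a sketch of the rule as described in this post; the helper names `register_group` and `legal_group_sizes` are made up for illustration and do not come from the specification.

```python
# Sketch of the register group rules described above (hypothetical helpers).
VALID_GROUP_SIZES = (1, 2, 4, 8)
NUM_VECTOR_REGISTERS = 32  # v0 .. v31


def register_group(base, size):
    """Return the register indices of the group designated by `base`."""
    if size not in VALID_GROUP_SIZES:
        raise ValueError(f"unsupported group size {size}")
    if not 0 <= base < NUM_VECTOR_REGISTERS:
        raise ValueError(f"v{base} is not a vector register")
    if base % size != 0:
        raise ValueError(f"v{base} is not aligned on a multiple of {size}")
    return list(range(base, base + size))


def legal_group_sizes(*bases):
    """Group sizes for which every register index in `bases` is aligned."""
    return [s for s in VALID_GROUP_SIZES if all(b % s == 0 for b in bases)]


print(register_group(4, 4))         # [4, 5, 6, 7]
print(legal_group_sizes(4, 0, 12))  # [1, 2, 4]
```

The second call corresponds to the vadd.vv v4, v0, v12 example: size 8 is rejected because 4 and 12 are not multiples of 8.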

The size of a vector register group is called the vector length multiplier (a.k.a. LMUL) in the RVV specification: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#342-vector-register-grouping-vlmul20

For an operation, LMUL is set by the vlmul field in vtype (so it is usually modified through the vsetvli, vsetivli and vsetvl instructions). The following RISC-V assembly sequence first sets LMUL to 8 and then executes a double-precision vector addition.

	vsetvli	x1, zero, e64, m8, ta, mu
	vfadd.vv	v16, v8, v24

Since LMUL is 8, this instruction is equivalent to the following sequence of operations (assuming each instruction operates on a single-register group):

	vfadd.vv	v16, v8,  v24
	vfadd.vv	v17, v9,  v25
	vfadd.vv	v18, v10, v26
	vfadd.vv	v19, v11, v27
	vfadd.vv	v20, v12, v28
	vfadd.vv	v21, v13, v29
	vfadd.vv	v22, v14, v30
	vfadd.vv	v23, v15, v31

LMUL encoding in vtype

In RVV 1.0, vlmul can take 4 values for integral LMUL: 0b000 (LMUL=1), 0b001 (LMUL=2), 0b010 (LMUL=4) and 0b011 (LMUL=8). It can also take 3 fractional values: 0b111 (LMUL=1/2), 0b110 (LMUL=1/4) and 0b101 (LMUL=1/8), but more on that in a future post. The remaining encoding, 0b100, is reserved.
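As a quick illustration, the encoding can be written as a lookup table. This is a minimal Python sketch, not reference code: `decode_lmul` is a hypothetical helper, and it assumes that vlmul occupies the three least significant bits of vtype, as in the RVV 1.0 vtype layout.

```python
from fractions import Fraction

# vlmul encoding -> LMUL, as listed above (0b100 is reserved).
VLMUL_TO_LMUL = {
    0b000: Fraction(1),
    0b001: Fraction(2),
    0b010: Fraction(4),
    0b011: Fraction(8),
    0b101: Fraction(1, 8),
    0b110: Fraction(1, 4),
    0b111: Fraction(1, 2),
}


def decode_lmul(vtype):
    """Extract LMUL from a vtype value (assumption: vlmul is vtype[2:0])."""
    vlmul = vtype & 0b111
    if vlmul not in VLMUL_TO_LMUL:
        raise ValueError(f"reserved vlmul encoding {vlmul:#05b}")
    return VLMUL_TO_LMUL[vlmul]


print(decode_lmul(0b011))  # 8
print(decode_lmul(0b111))  # 1/2
```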

LMUL vs EMUL

The actual length multiplier used for an operand or destination of an operation is called EMUL, for effective LMUL. EMUL depends on vlmul, but also on the operation. For example, EMUL=LMUL for every source and the destination of vadd.vv, but EMUL=2*LMUL for both vd and vs2 of vwadd.wv (because of its widening characteristic). EMUL can never be larger than 8 (a vector register group cannot exceed 8 registers), so the combination of LMUL=8 with a widening instruction is reserved: it is not supported in RVV 1.0.

Some operations embed the effective element width (EEW) in their opcode and define EMUL so as to preserve the ratio EEW/EMUL=SEW/LMUL. For example, if LMUL=1 and SEW=8-bit, then vle32.v has EEW=32-bit and EMUL=EEW*LMUL/SEW=4. This also means that the destination register group has to be designated by a register whose index is a multiple of 4 (EMUL).
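The EMUL arithmetic can be sketched in a few lines of Python. The helper `emul_for_eew` is hypothetical; it simply applies EMUL=EEW*LMUL/SEW and rejects results outside the 1/8 to 8 range, which the specification treats as reserved.

```python
from fractions import Fraction


def emul_for_eew(eew, sew, lmul):
    """EMUL that preserves the ratio EEW/EMUL = SEW/LMUL."""
    emul = Fraction(eew) * Fraction(lmul) / sew
    if not Fraction(1, 8) <= emul <= 8:
        raise ValueError(f"EMUL={emul} is out of range (reserved)")
    return emul


# vle32.v with SEW=8 and LMUL=1: the destination group has 4 registers.
print(emul_for_eew(eew=32, sew=8, lmul=1))  # 4

# Wide operand of a widening operation (EEW=2*SEW) with LMUL=4: EMUL=8.
print(emul_for_eew(eew=64, sew=32, lmul=4))  # 8
```

Calling `emul_for_eew(eew=64, sew=32, lmul=8)` raises an error, which mirrors the fact that widening instructions cannot be used with LMUL=8.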

Since EMUL can differ between an instruction's operands and its result, partial or complete overlap between register groups is a possibility. RVV 1.0 allows 3 types of overlap between source and destination register groups. Those overlaps are specified here https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#52-vector-operands and illustrated by the figure below (example of the v0v1v2v3 register group in various source/destination EEW configurations).
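As a rough sketch, the three cases from the specification section linked above can be paraphrased as follows: overlap is allowed when the destination EEW equals the source EEW; when the destination EEW is smaller and the overlap is in the lowest-numbered part of the source group; or when the destination EEW is larger, the source EMUL is at least 1, and the overlap is in the highest-numbered part of the destination group. The Python below is one reading of these rules, with hypothetical helper names, and the specification remains the reference.

```python
from fractions import Fraction
from math import ceil


def group(base, emul):
    """Registers occupied by a group (a fractional EMUL still uses one register)."""
    return set(range(base, base + max(1, ceil(emul))))


def overlap_is_legal(dst_base, dst_eew, dst_emul, src_base, src_eew, src_emul):
    """Check one destination group against one source group."""
    dst, src = group(dst_base, dst_emul), group(src_base, src_emul)
    common = dst & src
    if not common or dst_eew == src_eew:
        return True
    if dst_eew < src_eew:
        # Narrowing: overlap must be in the lowest-numbered part of the source.
        return min(common) == min(src)
    # Widening: source EMUL >= 1 and overlap in the highest part of the destination.
    return src_emul >= 1 and max(common) == max(dst)


# Narrowing at LMUL=1 with a v0v1 source: destination v0 is fine, v1 is not.
print(overlap_is_legal(0, 16, 1, 0, 32, 2))  # True
print(overlap_is_legal(1, 16, 1, 0, 32, 2))  # False
# Extension by 4 at LMUL=8 into v0..v7: source v6v7 is fine, v4v5 is not.
print(overlap_is_legal(0, 32, 8, 6, 8, 2))   # True
print(overlap_is_legal(0, 32, 8, 4, 8, 2))   # False
```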


Benefits of register grouping

There are several benefits associated with the availability of register groups:
  • even if architectural registers are "only" VLEN bits wide, a single instruction can operate on larger operands and produce larger results:
    • for the result of a widening operation or the source of a narrowing operation, building a register group makes it possible to use a full register as source or to populate a full register as destination
    • providing large input sets: for example, the input data set of a vrgather contains VLMAX elements, where VLMAX=VLEN/SEW*LMUL increases with LMUL, so one can build a larger lookup table by using larger register groups
    • allowing reuse of operands across the registers of a group (e.g. an immediate or scalar operand)
    • reducing code size: for data-parallel sections with vector lengths larger than what fits in a single vector register, fewer instructions are required to operate on the same amount of data
    • reducing the number of loop iterations: since VLMAX increases with LMUL, fewer iterations are required during loop strip-mining when LMUL increases
  • exposing more opportunities to the micro-architecture:
    • even if a micro-architecture cannot operate directly on register groups wider than VLEN bits, it can still exploit the micro-op parallelism and chaining opportunities exposed by larger register groups
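To put numbers on the VLMAX and strip-mining points, here is a small Python sketch. VLEN=128 and the array length of 1000 elements are arbitrary assumptions chosen for illustration, and the iteration count assumes a loop where each iteration processes up to VLMAX elements.

```python
from fractions import Fraction


def vlmax(vlen, sew, lmul):
    """VLMAX = VLEN / SEW * LMUL (LMUL may be fractional)."""
    return int(Fraction(vlen, sew) * lmul)


def stripmine_iterations(n, vlen, sew, lmul):
    """Loop iterations needed to process n elements, VLMAX elements at a time."""
    per_iteration = vlmax(vlen, sew, lmul)
    return -(-n // per_iteration)  # ceiling division


VLEN = 128  # assumed implementation parameter, in bits

# Lookup table size for a vrgather on 8-bit elements.
print(vlmax(VLEN, sew=8, lmul=1))  # 16
print(vlmax(VLEN, sew=8, lmul=8))  # 128

# Iterations for 1000 double-precision elements.
print(stripmine_iterations(1000, VLEN, sew=64, lmul=1))  # 500
print(stripmine_iterations(1000, VLEN, sew=64, lmul=8))  # 63
```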

Constraints induced by register grouping

One of the caveats of register grouping is that the number of available register groups decreases as the length multiplier increases. For example, there are only four 8-register groups: v0, v8, v16 and v24. This means that register allocation pressure can be a limiting factor when using larger register groups: it becomes more difficult to keep in registers all the variables that are simultaneously live in a sequence, and spilling to memory may be required.

This pressure is increased by the fact that v0 is used as the mask operand in RVV 1.0, which further limits the number of usable register groups: when using masked operations with large register groups, v0 often prevents a program from using the first register group. For example, when LMUL=4, if one wants to use v0 as the mask input, then one cannot use the v0,v1,v2,v3 group. This group can still be partially exploited, for example with LMUL=1 to build the v0 mask, or with LMUL=2 (v2,v3).
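The effect on register allocation can be counted directly. The sketch below uses a hypothetical helper, `available_groups`, and follows the simplification made in this post: when v0 holds a mask, the whole group containing v0 is considered unavailable.

```python
def available_groups(lmul, v0_holds_mask=False):
    """Base registers of the usable groups for an integral LMUL."""
    bases = list(range(0, 32, lmul))
    if v0_holds_mask:
        # The group starting at v0 contains the mask register.
        bases = [b for b in bases if b != 0]
    return bases


print(available_groups(8))                           # [0, 8, 16, 24]
print(available_groups(8, v0_holds_mask=True))       # [8, 16, 24]
print(len(available_groups(4, v0_holds_mask=True)))  # 7
```

With LMUL=8 and masking, only three groups remain, which is barely enough for a two-source operation with a separate destination.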

Larger inputs may not be well supported by the implementation: for example, a vrgather with an LMUL=8 input data set may not offer the same latency and/or throughput as a vrgather with LMUL=1, since each result element may originate from a larger pool. Thus, even if it is legal, such an instruction configuration may not be efficient, yet the implementation still needs to support it one way or another.

Conclusion

Vector register groups and the associated length multiplier are an interesting concept. They can be leveraged thanks to the vector-length-agnostic characteristic of RVV: software built around the vector length parameter can be agnostic of VLEN and thus easily extended to VLEN*LMUL-bit register groups. If you want to learn more about the RISC-V Vector Extension, you may consider reading our RVV in a Nutshell blog series, starting with part 1.

Resources



