The RISC-V Vector Extension (RVV) defines and exploits the concept of vector register groups. This post clarifies what a register group is, how it is useful, and the caveats that come with it.
A single vector instruction can operate on operands each consisting of more than one register and can produce similarly extended results.
A group is a bundle of multiple registers. RVV 1.0 supports groups of 1, 2, 4 or 8 registers. The register indices in a group span a contiguous range, and the lowest index must be a multiple of the group size. For example, the vector registers [4, 5, 6, 7] form a valid group of size 4, but [5, 6] is not a valid group of size 2. The lowest index is used to specify the vector register group in the instruction encoding: e.g. when LMUL=4, v4 stands for the group v4, v5, v6, v7. Thus the actual size of the operands (and destination) of an instruction depends on the context (the vtype configuration, as we will see later): vadd.vv v4, v0, v12 may operate on groups of 1, 2 or 4 registers depending on the context (8 is not a legal size since neither v4 nor v12 is aligned on an index that is a multiple of 8).
The size of a vector register group is called the vector length multiplier (a.k.a. LMUL) in the RVV specification: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#342-vector-register-grouping-vlmul20.
The LMUL that applies to an operation is given by the vlmul field of vtype (so it is usually modified through the vsetvli, vsetivli and vsetvl instructions). The following RISC-V assembly sequence first sets LMUL to 8 and then executes a double-precision vector addition.
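The alignment rule can be sketched as a small Python model. This is only an illustration of the rule described above: the function names are hypothetical, and the model covers the integer group sizes 1, 2, 4 and 8 only.

```python
# Illustrative model of the register-group alignment rule (integer group sizes only).
# The names below are hypothetical helpers, not part of any RISC-V tool or library.
LEGAL_GROUP_SIZES = (1, 2, 4, 8)
NUM_VECTOR_REGISTERS = 32  # v0 .. v31


def group_registers(base, lmul):
    """Return the register indices of the group named by `base` for group size `lmul`."""
    if lmul not in LEGAL_GROUP_SIZES:
        raise ValueError(f"unsupported group size {lmul}")
    if not 0 <= base < NUM_VECTOR_REGISTERS:
        raise ValueError(f"v{base} is not a vector register")
    if base % lmul != 0:
        raise ValueError(f"v{base} is not aligned for a group of {lmul} registers")
    return list(range(base, base + lmul))


def legal_group_sizes(*bases):
    """Group sizes for which every given base index names a valid group."""
    return [lmul for lmul in LEGAL_GROUP_SIZES if all(b % lmul == 0 for b in bases)]


print(group_registers(4, 4))        # [4, 5, 6, 7]
print(legal_group_sizes(4, 0, 12))  # [1, 2, 4]  (the operands of vadd.vv v4, v0, v12)
```

Calling `group_registers(5, 2)` raises a `ValueError`, matching the [5, 6] counter-example.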
vsetvli x1, zero, e64, m8, ta, mu
vfadd.vv v16, v8, v24
Since the LMUL value is 8, the actual sequence of operations is equivalent to the following (assuming each instruction operates on a single-register group):
vfadd.vv v16, v8, v24
vfadd.vv v17, v9, v25
vfadd.vv v18, v10, v26
vfadd.vv v19, v11, v27
vfadd.vv v20, v12, v28
vfadd.vv v21, v13, v29
vfadd.vv v22, v14, v30
vfadd.vv v23, v15, v31
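The expansion above is mechanical, and can be reproduced with a short Python sketch. The helper name is hypothetical, and the model assumes a three-operand `.vv` instruction where destination and both sources use the same group size (it does not model widening, narrowing, masking or a vl smaller than VLMAX).

```python
# Hypothetical helper: expand one grouped .vv instruction into the equivalent
# single-register instructions, assuming all three operands share the same LMUL.
def expand_grouped_instruction(mnemonic, vd, vs2, vs1, lmul):
    for base in (vd, vs2, vs1):
        if base % lmul != 0:
            raise ValueError(f"v{base} is not aligned for a group of {lmul} registers")
    # Register i of each group is combined with register i of the other groups.
    return [f"{mnemonic} v{vd + i}, v{vs2 + i}, v{vs1 + i}" for i in range(lmul)]


for line in expand_grouped_instruction("vfadd.vv", 16, 8, 24, 8):
    print(line)
# first line printed: vfadd.vv v16, v8, v24
# last line printed:  vfadd.vv v23, v15, v31
```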
LMUL encoding in vtype
LMUL vs EMUL
Benefits of register grouping
- even if architectural registers are "only" VLEN bits wide, a single instruction can operate on larger operands and produce larger results:
  - for the results of widening operations or the sources of narrowing operations, building register groups makes it possible to use a full source register or to populate a full destination register
  - providing large input sets: for example, the input data set of a vrgather contains VLMAX elements, where VLMAX = (VLEN / SEW) * LMUL increases with LMUL, so one can build larger lookup tables by using larger register groups.
  - allowing reuse of operands across the registers of a group (e.g. an immediate or scalar operand)
- reducing code size: for data-parallel sections with vector lengths larger than what fits in a single vector register, fewer instructions are required to operate on the same amount of data.
- reducing the number of loop iterations: since VLMAX increases with LMUL, fewer iterations are required during loop strip-mining when LMUL increases.
- exposing more opportunities to the micro-architecture:
  - even if a micro-architecture cannot operate directly on register groups wider than VLEN bits, it can still exploit the micro-op parallelism and chaining opportunities exposed by larger register groups.
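To put numbers on the VLMAX and loop-iteration arguments, here is a minimal Python sketch. The values VLEN=128 and n=1000 are arbitrary assumptions chosen for illustration, the function names are hypothetical, and only integer LMUL values are modeled.

```python
# Illustrative arithmetic only; VLEN=128 and n=1000 are assumed example values.
def vlmax(vlen, sew, lmul):
    """Maximum number of elements in a register group: (VLEN / SEW) * LMUL."""
    return (vlen // sew) * lmul


def strip_mine_iterations(n, vlen, sew, lmul):
    """Iterations needed to process n elements, at most VLMAX elements per iteration."""
    per_iteration = vlmax(vlen, sew, lmul)
    return -(-n // per_iteration)  # ceiling division


for lmul in (1, 2, 4, 8):
    print(lmul, vlmax(128, 64, lmul), strip_mine_iterations(1000, 128, 64, lmul))
# 1 2 500
# 2 4 250
# 4 8 125
# 8 16 63
```

With these assumed values, going from LMUL=1 to LMUL=8 multiplies the vrgather table size by 8 and divides the strip-mined loop's iteration count by roughly the same factor.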
Constraints induced by register grouping
Conclusion
Resources
- "Vector Operands" section of the specification, detailing EMUL: https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#sec-vec-operands
- Mapping of vector elements in vector register groups when LMUL > 1: https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#sec-vec-operands