RISC-V Vector Element Groups

The RISC-V Vector extension defines multiple ways to organize vector data across one or more vector registers. In a previous blog post we presented the concept of vector register groups defined by RVV 1.0: the capability to group multiple vector registers into a (larger) meta vector register which can be operated on as a single operand by most instructions, thus extending the size of vector operands and results. Recently a new concept was introduced: vector element groups, which consider multiple contiguous elements as a single larger element and operate on the group as if it were a single element. The concept was suggested by Krste Asanovic in this email, and later specified in a standalone document of the vector spec: https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

Definition

A vector element group is defined by an effective element width (EEW) and an element group size (EGS): it is a group of EGS elements, each EEW bits wide. The total group width (in bits) is called the Element Group Width (EGW, with EGW = EEW * EGS).

NOTE: the single element width parameter implies that all elements in an element group have the same width.

The element group is useful to manipulate multiple data elements which make sense as a block (e.g. a 128-bit ciphertext for the AES cipher algorithm) without the need to define large element widths and implement their support in hardware.

An element group can be fully contained in one vector register or can overlap multiple registers. In the former case, a single vector register can contain multiple element groups.

An element group can also have an EGW larger than the implementation's VLEN; in this case a multi-register group is required to fit a single element group. The same constraints as for any vector register group apply: the register group is encoded by its first register, whose index must be a multiple of the group size.
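As an illustration, the relationship between EEW, EGS, EGW and the number of vector registers needed can be sketched numerically (illustrative Python only, not part of any specification; the values reuse the 128-bit AES block mentioned above):

```python
def egw_bits(eew: int, egs: int) -> int:
    """Element Group Width in bits: EGW = EEW * EGS."""
    return eew * egs

def registers_per_group(eew: int, egs: int, vlen: int) -> int:
    """Vector registers needed to hold one element group.

    When EGW <= VLEN a single register suffices (and may even hold
    several groups); when EGW > VLEN a multi-register group is needed.
    """
    egw = egw_bits(eew, egs)
    return max(1, -(-egw // vlen))  # ceiling division

# 128-bit AES block as 4x 32-bit elements: EGW = 128
assert egw_bits(32, 4) == 128
# VLEN=128: one register holds exactly one element group
assert registers_per_group(32, 4, 128) == 1
# VLEN=64: a 2-register group is required for a single element group
assert registers_per_group(32, 4, 64) == 2
```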

EEW can either be specific to the opcode or defined through SEW. For example, most vector crypto instructions define EEW from SEW (even if only a small subset of values are legal): vtype must be set properly before executing a vector crypto instruction. This is in particular useful to reuse the same instruction for different algorithms: e.g. setting SEW=32 bits and vl=4 and executing vsha2c performs a SHA-256 message compression, while setting SEW=64 bits and vl=4 and executing vsha2c performs a SHA-512 message compression. We will provide more detail on the new vector crypto extension in a future post.

Contrary to EEW, EGS is always defined by the opcode: it is not a new vtype field. For example vsha2ms (SHA-2 message schedule) statically defines EGS as 4.

Constraints on vl and vstart

The vector length (vl) of an element-group operand is counted in elements, not in groups, and it must be a multiple of the element group size EGS; the same constraint applies to vstart. Other cases are reserved. This means that operating on a single element group with EGS=8 requires setting vl to 8, operating on 3 element groups requires setting vl to 24, and so forth.

The case where vl (or vstart) is not a multiple of EGS is reserved: executing an instruction working on element groups may then result in an illegal instruction exception being signaled.
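The vl rule can be restated as a small helper (a Python sketch of the arithmetic described above; the function names are ours, not from the spec):

```python
def is_legal_vl(vl: int, egs: int) -> bool:
    """vl (and vstart) must be a multiple of EGS; other values are reserved."""
    return vl % egs == 0

def vl_for_groups(n_groups: int, egs: int) -> int:
    """vl is counted in elements, so n groups require n * EGS elements."""
    return n_groups * egs

assert vl_for_groups(1, 8) == 8    # one element group with EGS=8
assert vl_for_groups(3, 8) == 24   # three element groups
assert is_legal_vl(24, 8)
assert not is_legal_vl(10, 8)      # reserved: may raise an illegal instruction
```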

Masking and element groups

The specification of element groups leaves a lot of room when it comes to masking: masking support is defined on a per-operation basis. The concept allows for per-element masking and mask setting, or per element-group masking (if any or all mask bits corresponding to elements in the group are set). The concept does not seem to cover the case where a single mask bit corresponds to a full group regardless of the actual number of elements in the group: mask bit 0 would correspond to group 0, mask bit 1 to group 1, and so on. This case would behave similarly to masking with SEW=EGW.

In the only existing use case (the vector cryptography extension), none of the instructions defined as operating on element groups support masking, so the problem of implementing element group masking is deferred to a future extension.

Examples and use cases

The following diagram illustrates two examples of element groups. The top element group has an EGW half as wide as VLEN, so two element groups fit in a single vector register. The bottom example has an EGW twice as wide as VLEN, so a 2-register group is required to fit a single element group.


The element group concept was first used by the vector cryptography extension proposal (draft under architectural review at the time of writing). Different element group configurations are used:

  • 4x 32-bit elements for AES, SM4, GHASH and SHA-256
  • 8x 32-bit elements for SM3
  • 4x 64-bit elements for SHA-512
As mentioned earlier, the vector crypto extension requires SEW to be set to an element width supported by the instruction being executed, and considers all other cases as reserved.

Difference between element groups and larger SEW

A relevant question one may ask regarding element groups is: what is the difference between an element group with EGW=128 (e.g. EEW=32 and EGS=4) and a single element with SEW=128?

The first fact is that currently SEW (as defined by the vsew field of the vtype register) cannot exceed 64 bits (RVV 1.0 spec). Although the vsew field is large enough to accommodate larger values, those encodings are currently reserved. So element groups bring support for larger data blocks without requiring new vsew encodings.

A second fact is that support for element groups is possible even if EGW exceeds ELEN, the maximal element width supported by the implementation. In fact EGW can even exceed VLEN: this case is part of the element group concept and is supported by laying an element group across a multi-register vector register group. This is not currently supported for single elements. This reflects the fact that operations using element groups do not need an actual datapath wider than ELEN: most operations are performed on ELEN-bit wide data or less.

A third fact is that supporting element groups can be done mostly transparently from the micro-architecture point of view: there is not much difference between a single element group with EGS=4 and EEW=32 and a 4-element vector with EEW=32; internal element masking and tail processing can reuse the same datapaths.

Conclusion

Vector element groups represent an interesting new paradigm to extend the capabilities of vector processors, similar to the concept of vector register groups with integer or fractional group length multipliers (reviewed in the RISC-V Vector Register Groups post). They allow the reuse of existing SEW, vl and LMUL values with a different meaning and different constraints. We will review the use of vector element groups when we cover the upcoming vector crypto extensions (stay tuned for this blog series).

RISC-V Compressed Instructions (part 2): Zc extensions

In this second blog post on the RISC-V compressed extensions we will review the Zc* family of extensions designed to improve code size for embedded platforms (if you have not read the first part, it is accessible here).

RISC-V Zc* extension

Zc* is a family of 6 extensions: Zca, Zcf, Zcd, Zcb, Zcmp, and Zcmt.
Zcf, Zcd, and Zca are new labels for existing parts of the C extension: Zcf contains the single-precision compressed loads/stores (including from/to the stack), Zcd contains the same type of instructions for double precision, and Zca contains all other C extension instructions.

As the C extension was covered in this post, we will focus here on the other sub-extensions.

Overview

Zcb extension

The Zcb extension contains 12 instructions: 5 load/store operations, 5 sign/zero extension operations, one multiplication instruction and one bitwise not.
All 12 instructions are limited to 3-bit register addressing, targeting the subset of 8 registers (x8 - x15) defined in the C extension. They all have a single 32-bit equivalent instruction, but contrary to the C extension not all equivalents can be found in the base ISA: the zero/sign extension equivalents are in Zbb (bitmanip extension) and the equivalent of c.mul is in the M extension (or Zmmul). The extension providing the equivalent instruction must also be implemented to allow the compressed version: for example M or Zmmul is required to support c.mul.
Zcb loads and stores bring compressed support for smaller sizes than the base C extension: while the C extension only offered word and double-word loads and stores (from/to generic addresses or from/to the stack), Zcb offers loads and stores for bytes and half-words. Loads can sign extend or zero extend the value read from memory into the destination register. Both loads and stores use memory addresses formed by adding a small immediate to a base address read from a general purpose register (there is no Zcb load/store dedicated to the stack pointer).
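The sign/zero extension performed by these loads can be sketched as follows (a Python model of the bit manipulation only; it does not model Zcb encodings or memory access):

```python
def zero_extend(value: int, width: int) -> int:
    """Keep only the `width` low bits: upper destination bits become 0."""
    return value & ((1 << width) - 1)

def sign_extend(value: int, width: int) -> int:
    """Replicate bit width-1 into the upper bits (two's complement)."""
    value &= (1 << width) - 1
    if value & (1 << (width - 1)):
        value -= 1 << width
    return value

# A half-word load of 0xFFFE: zero extension vs sign extension
assert zero_extend(0xFFFE, 16) == 0xFFFE
assert sign_extend(0xFFFE, 16) == -2
# A byte load of 0x7F is positive either way
assert sign_extend(0x7F, 8) == zero_extend(0x7F, 8) == 0x7F
```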


Zcmp: push, pop and paired register atomic move

Zcmp introduces 6 new instructions, 4 of which are specifically designed to reduce the code size of function calls and returns: push, pop, popret, popretz. These instructions are well suited for implementation in embedded processors and cover a large range of common operations mandated by the RISC-V ABI when calling or returning from a function.

They bundle stack frame allocation (push) / de-allocation (pop/popret), register spilling (push) / restoring (pop/popret) and function return (popret).
The stack frame is the section of the call stack used by a function call, for example to store arguments which do not fit in registers, or to save callee-saved registers when required.

The last 2 instructions are paired register moves: cm.mva01s copies data from two registers in the s0-s7 range into a0 and a1 (the two s* registers can be the same) and cm.mvsa01 copies the contents of a0 and a1 into two distinct registers of the s0-s7 range. These two instructions behave atomically: the intermediate state where only one of the two moves has been performed is never architecturally visible. Hence the two instructions do not expand into a single equivalent instruction, nor even into a sequence of two instructions which could be split or interrupted.
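The atomic pair-move semantics can be modeled as a simultaneous assignment (a loose Python analogy, with the register file represented as a dict; the helper name is ours):

```python
def mva01s(regs: dict, rs1: str, rs2: str) -> None:
    """Model of cm.mva01s: a0 <- rs1, a1 <- rs2, as one indivisible step.

    Python's tuple assignment reads both sources before writing either
    destination, mirroring the requirement that the half-done state is
    never architecturally visible.
    """
    regs["a0"], regs["a1"] = regs[rs1], regs[rs2]

regs = {"a0": 0, "a1": 0, "s0": 11, "s1": 22}
mva01s(regs, "s0", "s1")
assert (regs["a0"], regs["a1"]) == (11, 22)
# the two source registers may be the same register
mva01s(regs, "s1", "s1")
assert (regs["a0"], regs["a1"]) == (22, 22)
```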

Zcmt: table jump

The Zcmt extension introduces support for an indexed jump to an address stored in a jump table, reducing a sequence of multiple instructions to a single instruction as long as the jump target can be stored in a 256-entry jump table.

The Zcmt extension introduces a new Control/Status Register (CSR), JVT, and adds two new instructions: cm.jt and cm.jalt. cm.jalt is an extended version of cm.jt: on top of cm.jt's behaviour it links the return address in ra, copying the address of the instruction following the cm.jalt into the ra register (to allow returning to it later).
The JVT register is a WARL (write any values, reads legal values) CSR: any value can be written without raising an illegal instruction exception, but only legal values can be read back. This register contains a 6-bit mode field (located in the least significant bits and mostly reserved for future use) and an (XLEN - 6)-bit wide base field.
If JVT.mode is a reserved value then the cm.jt and cm.jalt encodings are also reserved: implementations may raise an illegal instruction exception when executing them. If JVT.mode is 0, then cm.jt/cm.jalt are decoded successfully, and an index value is extracted from the 8-bit opcode bitfield located between bits 2 and 9. Both instructions share the same opcode field values; the difference is made on the index value: if the index is less than 32 the instruction is decoded as cm.jt, otherwise it is decoded as cm.jalt.
The instruction builds a table_address as JVT.base + (index << 2) for RV32 (index << 3 for RV64) and reads the jump target address from program memory at table_address. The hart then jumps to that address (and links the return address in ra in the case of cm.jalt only).
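Putting the two previous paragraphs together, the index decoding and table address computation can be modeled as follows (illustrative Python; the JVT.base value is made up for the example):

```python
def decode_zcmt(index: int) -> str:
    """Shared 8-bit index field: index < 32 decodes as cm.jt, else cm.jalt."""
    return "cm.jt" if index < 32 else "cm.jalt"

def table_address(jvt_base: int, index: int, xlen: int) -> int:
    """table_address = JVT.base + (index << 2) on RV32, (index << 3) on RV64."""
    shift = 2 if xlen == 32 else 3
    return jvt_base + (index << shift)

assert decode_zcmt(5) == "cm.jt"
assert decode_zcmt(40) == "cm.jalt"
# RV32: each table entry is 4 bytes wide
assert table_address(0x80000000, 5, 32) == 0x80000014
# RV64: each table entry is 8 bytes wide
assert table_address(0x80000000, 5, 64) == 0x80000028
```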

This is illustrated by the following diagram: in this example the hart's control flow eventually jumps to the address 0x1337 stored in the i-th entry of the jump table.

Zc* benchmark studies (spreadsheet) show up to 10% code size gain. The gain is twofold: transforming a sequence of up to two instructions into a single instruction and a table entry, and factoring that table entry among many different program locations: as the number of call sites grows, the asymptotic cost becomes a single instruction.

Encoding and decoding

The encoding of compressed instructions was studied in the previous blog post on the compressed extension C. But that post did not detail how compressed instructions can be distinguished from non-compressed instructions. This decoding step is particularly important since it determines whether an instruction occupies 16 or 32 bits, which in turn determines where the next instruction starts. Super-scalar implementations need to do this on large bit vectors within a few logic levels to ensure proper instruction throughput through decode, and smaller implementations need to do it efficiently to avoid wasting power decoding the wrong size of instruction.




Base instructions have a 7-bit opcode stored in the 7 least significant bits (visible in the R-type encoding of the figure above). Compressed instructions use a 2-bit opcode field stored in the 2 least significant bits. For all base extensions (including the V extension), the 2 least significant bits (out of 7) always take the value 0x3, while they can take the values 0x0, 0x1 or 0x2 for compressed instructions. Thus only 2 bits are required to discriminate between compressed and uncompressed instructions and to determine how many bytes (2 or 4) must be read to get the complete instruction half-word/word.
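This discrimination can be expressed as a one-line check (a Python sketch covering only the 16/32-bit formats discussed here; longer instruction formats reserve other bit patterns):

```python
def instruction_size_bytes(first_halfword: int) -> int:
    """Instruction length from the 2 least significant bits:
    0b11 -> 32-bit (uncompressed), 0b00/0b01/0b10 -> 16-bit (compressed)."""
    return 4 if (first_halfword & 0x3) == 0x3 else 2

assert instruction_size_bytes(0x4501) == 2  # low bits 0b01: compressed
assert instruction_size_bytes(0x0533) == 4  # low bits 0b11: 32-bit instruction
```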


Compatibility

Instructions of the Zcmp and Zcmt extensions reuse encodings of compressed double-precision floating-point loads and stores and are thus incompatible with those instructions. This is in line with the philosophy of these new extensions: improve code size for embedded platforms, where floating-point instructions are less critical (and may not even be implemented in their uncompressed forms).

Conclusion

Code size is a key metric for ISA efficiency and can greatly impact overall program performance.
In addition to the base C extension, the Zc* family of extensions provides a way to greatly improve code size for embedded applications by compressing whole program sequences into single instructions, including function entry and return sequences, ABI register moves and table jumps. The code size reduction task group has demonstrated substantial savings; their methods and results can be consulted at https://github.com/riscv/riscv-code-size-reduction. These extensions are mostly targeted at embedded systems.
