Moving to substack

Dear reader,

This blog is moving to Substack: https://fprox.substack.com/

You will find past articles (with corrections and extensions) and new ones as time goes on.

For example, a list of posts related to the RISC-V vector extensions can be found here: https://fprox.substack.com/p/risc-v-vector-in-a-nutshell

If you'd like, you can subscribe to the Substack and receive an email whenever new posts are published.

See you there.


RISC-V Vector Element Groups

The RISC-V Vector extension defines multiple ways to organize vector data over one or multiple vector registers. In a previous blog post we presented the concept of vector register groups defined by RVV 1.0: the ability to group multiple vector registers into a larger meta vector register which most instructions can operate on as a single operand, thus extending the size of vector operands and results. Recently a new concept was introduced, the vector element group: multiple contiguous elements are considered as a single larger element, and a group is operated on as if it were a single element. The concept was suggested by Krste Asanovic in this email and later specified in a standalone document of the vector spec: https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

Definition

A vector element group is defined by an effective element width (EEW) and an element group size (EGS): it is a group of EGS elements, each EEW bits wide. The total group width (in bits) is called the Element Group Width (EGW, with EGW = EEW * EGS).
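As a quick illustration, the EGW relation can be sketched in Python (the helper name is ours, not from the specification):

```python
# Illustrative sketch of the definition above: EGW = EEW * EGS.

def element_group_width(eew_bits, egs):
    """Total width (in bits) of a group of EGS elements, each EEW bits wide."""
    return eew_bits * egs

# A 128-bit AES block seen as a group of four 32-bit elements:
assert element_group_width(32, 4) == 128
# A group of four 64-bit elements (as used for SHA-512) is 256 bits wide:
assert element_group_width(64, 4) == 256
```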

NOTE: the single element width parameter implies that all elements in an element group have the same width.

The element group is useful to manipulate multiple data elements which make sense as a block (e.g. a 128-bit ciphertext for the AES cipher algorithm) without the need to define large element widths and implement their support in hardware.

An element group can be fully contained in one vector register or can overlap multiple registers. In the former case, a single vector register can contain multiple element groups.

An element group can also have an EGW larger than the implementation's VLEN; in this case a multi-register group is required to fit a single element group. The same constraints as for any vector register group apply: the register group is encoded by its first register, whose index must be a multiple of the group size.
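These two rules can be sketched with a small model (function names are ours; this is an illustration of the constraints above, not spec text):

```python
import math

# How many vector registers an element group occupies for a given VLEN,
# and the alignment constraint on the register group's first index.

def registers_per_element_group(egw_bits, vlen_bits):
    # An element group spans ceil(EGW / VLEN) registers (at least one).
    return max(1, math.ceil(egw_bits / vlen_bits))

def is_valid_group_start(reg_index, group_size):
    # A register group is encoded by its first register, whose index
    # must be a multiple of the group size.
    return reg_index % group_size == 0

# EGW=256 on a VLEN=128 implementation needs a 2-register group...
assert registers_per_element_group(256, 128) == 2
# ...which may start at v2 but not at v3.
assert is_valid_group_start(2, 2) and not is_valid_group_start(3, 2)
```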

EEW can either be specific to the opcode or defined through SEW. For example, most vector crypto instructions define EEW from SEW (even if only a small subset of values are legal): vtype must be configured properly before executing a vector crypto instruction. This is particularly useful to reuse the same instruction for different algorithms: e.g. setting SEW=32 bits and vl=4 before executing vsha2c performs a SHA-256 message compression, while setting SEW=64 bits and vl=4 before executing vsha2c performs a SHA-512 message compression. We will provide more detail on the new vector crypto extension in a future post.

Contrary to EEW, EGS is always defined by the opcode: it is not a new vtype field. For example vsha2ms (SHA-2 message schedule) statically defines EGS as 4.

Constraints on vl and vstart

The vector length (vl) of an element-group operand is counted in elements, not in groups, and it must be a multiple of the element group size EGS; the same constraint applies to vstart. Other cases are reserved. This means that operating on a single element group with EGS=8 requires setting vl to 8, operating on 3 element groups requires setting vl to 24, and so forth.

The case where vl (or vstart) is not a multiple of EGS is reserved: executing an instruction working on element groups may then result in an illegal instruction exception being signaled.
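The legality rule above is a simple modulo check (a sketch; `is_legal_vl` is our own name):

```python
# vl and vstart must be multiples of EGS; other values are reserved.

def is_legal_vl(vl, egs):
    return vl % egs == 0

EGS = 8
assert is_legal_vl(8, EGS)       # one element group
assert is_legal_vl(24, EGS)      # three element groups
assert not is_legal_vl(12, EGS)  # reserved: may trap as illegal instruction
```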

Masking and element groups

The specification of element groups leaves a lot of room when it comes to masking: masking support is defined on a per-operation basis. The concept allows for per-element masking and mask setting, or per-element-group masking (if any or all mask bits corresponding to elements in the group are set). The concept does not seem to cover the case where a single mask bit corresponds to a full group regardless of the actual number of elements in the group: mask bit 0 would correspond to group 0, mask bit 1 to group 1, and so on. This case would behave similarly to masking with SEW=EGW.

In the only existing use case (the vector cryptography extension), none of the instructions defined as operating on element groups support masking, so the problem of element group masking implementation will be delayed to a future extension.

Examples and use cases

The following diagram illustrates two examples of element groups. The top element group has EGW half as wide as VLEN, so two element groups fit in a single vector register. The bottom example has EGW twice as wide as VLEN, so a 2-register group is required to fit a single element group.


The element group concept was first used by the vector cryptography extension proposal (draft under architectural review at the time of writing). Different element group configurations are used:

  • 4x 32-bit elements for AES, SM4, GHASH and SHA-256
  • 8x 32-bit elements for SM3
  • 4x 64-bit elements for SHA-512

As mentioned earlier, the vector crypto extension requires SEW to be set to an element width supported by the instruction being executed, and considers all other cases as reserved.

Difference between element groups and larger SEW

A relevant question one may ask regarding element groups is: what is the difference between an element group with EGW=128 (e.g. EEW=32 and EGS=4) and a single element with SEW=128?

The first fact is that currently SEW (as defined by the vsew field of the vtype register) cannot exceed 64 bits (see the RVV 1.0 specification). Although the vsew field is large enough to accommodate larger values, those encodings are currently reserved. So element groups bring support for larger data blocks without requiring new vsew encodings.

A second fact is that element groups can be supported even if EGW exceeds ELEN, the maximal element width supported by the implementation. In fact EGW can even exceed VLEN: this case is part of the element group concept and is supported by laying the element group across a multi-register vector register group. This is not currently supported for single elements. This reflects the fact that operations using element groups do not need an actual datapath wider than ELEN: most operations are performed on ELEN-bit wide data or less.

A third fact is that element groups can be supported mostly transparently from the micro-architecture point of view: there is not much difference between a single element group with EGS=4 and EEW=32 and a 4-element vector with EEW=32; internal element masking and tail processing can reuse the same datapaths.

Conclusion

Vector element groups are an interesting new paradigm to extend the capabilities of vector processors, similar to the concept of vector register groups with integer or fractional group length multipliers (reviewed in the RISC-V Vector Register Groups post). They allow existing SEW, vl and LMUL values to be reused with a different meaning and different constraints. We will review the use of vector element groups when we cover the upcoming vector crypto extensions (stay tuned for this blog series).

RISC-V Compressed Instructions (part 2): Zc extensions

In this second blog post on RISC-V compressed extensions we will review the Zc* family of extensions designed to improve code size for embedded platforms (if you have not read the first part, it is accessible here).

RISC-V Zc* extension

Zc* is a family of 6 extensions: Zca, Zcf, Zcd, Zcb, Zcmp, and Zcmt.
Zcf, Zcd, and Zca are new labels for existing parts of the C extension: Zcf contains the single-precision compressed loads/stores (including from/to the stack), Zcd contains the same type of instructions for double precision, and Zca contains all other C extension instructions.

As the C extension was covered in this post, we will focus here on the other sub-extensions.

Overview

Zcb extension

The Zcb extension contains 12 instructions: 5 load/store operations, 5 sign/zero extension operations, 1 multiplication instruction and one bitwise not.
All 12 instructions are limited to 3-bit register addressing, targeting the subset of 8 registers (x8 - x15) defined in the C extension. They all have a single 32-bit equivalent instruction but, contrary to the C extension, not all of these equivalents can be found in the base ISA: the zero/sign extension equivalents are in Zbb (bitmanip extension) and c.mul's is in the M extension (or Zmmul). The extension containing the equivalent instruction must also be implemented to allow the compressed version: for example M or Zmmul is required to support c.mul.
Zcb loads and stores bring compressed support for smaller access sizes than the base C extension: while the C extension offers only word and double-word loads and stores (from/to generic addresses or from/to the stack), Zcb offers loads and stores for bytes and half-words. Loads can sign-extend or zero-extend the value read from memory into the destination register. Both loads and stores use memory addresses formed by adding a small immediate to a base address read from a general purpose register (there is no Zcb load/store dedicated to the stack pointer).


Zcmp: push, pop and paired register atomic move

Zcmp introduces 6 new instructions, 4 of which are specifically designed to reduce the code size of function calls and returns: push, pop, popret, popretz. These instructions are well suited for implementation in embedded processors and cover a large range of common operations mandated by the RISC-V ABI when calling or returning from a function.

They bundle stack frame allocation (push) / de-allocation (pop/popret), register spilling (push) / restoring (pop/popret) and function return (popret).
The stack frame is the section of the call stack used by a function call, for example to store arguments which do not fit in registers, or to save callee-saved registers when required.

The last 2 instructions are paired register moves: cm.mva01s copies data from two registers in the s0-s7 range into a0 and a1 (the two s* registers can be the same) and cm.mvsa01 copies the contents of a0 and a1 into two distinct registers of the s0-s7 range. The behaviour of these two instructions is atomic: the intermediate state where only one of the two moves has been performed is never architecturally visible. Hence the two instructions do not expand into a single equivalent instruction, or even into a sequence of two instructions which could be split or interrupted.

Zcmt: table jump

The Zcmt extension introduces support for an indexed jump to an address stored in a jump table, reducing a sequence of multiple instructions to a single instruction as long as the jump target can be stored in a 256-entry jump table.

The Zcmt extension introduces a new Control/Status Register (CSR), JVT, and adds two new instructions: cm.jt and cm.jalt. cm.jalt is an extended version of cm.jt: on top of cm.jt's behaviour it links the return address to ra, copying the address of the instruction following the cm.jalt into the ra register (allowing a later return).
The JVT register is a WARL (Write Any values, Reads Legal values) CSR: any value can be written without raising an exception, but only legal values can be read back. This register contains a 6-bit mode field (located in the least significant bits), whose non-zero values are reserved for future use, and an (XLEN - 6)-bit wide base field.
If JVT.mode is a reserved value then the cm.jt and cm.jalt encodings are also reserved: an implementation may raise an illegal instruction exception when executing them. If JVT.mode is 0, then cm.jt/cm.jalt decode successfully and an 8-bit index value is extracted from the opcode bitfield located between bits 2 and 9. Both instructions share the same opcode fields; the distinction is made on the index value: if index < 32 the instruction is decoded as cm.jt, else it is decoded as cm.jalt.
The instruction builds a table_address as JVT.base + (index << 2) for RV32 or JVT.base + (index << 3) for RV64, and extracts the jump target address as the content of program memory at table_address. The hart then jumps to that address (and links the return address to ra in the case of cm.jalt only).
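The decode and address computation described above can be modeled in a few lines of Python (a hedged sketch following the text, not RTL; the function name is ours):

```python
# Model of the cm.jt/cm.jalt index decode and table-address computation.

def decode_zcmt(index, jvt_base, xlen=64):
    # index comes from the 8-bit opcode field; values below 32 select
    # cm.jt, the rest select cm.jalt (which additionally links ra).
    mnemonic = "cm.jt" if index < 32 else "cm.jalt"
    shift = 2 if xlen == 32 else 3  # table entries are XLEN/8 bytes wide
    table_address = jvt_base + (index << shift)
    return mnemonic, table_address

assert decode_zcmt(5, 0x8000_0000, xlen=32) == ("cm.jt", 0x8000_0014)
assert decode_zcmt(40, 0x8000_0000, xlen=64) == ("cm.jalt", 0x8000_0140)
```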

This is illustrated by the following diagram: in this example the hart control flow eventually jumps to the address 0x1337 stored in the i-th entry of the jump table.

Zc* benchmark studies (spreadsheet) show up to 10% code size gains. The gain is twofold: transforming a sequence of up to two instructions into a single instruction plus a table entry, and factorizing that table entry among many different program locations: as the number of call sites grows, the asymptotic cost becomes a single instruction.

Encoding and decoding

The encoding of compressed instructions was studied in the previous blog post on the compressed extension C. But that post did not detail how compressed instructions can be distinguished from non-compressed instructions. This decoding step is particularly important since it determines whether an instruction occupies 16 or 32 bits, which in turn determines where the next instruction starts. Super-scalar implementations need to do this on large bit vectors within a few logic levels to ensure proper instruction throughput through decode, and smaller implementations need to do it efficiently to avoid wasting power decoding the wrong size of instruction.




The base extensions have a 7-bit opcode stored in the 7 least significant bits (visible in the R-type encoding of the figure above). The compressed instructions use a 2-bit opcode field stored in the 2 least significant bits. For all base extensions (including the V extension), the 2 least significant bits (out of 7) always take the value 0x3, while they can take the values 0x0, 0x1 or 0x2 for compressed instructions. Thus only 2 bits are required to discriminate between compressed and uncompressed instructions and to determine how many bytes (2 or 4) must be considered to get the complete instruction half-word/word.
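The discrimination rule above fits in a one-line check (a sketch; the function name is ours):

```python
# The 2 least significant bits of a 16-bit instruction parcel tell
# compressed (16-bit) and base (32-bit) encodings apart.

def instruction_length_bytes(parcel16):
    # 0b11 means a 32-bit (base) instruction; 0b00/0b01/0b10 mean 16-bit.
    return 4 if (parcel16 & 0b11) == 0b11 else 2

assert instruction_length_bytes(0x0001) == 2  # c.nop (LSBs = 0b01)
assert instruction_length_bytes(0x0013) == 4  # low parcel of addi x0,x0,0 (LSBs = 0b11)
```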


Compatibility

Instructions of the Zcmp and Zcmt extensions reuse encodings of compressed double precision floating-point loads and stores and are thus incompatible with these instructions. This is in line with the philosophy of those new extensions: improve code size for embedded platforms where floating-point instructions are less critical (and may not even be implemented in their uncompressed formats).

Conclusion

Code size is a key metric for an ISA and can greatly impact overall program performance.
In addition to the base C extension, the Zc* family of extensions provides a way to greatly improve code size for embedded applications by compressing whole program sequences into single instructions, including function entry and return sequences, ABI register moves and table jumps. The code size reduction task group has demonstrated substantial savings; their methods and results can be consulted at https://github.com/riscv/riscv-code-size-reduction. These extensions are mostly targeted at embedded systems.


RISC-V Compressed Instructions (part 1): C extension

RISC-V base ISAs (RV32I and RV64I) define 32-bit wide instructions. These instructions follow the standard RISC instruction set architecture pattern: they fall into a limited set of possible encodings, with opcode, register indices and immediate fields always located in the same bit fields, making them easier to decode. RISC-V instructions can have up to 2 source registers and 1 destination register, while some extensions add instructions with a third source register (e.g. fmadd). Each register is encoded as a 5-bit index: there are 32 general purpose registers which can be read or written by an instruction (you can find more information on the RISC-V register file in this post). This means that up to 15 bits of the instruction word may be dedicated to source/destination encoding, leaving the other 17 bits for immediate and function encodings.

Most programs rely heavily on a small subset of instructions which have a very large static and dynamic frequency compared to all other instructions: static frequency counts occurrences in the program binary (e.g. from an objdump) while dynamic frequency counts occurrences during execution (e.g. from an execution trace). Code structure (loops, functions, ...) and input values affect dynamic and static frequencies differently.

The fact that instructions appear with different frequencies can be exploited by ISA design to reduce code size, by encoding the most used instructions on fewer bits than a general instruction. This can benefit both low power and high performance targets: less memory is required to store instructions, more instructions fit in caches, fetching N bytes provides more uops... at the cost of more complex decoding and with the caveat that instruction addresses are no longer aligned to a 32-bit boundary in memory.

Early on, RISC-V was extended with compressed instructions, and multiple standard extensions now offer compressed instructions (with encodings shorter than the initial 32 bits and/or fusing operations from multiple standard instructions). We have divided this survey of RISC-V compressed instructions into two posts: this post reviews some aspects of the first compressed extension, the C extension, while the second one will review the newer Zc* extensions.

RISC-V C extension

To allow code size improvements, the RISC-V base ISA was rapidly complemented with the Compressed extension, a.k.a. the C extension. This extension is defined in Chapter 16 of the unprivileged specification (see pdf).

The C extension defines 9 new instruction formats, each fitting in 16 bits. It is not a standalone extension but is built on top of RV32I or RV64I. It defines a little over 40 new instructions, some of them only accessible for larger XLEN (RV64 or RV128) or limited to smaller XLEN (RV32FC), and some requiring the floating-point extension (F) or the double-precision extension (D). Each instruction in the C extension has an uncompressed counterpart (and can be expanded to this counterpart by the micro-architecture to simplify instruction decoding and execution).

In an assembly program, compressed instructions are easily distinguishable: they start with the prefix "c.", e.g. c.addi is the compressed add with a non-zero 6-bit immediate. We will review some of the compressed instruction formats and how they compare with their expanded counterparts: arithmetic operations with immediate, arithmetic operations with registers, and loads/stores with a focus on loads from the stack.

CI-type: compressed operations with immediate

The following diagram illustrates the differences between the encoding of addi in the base ISA (top, 32-bit wide) and in the C extension (bottom, 16-bit wide): the immediate has been reduced from 12 to 6 bits, the same register index is used for both source and destination, and the operation is encoded on 5 bits rather than 10. Note that the register index is still 5-bit wide: all 32 registers can be addressed by this opcode. For c.addi, as for other instances of the CI-type (e.g. the compressed load word from stack pointer c.lwsp), the destination must be non-zero (otherwise the encoding is reserved). For c.addi, the sign-extended immediate must also be non-zero.


Obviously c.addi is less expressive than addi: smaller immediate range, and source and destination have to be the same register. This affects all compressed instructions: they constitute a subset of their expanded counterparts, hopefully the most useful one.

CA-type: compressed arithmetic instructions

Another compressed format is the CA-type (compressed arithmetic) illustrated by the following diagram:


In the CA-type, one of the source register indices (rs1) is fused with the destination (rd), and the register indices are now 3-bit wide. Furthermore, they do not directly encode a register index; the indirection is presented in the following table (copied from the RVC standard): for example 0b100 encodes x12 in compressed instructions working on general purpose / integer registers. This is why the register indices are labelled rd', rs1', and rs2' (rather than rd, rs1, and rs2). The encoded registers were selected because they are the most frequently used ones (the specification notes that the RISC-V ABI was even changed to ensure the 8 registers mapped in RVC were among the most used). The following table (copied from the RISC-V unprivileged specification, chapter 16, section 16.2 on the C extension instruction formats, page 100) provides the mapping between the 8 possible encoded register indices in the CA-type and the actual general purpose or floating-point registers, alongside the ABI register names.
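The 3-bit register mapping can be sketched as a simple offset (a model of the table described above; names are ours):

```python
# The 8 encodable RVC register values map to x8-x15 (ABI: s0, s1, a0-a5).

RVC_ABI_NAMES = ["s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5"]

def rvc_register(idx3):
    assert 0 <= idx3 <= 7
    return 8 + idx3  # rd'/rs1'/rs2' = 0b000 -> x8, ..., 0b111 -> x15

assert rvc_register(0b100) == 12     # the example above: 0b100 encodes x12
assert RVC_ABI_NAMES[0b100] == "a2"  # x12 is a2 in the ABI
```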


Both previous examples of RVC encodings illustrate that compressed instructions are less expressive than their base extension counterparts and use larger parts of the opcode space. This is the price to pay for shorter code, and it is balanced by the fact that they are more heavily used than the expanded versions, providing an overall code size reduction.

Compressed load and stores

The C extension defines 2 types of compressed loads and stores.
The first type uses a memory address based on the stack pointer (x2): c.[f]lwsp, c.[f]ldsp, c.lqsp. These instructions use a 6-bit offset, scaled by the data size and added to the base address read from the stack pointer register. The destination register index is encoded on 5 bits, and the encoding with rd=0 is reserved. Since the base address register is always x2, there is no need to encode it in the instruction word. There is no floating-point version of the quad-word load. The store counterparts exist: c.[f]swsp, c.[f]sdsp, c.sqsp.

The second type of compressed load/store uses a register-based address: 2 registers are encoded, the base address rs1' and the load destination rd' (respectively the store source rs2'). Each of these 2 registers is encoded on 3 bits and uses the reduced RVC register number encoding listed previously.

Both types of compressed loads and stores use specific immediate encodings where the bit order might look a bit scrambled. According to the specification, the split of the immediate into various bit fields and their ordering were chosen to ensure that most immediate bits could be found in the same location across various families of instructions.

The following diagram illustrates the difference between the encodings of lw rd, offset(x2) (I-type, top part) and c.lwsp rd, offset(sp) (CI-type, bottom part). In the compressed encoding, the 6-bit offset encodes a multiple of 4 bytes: it is implicitly scaled by 4 before being added to the stack pointer sp/x2. Bit offset[5] is stored in opcode position 12, as in other CI-type instructions, and the lower 5-bit immediate field stores offset[4:2] in its 3 most significant bits and offset[7:6] in its 2 least significant bits. rd cannot be equal to 0 in the compressed instruction encoding.
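The scrambled offset reconstruction can be sketched as follows (a model of the c.lwsp bit layout described above; the function name is ours):

```python
# Reassemble the c.lwsp byte offset from its scattered encoding:
# offset[5] in bit 12, offset[4:2] in bits 6:4, offset[7:6] in bits 3:2.

def c_lwsp_offset(inst16):
    uimm5 = (inst16 >> 12) & 0b1
    uimm4_2 = (inst16 >> 4) & 0b111
    uimm7_6 = (inst16 >> 2) & 0b11
    # offset[1:0] are implicitly zero (the offset is a multiple of 4)
    return (uimm7_6 << 6) | (uimm5 << 5) | (uimm4_2 << 2)

# c.lwsp a0, 4(sp) encodes as 0x4512 (rd=10, offset=4):
assert c_lwsp_offset(0x4512) == 4
```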



Control flow instructions: compressed jump and branch

The compressed extension contains 6 control transfer instructions: 2 unconditional jumps with an immediate offset (added to pc): c.j and c.jal; 2 unconditional jumps with register targets: c.jr and c.jalr; and two branches: c.beqz and c.bnez. Unconditional jumps only exist with an immediate offset or a direct register target, and branches only exist with an immediate branch offset and a comparison to zero.

Example of compressed assembly snippet

Let us now consider an illustration of the impact of the C extension on a toy program: a vector add function.

Using godbolt.org compiler explorer (a wonderful tool) we can compile a RV64 program assuming no support for C extension, or assuming it is supported: compiler explorer result.

The example program is simply:

int vector_add(float* r, float* a, float* b, unsigned n) {
    unsigned i = 0;
    for (;i < n; ++i) r[i] = a[i] + b[i];
    return n;
}

The first assembly was compiled by clang 15.0.0 for RV64 with options: "-O2 -march=rv64imfd".

(it also contains a 2-instruction main function)

vector_add:
 beqz	a3,6b4 <vector_add+0x30>
 slli	a4,a3,0x20
 srli	a4,a4,0x20
 flw	ft0,0(a1)
 flw	ft1,0(a2)
 fadd.s	ft0,ft0,ft1
 fsw	ft0,0(a0)
 addi	a1,a1,4
 addi	a2,a2,4
 addi	a4,a4,-1
 addi	a0,a0,4
 bnez	a4,690 <vector_add+0xc>
 mv	a0,a3
 ret

The second assembly was compiled with the same compiler and options: "-O2 -march=rv64imfdc" (the only difference is the addition of the C extension).

vector_add:
 beqz	a3,6a6 <vector_add+0x22>
 slli	a4,a3,0x20
 srli	a4,a4,0x20
 flw	ft0,0(a1)
 flw	ft1,0(a2)
 fadd.s	ft0,ft0,ft1
 fsw	ft0,0(a0)
 addi	a1,a1,4
 addi	a2,a2,4
 addi	a4,a4,-1
 addi	a0,a0,4
 bnez	a4,68c <vector_add+0x8>
 mv	a0,a3
 ret

The two assembly listings look very similar (and they mostly are) but one can notice a few differences. Let's start with the similarities: the instruction mnemonics and the registers used are identical; there are as many instructions in both listings (14) and they use the same registers (same sources and destinations).

As for the differences: the immediate offsets differ (on beqz and bnez) because the instruction encoding lengths differ. Let's look at the second compilation result annotated with the program addresses (left, in hexadecimal) and the opcodes (in grey, above each instruction):


Out of the 14 instructions, 9 are encoded on 2 bytes. Without the C extension the code occupies 56 bytes, and only 38 bytes when the C extension is enabled, a 32% saving. The main loop (from address 0x68c to 0x6a4) contains 9 instructions, including 6 compressed ones. The asymptotic dynamic saving with compressed instructions is similar to the static code size saving.

One may ask why the compressed version of flw (c.flw) was not used by the compiler; the reason is simple: this instruction only exists for RV32FC, not RV64FC.

Another observation is that the 2-byte code alignment of branch targets is visible in the target of the first branch beqz which, when taken, branches to address 0x6a6: a 2-byte aligned, but not 4-byte aligned, address.

Conclusion

The C extension provides a reduced encoding for a subset of the base instructions. The subset was selected to cover the most used instructions in their most used form. The reduced encoding allows for code size improvements which benefit both low power and high performance implementations. 

For a more in-depth overview of the C extension, we invite you to read Andrew Waterman's master's thesis, which contains many insights on the impact of compressed instructions on RISC-V programs (link to pdf).


RISC-V Vector Register Groups

The RISC-V Vector Extension (RVV) defines and exploits the concept of vector register groups. This post clarifies what a register group is, how it is useful, and the caveats associated with it.

A single vector instruction can operate on operands each consisting of more than one register and can produce similarly extended results.

A group is a bundle of multiple registers. RVV 1.0 supports groups of 1, 2, 4 or 8 registers. The indices of the registers in a group span a contiguous range and the lowest index must be a multiple of the group size. For example, vector registers [4, 5, 6, 7] form a valid group of size 4, but [5, 6] is not a valid group of size 2. The lowest index is used to specify the vector register group in an instruction opcode: e.g. when LMUL=4, v4 stands for v4v5v6v7. Thus the actual sizes of the operands (and destination) of an instruction depend on the context (the vtype configuration, as we will see later): vadd.vv v4, v0, v12 may operate on 1, 2 or 4 register wide groups depending on the context (8 is not a legal size since neither v4 nor v12 is aligned to an index multiple of 8).
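The alignment rule above can be sketched as a one-line check (the function name is ours):

```python
# A group of LMUL registers is named by its lowest index,
# which must be a multiple of the group size.

def is_valid_register_group(first_index, lmul):
    return first_index % lmul == 0

assert is_valid_register_group(4, 4)      # v4v5v6v7 is a valid 4-group
assert not is_valid_register_group(5, 2)  # [v5, v6] is not a valid 2-group
assert not is_valid_register_group(4, 8)  # v4 cannot start an 8-group
```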

The size of a vector register group is called the vector length multiplier (a.k.a LMUL) in RVV specification: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#342-vector-register-grouping-vlmul20

For an operation, LMUL is set by the vlmul field of vtype (so it is usually modified through vset* instructions). The following RISC-V assembly sequence first sets LMUL to 8 and then executes a double-precision vector addition.

	vsetvli	x1, zero, e64, m8, ta, mu
	vfadd.vv	v16, v8, v24

Since the LMUL value is 8, the actual sequence of operations is equivalent to the following (assuming each instruction operates on a single-register wide group):

	vfadd.vv	v16, v8,  v24
	vfadd.vv	v17, v9,  v25
	vfadd.vv	v18, v10, v26
	vfadd.vv	v19, v11, v27
	vfadd.vv	v20, v12, v28
	vfadd.vv	v21, v13, v29
	vfadd.vv	v22, v14, v30
	vfadd.vv	v23, v15, v31

LMUL encoding in vtype

In RVV 1.0, vlmul can take 4 integral LMUL values: 0b000 (LMUL=1), 0b001 (LMUL=2), 0b010 (LMUL=4), 0b011 (LMUL=8). It can also take 3 fractional values: 0b111 (LMUL=1/2), 0b110 (LMUL=1/4), 0b101 (LMUL=1/8), but more on that in a future post.

LMUL vs EMUL

The actual length multiplier used for an operation's operand or destination is called EMUL, for effective LMUL. EMUL depends on vlmul, but also on the operation. For example EMUL=LMUL for any source or destination of vadd.vv, but EMUL=2*LMUL for both vd and vs2 of vwadd.wv (because of its widening characteristic). EMUL can never be larger than 8 (a vector register group cannot exceed 8 registers), so LMUL=8 is reserved for widening instructions: it is not supported in RVV 1.0.

Some operations which embed the effective element width (EEW) in their opcode define EMUL to preserve the ratio equality EEW/EMUL = SEW/LMUL. For example if LMUL=1 and SEW=8 bits, then vle32 has EEW=32 bits and EMUL=EEW*LMUL/SEW=4; this also means the destination register group will have to be described by a register whose index is a multiple of 4 (EMUL).
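The EMUL rule above can be sketched as follows (a model following the ratio equality in the text; the function name is ours):

```python
from fractions import Fraction

# EMUL is chosen so that EEW/EMUL = SEW/LMUL, i.e. EMUL = EEW * LMUL / SEW.

def effective_lmul(eew, sew, lmul):
    emul = Fraction(eew * lmul, sew)
    assert Fraction(1, 8) <= emul <= 8, "EMUL outside the legal [1/8, 8] range"
    return emul

# The example above: LMUL=1, SEW=8 and vle32 (EEW=32) gives EMUL=4,
# so the destination register index must be a multiple of 4.
assert effective_lmul(32, 8, 1) == 4
```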

Since EMUL can differ between an instruction's operands and its results, partial or complete overlap between register groups is a possibility. RVV 1.0 authorizes 3 types of overlap between source and destination register groups. Those overlaps are specified here https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#52-vector-operands and illustrated by the figure below (e.g. the v0v1v2v3 register group in various source/destination EEW configurations).


Benefits of register grouping

There are several benefits associated with the availability of register groups:
  • even if architectural registers are "only" VLEN-bit wide, single instructions can perform operations on larger operands and produce larger results:
    • for widening operation results or narrowing operation sources, building a register group enables the use of a full register source or the population of a full register destination
    • providing large input sets: for example for a vrgather the input data set contains VLMAX elements, where VLMAX=VLEN/SEW*LMUL increases with LMUL: one can build larger lookup tables by using larger register groups.
    • allowing reuse of operands between extended registers (e.g. immediate or scalar operands)
    • reducing code size: for data-parallel sections with vector lengths larger than what fits in a single vector register, fewer instructions are required to operate on the same amount of data.
    • reducing the number of loop iterations: since VLMAX increases with LMUL, fewer stripmining iterations are required when LMUL increases.
  • exposing more opportunities to the micro-architecture:
    • even if a micro-architecture cannot operate directly on register groups larger than VLEN bits, it can still exploit micro-op parallelism and chaining opportunities exposed by larger register groups.
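The VLMAX relation used in the list above can be sketched in a couple of lines (the function name is ours):

```python
# VLMAX = VLEN / SEW * LMUL: the number of elements a register group holds.

def vlmax(vlen, sew, lmul):
    return vlen * lmul // sew

assert vlmax(128, 32, 1) == 4    # one 128-bit register of 32-bit elements
assert vlmax(128, 32, 8) == 32   # an 8-register group holds 8x as many
```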

Constraints induced by register grouping

One of the caveats of register grouping is that the number of architectural register groups decreases as the length multiplier grows. For example, there are only 4 8-wide register groups: v0, v8, v16, v24. This means that register allocation pressure can be a limiting factor when using larger register groups: it becomes more difficult to keep all simultaneously live variables in registers, and memory spilling may be required.
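The shrinking pool of groups is easy to enumerate (a one-line sketch of our own, using the rule that a group's base index must be a multiple of LMUL):

```python
def register_groups(lmul, num_regs=32):
    """List the base registers of the groups available at a given LMUL."""
    return [f"v{i}" for i in range(0, num_regs, lmul)]

print(register_groups(8))       # -> ['v0', 'v8', 'v16', 'v24']
print(len(register_groups(4)))  # -> 8
```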

This pressure can be compounded by the fact that v0 is used as the mask operand in RVV 1.0: when using masked operations with large register groups, v0 often prevents a program from using the first register group. For example, when LMUL=4, if one wants to use v0 as the mask input, then one cannot use the v0,v1,v2,v3 group. This group can still be partially exploited, with LMUL=1 for example to build the v0 mask, or with LMUL=2 (v2,v3).

The availability of larger inputs may not be well supported by the implementation: for example, a LMUL=8 input data set for a vrgather may not be supported with the same latency and/or throughput as a vrgather with LMUL=1, since each result element may originate from a larger pool. Thus, even if legal, such an instruction configuration may not be efficient, but the implementation still needs to support it one way or another.

Conclusion

Vector register groups and the associated length multiplier are interesting concepts which can be leveraged thanks to the vector-length-agnostic character of RVV: software built around the vector length parameter can be agnostic of VLEN and thus easily extended to VLEN*LMUL register groups. If you want to learn more about the RISC-V Vector Extension you may consider reading our RVV in a Nutshell blog series starting with part 1.

How to read a RISC-V Vector assembly instruction

In our five-and-a-half-part blog series, RVV in a Nutshell, we presented the basics of the RISC-V Vector extension (RVV 1.0), but even after this overview some aspects of this large extension can still seem difficult to apprehend. In this sub-series, we will review some of those aspects in more detail. For example, it can be difficult to interpret an RVV assembly instruction. Let's review the main components of such instructions and study various examples.

Overview

We will draw some generalities from a practical example: a masked integer vector-scalar addition vadd.vx v12, v3, x4, v0.t (the .vx variant takes a scalar x register as its second source).
The following diagram illustrates the 6 main components of any RVV assembly instruction. Most of those components have numerous variants and some of them are optional.



RVV assembly instruction field description

Mnemonic

The first component is the mnemonic, which describes the operation to be performed by the instruction (e.g. in our case vadd performs a vector add, while vcompress packs the active elements of a vector according to a mask).

The mnemonic often describes the destination type: for example vadd is a vector add with a single-width destination, while vwadd is a widening vector add and vmadc is a vector addition with carry returning a mask of output carries.

Operand type(s)

The second component is the operand type(s). It describes what type of operand(s) the operation is performed on. The most common is .vv (vector-vector), which often means that the instruction admits at least two inputs, both single-width vectors.

The list of various possibilities includes:

  • .vv operation between two (or three) single-width vectors, e.g. vmul.vv
  • .wv operation between two (or three) vectors, where vs1 is single-width while vs2 and vd are wide (EEW=2*SEW) operands/destination, e.g. vwadd.wv
  • .vx / .vf operation between one or multiple vectors and a scalar (general purpose register x or floating-point register f), e.g. vfadd.vf; the scalar operand is splatted to build a vector operand
  • .wx / .wf: operation between a scalar and a wide second vector operand, e.g. vfwadd.wf
  • .vi operation between one or multiple vectors and an immediate. The immediate is often 5 bits wide, encoded in the rs1/vs1 field, e.g. vadd.vi
  • .vs: operation between a vector and a single element (scalar) contained in a vector register, e.g. vredsum.vs (the single scalar is used to carry the reduction accumulator)
  • .vm: operation between a vector and a mask, e.g. vcompress.vm
  • .v: operation with a single vector input, e.g. vmv2r.v, may also be used for vector loads (which have a scalar operand for the address on top of a single data vector operand: vle16.v)
  • .vvm / .vxm / .vim operation vector-vector / vector-scalar / vector-immediate with a mask operand (e.g. vadc.vvm, addition between two vectors with an extra mask operand constituting an input carry vector)

Conversions constitute a category of their own for the operand types, because the mnemonic suffix describes the destination format, the source format, and the type of operand. For example, vfcvt.x.f.v is a vector (.v) conversion from floating-point elements (.f) to signed integer (.x) result elements. .xu is used to indicate unsigned integers, and .rtz indicates a static round-towards-zero rounding mode.
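The decomposition described so far can be sketched as a small parser (our own illustrative logic, not a spec-defined grammar; it handles the common opcode/destination/sources/mask shape, not every corner case):

```python
def parse_rvv(asm):
    """Split an RVV assembly instruction into its main components."""
    parts = asm.replace(",", " ").split()
    opcode, operands = parts[0], parts[1:]
    # mnemonic is the part before the first dot, the suffix is the rest
    mnemonic, _, suffix = opcode.partition(".")
    # an optional trailing v0.t marks a masked operation
    masked = bool(operands) and operands[-1] == "v0.t"
    if masked:
        operands = operands[:-1]
    return {"mnemonic": mnemonic, "suffix": suffix,
            "dest": operands[0], "sources": operands[1:], "masked": masked}

print(parse_rvv("vadd.vx v12, v3, x4, v0.t"))
```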


Destination and source(s)

In the assembly instruction, the destination and sources follow the mnemonic. The destination is the first register to appear, followed by one or multiple sources.
Each of these operands encodes a register group. The destination and source register groups are represented by the first register in the group (for example, if LMUL=4, then v12 represents the 4-wide register group v12v13v14v15). Thus the actual register group depends not only on the assembly opcode but also on the value of vtype: it is context sensitive. Most RVV operations have a vector destination, denoted by vd; some may have a scalar destination (e.g. vmv.x.s with an x register destination or vfmv.f.s with an f register destination) and others have a memory destination, such as the vector stores, e.g. vse32.v.
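The context-sensitive expansion of a group operand can be modelled in a few lines (a sketch of our own):

```python
def expand_group(base_reg, lmul):
    """List the registers covered by a group operand at a given LMUL."""
    idx = int(base_reg[1:])
    assert idx % lmul == 0, "group base must be aligned to LMUL"
    return [f"v{idx + i}" for i in range(lmul)]

print(expand_group("v12", 4))  # -> ['v12', 'v13', 'v14', 'v15']
```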

There can be one or two sources: vs2 and vs1 for vector-vector instructions. If the operation admits a scalar operand or an immediate operand, then vs1 is replaced by rs1 (respectively imm), e.g. vfadd.vf v4, v2, ft3. Vector loads have a memory source, e.g. vloxei8.v vd, (rs1), vs2 [, vm], which has a scalar register as the address source and a vector register (vs2) as the index source.

RVV also defines 3-operand instructions, e.g. vmacc.vv. For those operations the destination register vd is both a source and a destination: the operation is destructive, since one of the source operands is overwritten by the result.

Mask operand

Most RVV operations can be masked: in such a case the v0 register is used as a mask to determine which elements are active; the inactive elements of the result are either copied from the old destination value or filled with a pre-determined pattern. RVV 1.0 only supports the true bit as the active encoding: an element is considered active if the bit at the corresponding index in v0 is set to 1, and inactive if it is 0. This is what is encoded by the last operand of our example: v0.t (for v0 "true"). If this last operand is missing, then the operation is unmasked (all body elements are considered active).
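The mask-undisturbed behaviour (inactive elements keep the old destination value) can be modelled element-wise, as in this Python sketch of our own:

```python
def masked_add(old_vd, vs2, vs1, v0_mask):
    """Model of a masked vadd with the mask-undisturbed policy:
    active elements (mask bit 1) get the sum, inactive ones keep old vd."""
    return [a + b if m else old
            for old, a, b, m in zip(old_vd, vs2, vs1, v0_mask)]

old_vd = [0, 0, 0, 0]
vs2    = [1, 2, 3, 4]
vs1    = [10, 10, 10, 10]
mask   = [1, 0, 1, 0]
print(masked_add(old_vd, vs2, vs1, mask))  # -> [11, 0, 13, 0]
```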

More information can be found in this post of the original series: RVV in a nutshell (part 3): operations with and on masks.

Conclusion

We hope this post has shed some light on the syntax of RISC-V Vector assembly instructions. We will review other concepts related to the vector extension in future posts.

Support of half precision floating-point numbers in RISC-V ISA: Zfh and Zfhmin

Floating-point support in RISC-V ISA

RISC-V does not mandate the support of any floating-point operations as part of the base ISA (RV32I and RV64I), but single-precision and double-precision extensions (e.g. RV64F and RV64D) were among the first to be specified. The F extension added 32 floating-point registers, floating-point arithmetic operations, and move and conversion instructions. The D extension (which requires the F extension) extended this set of operations to also support double precision. There also exists a Q extension for quad precision.

In reality, there are multiple F and D extensions: RV32F (to extend the 32-bit base ISA) and RV64F (for the 64-bit base ISA); similarly there are RV32D and RV64D. RV32D has the particularity of adding 64-bit wide floating-point registers while the associated general purpose registers are only 32-bit wide. This flexibility of the RISC-V ISA is reviewed in the post RISC-V Register Files.

Initially, smaller floating-point formats, such as half precision, were not supported.

Half precision used to lag behind in terms of ISA and hardware support. It was only specified as a storage format in the 2008 revision of the IEEE-754 standard (the IEEE standard specifying floating-point formats and operations). The momentum of deep learning and convolutional neural networks has kick-started a renewed interest in small number formats, and in particular half precision (among many others).

Note: In the IEEE-754 2019 revision, half precision is still not defined as a basic format but as an interchange format, although the standard is now more permissive in terms of which formats can admit arithmetic operations.

RISC-V International (the association behind the RISC-V specification(s)) recently ratified two extensions specifying sets of instructions for half-precision support: Zfh and Zfhmin. The latter being defined as a subset of the former, we will review Zfh first.

Zfh: full half-precision support

Zfh extends the floating-point register file (FRF) to support half precision. It requires the F extension, so no new register file nor any register width extension is needed, since the floating-point registers are already wide enough to contain single-precision values.
Like single-precision values in RV32D/RV64D, half-precision values are stored in the larger FLEN-bit registers using the RISC-V NaN-boxing scheme (more on that in a future post).
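The NaN-boxing idea can be sketched in a few lines of Python (our own model, assuming FLEN=32 for the example): the 16-bit half-precision pattern sits in the low bits of the register and every upper bit is set to 1, so a non-boxed read of the register sees a NaN.

```python
def nan_box_h(h_bits, flen=32):
    """NaN-box a binary16 bit pattern into an FLEN-bit register image."""
    assert 0 <= h_bits < (1 << 16)
    upper_ones = ((1 << (flen - 16)) - 1) << 16  # all upper bits set
    return upper_ones | h_bits

boxed = nan_box_h(0x3C00)   # 0x3C00 encodes 1.0 in half precision
print(hex(boxed))           # -> 0xffff3c00
```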

This new extension defines:
  • Instructions to move data (with or without conversion):
    • flh: load from memory into an F register
    • fsh: store to memory from an F register
    • fmv.h.x, fmv.x.h: bit pattern move between X and F register files (unmodified)
    • fcvt.h.(w/l)[u], fcvt.(w/l)[u].h: conversions  between X and F register files (from/to integer)
  • Arithmetic operations: fadd.h, fsub.h, fmul.h, fsqrt.h, fdiv.h, fmin.h, fmax.h, f(n)madd/f(n)msub.h
  • Floating-point comparisons: feq.h, flt.h, fle.h
  • Conversions between half precision and other floating-point formats
  • Miscellaneous: fclass.h, fsgnj(/n/x).h

All arithmetic operations operate on uniform I/Os: all operands are half-precision values, and the output is a half-precision result. They share their opcode with the corresponding F/D instructions: in the fmt field (bits 25 and 26), the value 2'b10 (2) encodes half precision (2'b00 encodes single precision, 2'b01 double, and 2'b11 quad, but who needs such precision!).
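The fmt encodings above, and the binary16 format itself, can be poked at from Python: the struct module's 'e' format code packs and unpacks IEEE-754 half-precision values (an illustration on the host CPU, not RISC-V hardware).

```python
import struct

# fmt field values from the paragraph above
FMT = {"S": 0b00, "D": 0b01, "H": 0b10, "Q": 0b11}

half_bytes = struct.pack("<e", 1.5)        # encode 1.5 as binary16
print(half_bytes.hex())                    # -> '003e' (0x3E00, little-endian)
print(struct.unpack("<e", half_bytes)[0])  # -> 1.5
print(FMT["H"])                            # -> 2
```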


The following diagram represents the data move instructions:



The availability of moves to/from and conversions with 64-bit operands is conditioned on the availability of RV64 (general purpose moves and integer conversions) and of the D extension (floating-point conversions).
Although certainly anecdotal, support for the Q extension (floating-point quad precision, a.k.a. the 128-bit format) is also covered: instructions for conversions from/to quad precision are defined, fcvt.h.q and fcvt.q.h.

Zfhmin: reduced half-precision support

The extension Zfhmin can be seen as a subset of Zfh. It only mandates support for half-precision loads and stores, bit pattern moves from/to the integer register file (no conversions with integer formats), and conversions with other floating-point formats. It represents a total of 6 instructions (extended to 8 with conversions from/to the double-precision format if the D extension is supported).
Zfhmin constitutes a reduced set of instructions which can be used by platforms where computing directly with half-precision values is not required but which still need the capability to manipulate them, in particular in memory, before converting them to a larger format for eventual computation.

Vector support for half precision: Zvfh and Zvfhmin

The RISC-V Vector extension (RVV) version 1.0 specified vector support for single and double precision (SEW=32 and 64 bits). A draft of the specification (link to source) introduces Zvfh, which extends all the floating-point instructions to half precision, including conversions with other formats, and Zvfhmin, which is a much reduced subset of operations. As is the case for other floating-point formats, half precision is supported by a specific value of the vsew field of the vtype configuration register: vsew=1. This encoding corresponds to a Selected Element Width (SEW) of 16 bits and is identical for both integer and floating-point formats.

Zvfhmin only mandates the support of conversion between half and single precision: it extends the support of vfwcvt.f.f.v (widening half-to-float) and vfncvt.f.f.w (narrowing float-to-half) to SEW=16-bit.
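The behaviour of this conversion pair can be modelled element-wise in plain Python (our own sketch, not hardware): widening half to float is always exact, while narrowing float to half may round.

```python
import struct

def narrow_to_half(x):
    """Round a float to the nearest binary16 value (struct 'e' format),
    modelling one element of a vfncvt.f.f.w narrowing conversion."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(narrow_to_half(1.0) == 1.0)   # -> True: 1.0 is exact in binary16
print(narrow_to_half(0.1) == 0.1)   # -> False: 0.1 rounds in binary16
```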

Zvfh extends to half precision all the floating-point vector operations, floating-point reductions, and floating-point moves. A brief description of those operations can be found in Part 2 of our series RISC-V Vector Extension in a Nutshell.

Both Zvfhmin and Zvfh mandate the support of single-precision elements in vectors. On top of this support, Zvfh mandates at least Zfhmin on the scalar floating-point side.

Half precision in RVA22 profile

The RISC-V consortium defines profiles. These profiles aim at defining a common set of mandatory extensions and a reduced set of optional extensions, which can be used by hardware and software providers to build a compatible ecosystem without having to deal with more specialized ISA extensions. Profile descriptions can be found on the RISC-V github.

RVA22 is the most recent profile; it is dedicated to 64-bit application processors.

Zfhmin is part of the mandatory extensions of the RVA22 profile, while Zfh is an optional extension (which supersedes Zfhmin when selected). This means that all application processors targeting compatibility with the RISC-V ecosystem must have minimal support for half precision, and that extended support is part of the extended profile. Neither of the vector extensions Zvfh and Zvfhmin is required in the RVA22 profile.

Conclusion

RISC-V half-precision support in scalar operations has already been ratified (as extensions Zfh and Zfhmin) and is part of the latest application profile (RVA22). There exist draft specifications for the support of half precision in vector operations: Zvfh and Zvfhmin. Other formats, such as BFloat16 or more esoteric number formats, should follow in the coming years.
RISC-V is an open community, so do not hesitate to sign up to stay up-to-date and participate in the effort: http://riscv.org.
