RISC-V base ISAs (RV32I and RV64I) define 32-bit wide instructions. Those instructions follow the standard RISC instruction set architecture pattern: they fall into a limited set of possible encodings with opcode, register indices and immediate fields always being located in the same bit fields, making it easier to decode. RISC-V instructions can have up to 2 source registers and 1 destination register, with the possibility to use the destination as an extra source (e.g. fmadd). Each register is encoded on a 5-bit index: there are a total of 32 general purpose registers which can be used or written by an instruction (you can find more information on RISC-V register file in this post). This means that up to 15 bits of the instruction word may be dedicated to source/destination encoding, leaving the other 17 bits for immediate and function encodings.
Most programs will often rely heavily on a small subset of instructions which will have a very large static and dynamic frequency compared to all other instructions: static frequency counts the occurrences in the program binary (e.g. from an objdump) while dynamic frequency counts the occurrences during the execution (e.g. from an execution trace). Code structure (loop, functions, ...) and input values affect differently dynamic and static frequencies.
The fact that instructions appear with different frequencies can be exploited by ISA design to reduce code size by encoding the most used instructions on less bits than a general instruction. This can be beneficial for both low power and high performance targets: less memory required to store instructions, more instructions fit in caches, fetching N bytes provides more uops ... with the cost of a more complex decoding and with the caveat that instruction addresses are not aligned to a 32-bit boundary in memory.
Early on RISC-V was extended with compressed instruction extensions. And now multiple standard extensions offer compressed instructions (with encoding less than the initial 32 bits and/or fusing operations from multiple standard instructions). We have divided the survey of RISC-V compressed instructions in two posts: this post reviews some aspects of the first compressed extension, the C extension, while the second one will review the newer Zc* extensions.
RISC-V C extension
To allow code size improvements RISC-V base ISA was rapidly complemented with the Compressed extension, a.k.a the C extension. This extension is defined in Chapter 16 of the unprivileged specification (see pdf).
The C extension defines 9 new instruction formats, each of them fitting on 16-bit. It is not a standalone extensions but is built on top of RV32I or RV64I. It defines a little over 40 new instructions, with some of them only accessible for larger XLEN (RV64 or RV128) or limited to smaller XLEN (RV32FC) and some of them requiring the floating-point extension (F) or the double precision extension (D). Each instruction in the C-extension has an uncompressed counterpart (and can be expanded to this counterpart by the micro-architecture to simplify instruction decoding and execution).
In an assembly program the compressed instructions are easily distinguishable: they start with the prefix "c.", e.g. c.addi is the compressed add with immediate with a non-zero 6-bit immediate. We will review some of the compressed instruction formats and how they compare with their expanded counterparts: arithmetic operations with immediate, arithmetic operations with registers, load/store with focus on load from the stack.
CI-type: compressed operations with immediate
The following diagram illustrates the differences between the encoding of addi in the base ISA (top, 32-bit wide) and in the C extension (bottom, 16-bit wide): the immediate has been reduced from 12 to 6 bits, the same register index is used for both source and destination and the instruction operation is encoded on 5 bits rather than 10. One thing to note is that the register index is still 5-bit wide, all the registers can be addressed in this opcode. For c.addi as for other instances of the CI-type (e.g. compressed load word from stack pointer c.lwsp) the destination must be non-zero (else encoding is reserved). For c.addi, the sign-extended immediate must also be non-zero.
Obviously c.addi is less expressive than addi: smaller immediate range, source and destination have to be the same register. This affects all compressed instructions: they constitutes a subset of their expanded counterparts, hopefully the most useful one.
CA-type: compressed arithmetic instructions
Another compressed format is the CA-type (compressed arithmetic) illustrated by the following diagram:
In the CA-type, one of the source register index (rs1) is fused with destination (rd), and the register indices are now 3-bit wide. Furthermore they do not encode directly a register index, the indirection is presented in the following table (copied from the RVC standard): for example 0b100 encodes x12 in compressed instructions working on general purpose / integer registers. This is why the register indices are labelled with rd', rs1', and rs2' (rather than rd, rs1, and rs2). The encoded registers were selected because they are the most frequently used ones (the specification notes that the RISC-V ABI was even changed to ensure the 8 registers mapped in RVC were among the most used). The following table (copied from RISC-V unprivileged specification chapter 16 section 16.2 on the C extension instruction formats, page 100) provides the mapping between the 8 possible values of encoded register indices in the CA-type and the actual general purpose or floating-point registers, alongside the ABI register names.
Both previous examples of RVC encoding illustrates that compressed instructions are less expressive than their base extension counterparts and use larger parts of the opcode space. This is the price to pay to ensure shorter code and is balanced by the fact that they are more heavily used than the extended versions providing an overall code size reduction.
Compressed load and stores
The C extensions defines 2 types of compressed loads and stores.
The first type of compressed load uses a memory address based on the stack pointer (x2): c.[f]lwsp, c.[f]ldsp, c.lqsp. These instructions use a 6-bit offset scaled by the data size and added to the base address read in the stack pointer register. The destination register index is encoded on 5 bits and the encoding with rd=0 as the destination register is reserved. Since the base address register is always x2 there is no need to encode it in the instruction word. There is no floating-point version of the quad-word load. The store counterparts exist: c.[f]swsp, c[f]sdsp, c.sqsp.
The second type of compressed load/store uses a register-base address: 2 registers are encoded the base address rs1' and the load destination rd' (respectively store source rs2'). Each of those 2 registers is encoded on 3 bits and uses the reduced RVC Register Number encoding listed previously.
Both types of compressed loads and stores uses specific immediate encoding where bit order might look a bit scrambled. According to the specification, the immediate split into various bit fields and their ordering was chosen to ensure that most immediate bits could be found in the same location across various families of instructions.
The following diagram illustrates the difference between the encoding of the lw rd, offset(x2) (I-type, top part) and the c.lwsp rd, offset(sp) (CI type, bottom part). In the compressed encoding, the 5-bit offset is used to encode a 4-byte offset and is implicitly multiplied by 4 before being added to the stack pointer sp/x2. The 5th bit is stored in opcode position 12 as other instructions of type CI and the lower 5-bit immediate field stores the offset [4:2] has its most 3 significant bits and offset [7:6] has its 2 least significant bits. rd cannot be equal to 0 in the compressed instruction encoding.
Control flow instructions: compressed jump and branch
The compressed extension contains 6 control transfers instructions: 2 unconditional jumps with immediate offset (added to pc): c.j and c.jal, 2 unconditional jumps with register targets: c.jr and c.jalr and two branches: c.beqz and c.bnez. Unconditional jumps only exist with an immediate offset or with a direct register target and the branches only exist with an immediate branch offset and with a comparison to zero.
Example of compressed assembly snippet
Let us now consider an illustration of the impact of the C extension on a toy program: a vector add function.
Using godbolt.org compiler explorer (a wonderful tool) we can compile a RV64 program assuming no support for C extension, or assuming it is supported: compiler explorer result.
The example program is simply:
int vector_add(float* r, float* a, float* b, unsigned n) {
unsigned i = 0;
for (;i < n; ++i) r[i] = a [i] + b[i];
return n;
}
The first assembly was compiled by clang 15.0.0 for RV64 with options: "-O2 -march=rv64imfd".
(it also contains a 2-instruction main function)
vector_add:
beqz a3,6b4 <vector_add+0x30>
slli a4,a3,0x20
srli a4,a4,0x20
flw ft0,0(a1)
flw ft1,0(a2)
fadd.s ft0,ft0,ft1
fsw ft0,0(a0)
addi a1,a1,4
addi a2,a2,4
addi a4,a4,-1
addi a0,a0,4
bnez a4,690 <vector_add+0xc>
mv a0,a3
ret
The second assembly was compiled with the same compiler and options: "-O2 -march=rv64imfdc" (the only difference is the addition of the C extension).
vector_add:
beqz a3,6a6 <vector_add+0x22>
slli a4,a3,0x20
srli a4,a4,0x20
flw ft0,0(a1)
flw ft1,0(a2)
fadd.s ft0,ft0,ft1
fsw ft0,0(a0)
addi a1,a1,4
addi a2,a2,4
addi a4,a4,-1
addi a0,a0,4
bnez a4,68c <vector_add+0x8>
mv a0,a3
ret
The two assembly results can look very similar (and they mostly are) but one can notice a few differences. Let's start with the similarities: the instructions mnemonic and the registers used are identical: there are as many instructions in both listings (14) and they use the same registers (same sources and destinations).
On the front of differences: the immediate offset differs (on beqz and bnez): this is because the instruction encoding lengths differ. Let's look at the second compilation result annotated with the program addresses (left, in hexadecimal) and the opcodes (in grey, above the instruction):
Out of the 14 instructions, 9 are encoded on 2 bytes. Without the C extensions the code occupies 56 bytes, and only 38 when the C extension is enabled, representing a 32% saving. The main loop (program from address 0x68c to 0x6a4) contains 9 instructions, including 6 compressed instructions. The asymptotic dynamic saving with compressed instructions is similar to the static code size saving.
One may ask why the compressed version of flw (c.flw) has not been used by the compiler; the reason is simple: this instruction only exists for RV32FC, not for RV64FC.
Another fact is that the 2-byte code alignment requirement for branch target can be witnessed for the target of the first branch beqz which, when taken, branches to address 0x6a6, a 2-byte aligned, but not 4-byte aligned, address.
Conclusion
The C extension provides a reduced encoding for a subset of the base instructions. The subset was selected to cover the most used instructions in their most used form. The reduced encoding allows for code size improvements which benefit both low power and high performance implementations.
For a more in depth overview of the C -extension we invite you to read Andrew Waterman's master thesis report which contains many insights on the impact of compressed instructions on RISC-V program (link to pdf).
References: