
Support of half precision floating-point numbers in RISC-V ISA: Zfh and Zfhmin

Floating-point support in RISC-V ISA

RISC-V does not mandate support for any floating-point operations as part of the base ISA (RV32I and RV64I), but single-precision and double-precision extensions (e.g. RV64F and RV64D) were among the first to be specified. The F extension added 32 floating-point registers, floating-point arithmetic operations, and move and conversion instructions. The D extension (which requires the F extension) extended this set of operations to also support double precision. There also exists a Q extension for quad precision.

In reality, there are multiple F and D extensions: RV32F (to extend the 32-bit base ISA) and RV64F (for the 64-bit base ISA); similarly, there are RV32D and RV64D. RV32D has the particularity of adding 64-bit wide floating-point registers when the associated general purpose registers are only 32-bit wide. This flexibility of the RISC-V ISA is reviewed in the post RISC-V Register Files.

Initially, smaller floating-point formats, such as half precision, were not supported.

Half precision used to lag behind in terms of ISA and hardware support. It was only specified as a storage format in the 2008 revision of the IEEE-754 standard (the IEEE standard specifying floating-point formats and operations). The momentum of deep learning and convolutional neural networks has kick-started a renewed interest in small number formats, and in particular half precision (among many others).

Note: In the IEEE-754 2019 revision, half precision is still not defined as a basic format but as an interchange format, although the standard is more permissive in terms of which formats can admit arithmetic operations.

RISC-V International (the association behind the RISC-V specification(s)) recently ratified two extensions specifying sets of instructions for half-precision support: Zfh and Zfhmin. The latter being defined as a subset of the former, we will review Zfh first.

Zfh: full half-precision support

Zfh extends the floating-point register file (FRF) to support half precision. It requires the F extension, so neither a new register file nor any register width extension is required, since the floating-point registers are already wide enough to contain single-precision values.
Like single-precision values in RV32D/RV64D, half-precision values are stored in the larger FLEN-bit registers using the RISC-V NaN boxing scheme (more on that in a future post).
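
The NaN boxing mentioned above can be sketched in a few lines of Python. This is only an illustration: the helper names and the choice of FLEN=64 are assumptions made for the example, while the rule itself (a narrower value sits in the low bits with every upper bit set to 1, and an improperly boxed value is read back as the canonical NaN) follows the RISC-V floating-point specification.

```python
FLEN = 64                  # assumed register width (e.g. a core implementing D)
CANONICAL_NAN_H = 0x7E00   # canonical half-precision quiet NaN bit pattern


def nan_box_half(bits16, flen=FLEN):
    """Place a 16-bit pattern in a flen-bit register with all upper bits set to 1."""
    upper_ones = ((1 << flen) - 1) & ~0xFFFF
    return upper_ones | (bits16 & 0xFFFF)


def unbox_half(reg, flen=FLEN):
    """Return the 16-bit payload, or the canonical NaN if the value is not boxed."""
    upper_ones = ((1 << flen) - 1) & ~0xFFFF
    if (reg & upper_ones) != upper_ones:
        return CANONICAL_NAN_H
    return reg & 0xFFFF


boxed_one = nan_box_half(0x3C00)   # 0x3C00 is 1.0 in half precision
print(hex(boxed_one))
print(hex(unbox_half(boxed_one)))
print(hex(unbox_half(0x3C00)))     # upper bits are not all ones: read as NaN
```

Seen as a wider floating-point value, the boxed pattern is itself a NaN, which is where the scheme gets its name.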

This new extension defines:
  • Instructions to move data (with or without conversions):
    • flh: load from memory into an F register
    • fsh: store to memory from an F register
    • fmv.h.x, fmv.x.h: bit pattern move between X and F register files (unmodified)
    • fcvt.h.(w/l)[u], fcvt.(w/l)[u].h: conversions between X and F register files (from/to integer)
  • Arithmetic operations: fadd.h, fsub.h, fmul.h, fsqrt.h, fdiv.h, fmin.h, fmax.h, f(n)madd/f(n)msub.h
  • Floating-point comparisons: feq.h, flt.h, fle.h
  • Conversions between half precision and other floating-point formats
  • Miscellaneous: fclass.h, fsgnj(/n/x).h

All arithmetic operations operate on uniform I/Os: all operands are half-precision values, and the output is a half-precision result. They share their opcode with the corresponding F/D instructions: in the fmt field (bits 25 and 26), the value 2'b10 (2) encodes half precision (2'b00 encodes single precision, 2'b01 double, and 2'b11 quad, but who needs such precision!).
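
As a small illustration of the fmt field, the Python sketch below reads bits 26:25 of an instruction word and shows that changing only those two bits turns a single-precision addition into its half-precision counterpart. The helper name and the table are made up for this example; the sample word is fadd.s f0, f0, f0 with the rounding-mode field set to 0.

```python
# Names for the 2-bit fmt field of floating-point arithmetic instructions
FMT_NAMES = {0b00: "single", 0b01: "double", 0b10: "half", 0b11: "quad"}


def fp_format(insn):
    """Return the precision selected by the fmt field (bits 26:25)."""
    return FMT_NAMES[(insn >> 25) & 0b11]


FADD_S = 0x00000053                # fadd.s f0, f0, f0 (rm = 0)
fadd_h = FADD_S | (0b10 << 25)     # same opcode and funct5, fmt = half

print(fp_format(FADD_S))
print(hex(fadd_h), fp_format(fadd_h))
```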


The following diagram represents the data move instructions:



The availability of move to/from and conversion with 64-bit operands is conditioned on the availability of RV64 (general purpose move and integer conversions) and D extension (floating-point conversions).
Although certainly anecdotal, when the Q extension (floating-point quad precision, a.k.a. the 128-bit format) is available, instructions for conversions from/to quad precision are also defined: fcvt.h.q and fcvt.q.h.

Zfhmin: reduced half-precision support

The extension Zfhmin can be seen as a subset of Zfh. It only mandates support for half-precision load and store, bit pattern moves from/to integer register file (no conversions with integer formats) and conversions with other floating-point formats. It represents a total of 6 instructions (extended to 8 with conversion from/to double precision format if the D extension is supported).
Zfhmin constitutes a reduced set of instructions which can be used by platforms where computing directly with half-precision values is not required, but which still need the capability to manipulate them, in particular in memory, before converting them to a larger format for an eventual computation.
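
The usage model that Zfhmin targets (keep data in half precision in memory, compute in a wider format) can be mimicked in Python with the struct module, whose "e" format code is IEEE half precision. This is only a rough software analogy: the function names are invented for the example, and each helper stands in for the instructions named in its comment rather than modelling them exactly (rounding modes and exception flags are ignored).

```python
import struct


def load_half(mem, offset):
    """Stand-in for flh followed by fcvt.s.h: read 2 bytes, widen the value."""
    (value,) = struct.unpack_from("<e", mem, offset)
    return value


def store_half(mem, offset, value):
    """Stand-in for fcvt.h.s followed by fsh: narrow the value, write 2 bytes."""
    struct.pack_into("<e", mem, offset, value)


def to_single(x):
    """Round a Python float (double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def scale_in_place(mem, count, factor):
    """Scale `count` half-precision values in `mem`, computing in single precision."""
    for i in range(count):
        x = load_half(mem, 2 * i)
        y = to_single(x * factor)   # the arithmetic happens in the wider format
        store_half(mem, 2 * i, y)


buf = bytearray(struct.pack("<3e", 1.5, -0.25, 2.0))
scale_in_place(buf, 3, 2.0)
print(struct.unpack("<3e", buf))
```

Only the load, store and conversion steps need half-precision support here, which matches the instructions Zfhmin mandates.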

Vector support for half precision: Zvfh and Zvfhmin

The RISC-V Vector extension (RVV) version 1.0 specified vector support for single and double precision (SEW=32 and 64 bits). A draft of the specification (link to source) introduces Zvfh, which extends all the floating-point instructions to half precision, including conversions with other formats, and Zvfhmin, which is a much reduced subset of operations. As is the case for other floating-point formats, half precision is selected by a specific value of the vsew field of the vtype configuration register: vsew=1. This encoding corresponds to a Selected Element Width (SEW) of 16 bits and is identical for both integer and floating-point formats.
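
The relation between vsew and SEW is simple enough to write down as a short Python sketch. The function names are hypothetical; the mapping (SEW = 8 << vsew, with vsew held in bits 5:3 of vtype and values above 3 reserved) is the one defined in RVV 1.0.

```python
def sew_from_vsew(vsew):
    """Selected Element Width in bits for a given vsew encoding."""
    if not 0 <= vsew <= 3:
        raise ValueError("reserved vsew encoding")
    return 8 << vsew


def sew_from_vtype(vtype):
    """Extract vsew (bits 5:3 of vtype) and return the matching SEW."""
    return sew_from_vsew((vtype >> 3) & 0b111)


for vsew in range(4):
    print(vsew, sew_from_vsew(vsew))
```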

Zvfhmin only mandates the support of conversion between half and single precision: it extends the support of vfwcvt.f.f.v (widening half-to-float) and vfncvt.f.f.w (narrowing float-to-half) to SEW=16-bit.

Zvfh extends to half precision all the floating-point vector operations, floating-point reductions and floating-point moves. A brief description of those operations can be found in Part 2 of our series RISC-V Vector Extension in a Nutshell.

Both Zvfhmin and Zvfh mandate the support of single-precision elements in vectors. On top of this support, Zvfh mandates at least Zfhmin on the scalar floating-point side.

Half precision in RVA22 profile

The RISC-V consortium defines profiles. These profiles aim at defining a common set of mandatory extensions and a reduced set of optional extensions which can be used by hardware and software providers to build a compatible ecosystem without having to deal with more specialized ISA extensions. Profile descriptions can be found on RISC-V github.

RVA22 is the most recent profile; it is dedicated to 64-bit application processors.

Zfhmin is part of the mandatory extensions of the RVA22 profile, while Zfh is an optional extension (which supersedes Zfhmin when selected). This means that all application processors targeting compatibility with the RISC-V ecosystem must have minimal support for half precision, and that extended support is part of the extended profile. Neither of the vector extensions Zvfh and Zvfhmin is required in the RVA22 profile.

Conclusion

RISC-V half-precision support in scalar operations has already been ratified (as extensions Zfh and Zfhmin) and is part of the latest application profile (RVA22). There exist draft specifications for the support of half precision in vector operations: Zvfh and Zvfhmin. Other formats, such as BFloat16 or more esoteric number formats, should follow in the coming years.
RISC-V is an open community, so do not hesitate to sign up to stay up to date and participate in the effort: http://riscv.org.


RISC-V Register Files

RISC-V ISA defines several register files. There are at least 3 in the main set of extensions: the general purpose register file (XRF) introduced in the base integer extensions, the floating-point register file (FRF) introduced in the floating-point extensions and the vector register file (you guessed it, it was introduced in the vector extension a.k.a. RVV). We will not consider the control and status registers (CSR file) which have their own specificity (it is more common to split system registers from general purpose registers, although commonality could be debated).

Diagram of RISC-V register files and the operations between them


Register files characteristics

Each register file contains 32 architectural registers. The first register of the general purpose register file (x0) is a bit specific since its value is a hardwired constant: 0. The first register of the vector register file (v0) is the only one which can be used as the mask operand in RVV 1.0 (more information on RVV masked operation can be found in RISC-V Vector Extension in a Nutshell: part 3). 

The size of the registers in each file is an architecture parameter: XLEN for the general purpose register file, FLEN for the floating-point register file and VLEN for the vector register file. 

The general purpose register file is sized to fit a virtual address value. For example, for the base 32-bit RV32I ISA, the general purpose registers are 32-bit wide, while they are 64-bit wide for RV64I (and 128-bit wide for RV128I, although this architecture is seldom used). For the vector registers, VLEN must be a power of 2, greater than or equal to ELEN (the maximum element width supported by the implementation), and must not exceed 65536 (2^16).
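
The three constraints on VLEN listed above translate directly into a small check. The function name is made up for this sketch, and VLEN and ELEN are passed in as plain integers (in bits).

```python
def is_valid_vlen(vlen, elen):
    """Check the VLEN constraints: power of 2, at least ELEN, at most 2**16."""
    is_power_of_two = vlen > 0 and (vlen & (vlen - 1)) == 0
    return is_power_of_two and elen <= vlen <= 65536


for vlen in (32, 96, 128, 65536, 131072):
    print(vlen, is_valid_vlen(vlen, elen=64))
```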

Moving data between register files

The diagram at the beginning of this post illustrates the base characteristics of each register file and the basic move operations between them and from/to memory. For the XRF and FRF, only operations on 32-bit values are drawn. Similar operations exist for double precision: e.g. RV64D, where XLEN=FLEN=64 bits, adds fld, fsd, fmv.d.x, fmv.x.d and numerous conversions from integer formats to double precision and back (e.g. in RV64IFD, fcvt.d.wu f3, x2 converts the unsigned 32-bit integer held in the bottom 32 bits of the x2 general purpose register to a double-precision number stored in the f3 floating-point register).
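
To make the fcvt.d.wu example concrete, here is a minimal Python model of what the conversion computes on an RV64 register value, next to its signed sibling fcvt.d.w for contrast. The function names are illustrative; Python floats are double precision, and every 32-bit integer is exactly representable in that format, so no rounding is involved.

```python
def fcvt_d_wu(x_reg):
    """Model of fcvt.d.wu: low 32 bits of the X register, read as unsigned."""
    return float(x_reg & 0xFFFFFFFF)


def fcvt_d_w(x_reg):
    """Model of fcvt.d.w: low 32 bits of the X register, read as signed."""
    low = x_reg & 0xFFFFFFFF
    if low & 0x80000000:
        low -= 1 << 32
    return float(low)


x2 = 0xFFFFFFFFFFFFFFFF   # all ones in a 64-bit general purpose register
print(fcvt_d_wu(x2))
print(fcvt_d_w(x2))
```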

For the vector register file, the scalar data size and the vector element size are not encoded as part of the opcode but are configured in the vsew field of the vtype configuration register, so there is no need for type-specific vector moves. The diagram only represents explicit data moves between FRF/XRF and VRF, but most vector instructions admit a vector-scalar variant which reads one of its operands directly from the XRF or FRF (e.g. vfmadd.vf splats a scalar floating-point register as the multiplier). For more details on the vector registers and the vector extension in general, you can refer to the series RISC-V Vector Extension in a Nutshell published on this blog.

Why multiple register files?

This discussion focuses on the general purpose and floating-point register files. Having a separate vector register file is much more understandable, since vector registers tend to be larger than the other types of registers. Some ISAs (e.g. x86 SSE and AVX extensions) reuse small registers as the low parts of larger registers, but this is not the case in RISC-V, where general purpose, floating-point and vector registers do not overlap.

NOTE: The option of overlapping FRF and VRF was considered and dropped during the specification process of RVV. See https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#51-scalar-operands

Having multiple register files has some advantages and some drawbacks. Let's start with a benefit.

The first benefit is that the architecture can expose more architectural registers without extending the size of a register index in the instruction encoding. The type of instruction (e.g. floating-point addition) is encoded in the opcode, and all (most) floating-point instructions implicitly operate on floating-point registers, so there is no need to distinguish floating-point and general purpose registers in the encoding of the register index itself. The number of available architectural registers impacts the register allocation pressure when writing (in assembly) or compiling a program: more registers often mean a more flexible ABI and less spilling.

In general, the RISC-V ISA uses 5 bits to encode a register index (for operands and results), which provides 32 registers. Since the type of register is part of the opcode specification rather than the register index, the RISC-V architecture can in fact have 96 registers: 32 general purpose registers, 32 floating-point registers and 32 vector registers (you are authorized to say only 95 and exclude the specific x0, although having a hardwired 0-value operand is quite useful and certainly outweighs the benefit of an extra general purpose register in most cases).
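
The point about 5-bit register indices can be illustrated with a short Python sketch that builds two R-type instruction words, add x3, x1, x2 and fadd.s f3, f1, f2 (with the rounding-mode field left at 0). The helper names are hypothetical, and register_file is deliberately simplistic: it only tells these two major opcodes apart, whereas some floating-point instructions (moves, conversions, comparisons) also read or write general purpose registers.

```python
OPCODE_OP = 0b0110011      # integer register-register operations (e.g. add)
OPCODE_OP_FP = 0b1010011   # floating-point operations (e.g. fadd.s)


def encode_r(opcode, rd, rs1, rs2, funct3=0, funct7=0):
    """Assemble an R-type word; each register index must fit in 5 bits."""
    for index in (rd, rs1, rs2):
        if not 0 <= index < 32:
            raise ValueError("register index does not fit in 5 bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def register_fields(insn):
    """Return the (rd, rs1, rs2) indices stored in an R-type word."""
    return ((insn >> 7) & 0x1F, (insn >> 15) & 0x1F, (insn >> 20) & 0x1F)


def register_file(insn):
    """Simplified: pick the register file from the major opcode alone."""
    return "F" if (insn & 0x7F) == OPCODE_OP_FP else "X"


add_insn = encode_r(OPCODE_OP, rd=3, rs1=1, rs2=2)
fadd_insn = encode_r(OPCODE_OP_FP, rd=3, rs1=1, rs2=2)
print(hex(add_insn), register_fields(add_insn), register_file(add_insn))
print(hex(fadd_insn), register_fields(fadd_insn), register_file(fadd_insn))
```

Both words carry the same three indices (3, 1, 2); only the opcode decides whether they name x or f registers.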

This benefit can also be considered a drawback, since more registers also mean a higher hardware cost for implementations and more registers to save on a context switch. This was addressed in RISC-V by introducing the Zfinx/Zdinx/Zhinx extensions, which define floating-point operations working on general purpose registers, thus limiting the cost of floating-point support for the more constrained implementations. High-performance out-of-order implementations with very large physical register files are generally less concerned with limiting the size of the architectural register files, although it can impact mapping tables, context sizes ...

Another benefit of separate register files is that registers can be encoded to optimize the data they store: for example, the RISC-V open-source processor rocket-chip uses hardfloat's specific encoding of floating-point numbers (see Recoded Format section here) to make floating-point operations more efficient (simplifying the detection of special values and reducing the encoding difference between normal and subnormal numbers). The use of a format-specific encoding is facilitated by the fact that data moves are explicit (you need to execute an fmv.w.x to move the content of a general purpose register to a floating-point register before performing any operation on it): the recoding of a value can be performed during the explicit data move (including from/to memory), and the recoding can exploit the fact that the value type is determined by the operation acting on it. Such recoding can only live within an internal register, and values must be converted back to canonical formats when moved to another register file or to memory.

Once again, this can also be considered a drawback, since you have to explicitly move data from one register file to the next (which may have a non-zero latency and uses encoding space) and you need to define separate memory operations for each register file: loading a 32-bit single-precision number is not the same operation as loading a 32-bit integer value, since the destination registers differ. This was surveyed by the diagram and the section Moving data between register files. This drawback is more easily alleviated by wide out-of-order implementations, which can extract more ILP and cover the cost of those explicit moves (although this cost still impacts latency-dominated chains of instructions). It is also removed in the Z(f/d/h)inx extensions.

Another advantage of multiple register files is that you can tune each file's architectural characteristics to the domain you want to support. RISC-V does not expose a configurable number of registers, but XLEN, FLEN and VLEN are defined separately (the first two depending on which extensions are enabled). For example, you can have a 32-bit ISA with 64-bit double-precision registers: the architecture does not have to extend its integer registers to 64 bits, while still keeping 64 registers with adapted sizes. General purpose/integer instructions can be efficient and low power, while the core only activates the FRF for workloads which require the extra activity.

Finally, an implementation advantage, which was pointed out by a colleague (Alex S.): having several register files reduces the number of read/write ports per register file while still serving as many execution pipelines, and doing so more efficiently. High-performance cores often have a lot of ports on each register file (more than ten read ports may not be uncommon) that serve a lot of different execution pipelines in parallel. For the same total number of execution pipelines, multiplying the number of architectural register files helps to split the physical register files accordingly, decreasing the number of ports per file. This is a real benefit, as the complexity of a register file increases rapidly with the number of ports (in particular read ports). Even for low-performance implementations, limiting the number of ports per file provides more efficient register files.

Conclusion

In this post we reviewed the main register files specified by RISC-V ISA, their basic characteristics and how they interact together. We listed some of the reasons for this choice alongside some of the drawbacks of specializing register files.

Initially published Oct 17th 2022, updated Oct 19th 2022.

Thanks

Thank you to Alex S. for pointing out a key advantage (decreasing the number of ports per register file) I missed in the first version of this blog post.
