Coding Illness: December 2017

New Operation/Instruction for Fast GCM multiplication

In this article we will study how we could compute GCM multiplication with only 3 instructions (and two exclusive 64-bit or) by extending a very interesting white paper by Intel: Intel ® Carry-Less Multiplication Instruction and its Usage for computing the GCM Mode ([4]).

We will first introduce GCM, then survey the basic of the carry less multiplication and the Karatsuba algorithm for multiplication before detailing the modulo reduction method of [4]. Then we will introduce our new operation and evaluate its hardware implementation.

Galois Counter Mode

Galois-Counter Mode or GCM ([1]) is a block cipher mode of operation which uses a multiplication in a finite field (Galois Field or $GF$) as basis for the authentication procedure. The Galois Field used is $ GF(2)[X] \!/\!_{X^{128} + X^7 + X ^2 + 1}$. The fact that the irreducible polynomial used to build the field has a very low number of non-zero terms is very useful to implement a fast modulo reduction.

Carry Less Multiplication

In the following most binary values encode polynomial over $GF(2)[X]$. For example a 64-bit value $A$ encodes $A=\sum_{i=0}^{63}a_i . X^i$ where a_i is the i-th bit of A (and a coefficient in $GF(2)$). The addition of two polynomials $A$ and $B$ is simply their binary xor: $A + B = \sum_{i=0}^{63}(a_i + b_i \mod 2) = \sum_{i=0}^{63}(a_i \oplus b_i ) = A \oplus B $.

A multiplication of polynomials over $GF(2)[X] is called a Carry Less Multiplication because there are no carry propagation between monomial term of distinct degrees.
The following listing shows a possible implementation of a Carry-Less multiplication of two 64-bit polynomials, giving a 128-bit result. As we operate on polynomials, the accumulate operation is a XOR which contrary to an addition does not propagate a carry.

__int128 poly_mul_64x64_128(uint64_t a, uint64_t b)
{
   int i;
   __int128 result = 0;
   for (i = 0; i < 64; ++i) {
       // result +=  b . X^i if a_i is non-zero
       result ^= ((a >> i) & 1) ? (b << i) : 0;
   }

}

Recently, in x86 architecture, the carry-less multiplication operation is implemented by a specific instruction: PCLMULQDQ ([2]).

Assuming the availability of a $64-bit \times 64-bit \rightarrow 128-bit $ carry-less operation, a full GCM reduction can be performed by first computing the full $128 \times 128$ CLM using 4 $64 \times 64$ multiply operations and then performing a $mod G$ reduction on the $128$ most significant bits of the multiplication result.

Fast GCM Multiplication

A very interesting white paper by intel [4] sums up a lot of good ideas on how to implement a fast GCM multiplication.
The proposal is two-fold:

Use Karatsuba algorithm [5] to accelerate the multiplication
Use the structure of the field to accelerate the reduction part

This white paper contains other interesting idea (folding ...) that will not be detailed in this article.

Karatsuba multiplication

We wish to compute $M = A \times B $, where $A$ and $B$ are 128-bit polynomial (degree 127 in $GF(2)[X]$). Thus M is at of degree 255. Let us split A and B into high and low part:

$$ A = A_{hi}. X^{64} + A_{lo}, B = B_{hi}. X^{64} + B_{lo}$$

$$ M = A_{hi} \times B_{hi} . X^{128} + (A_{hi} \times B_{lo} + A_{lo} \times B_{hi}) . X^{64} + A_{lo} \times B_{lo} $$

Assuming $ C = A_{hi} \times B_{hi}$, and $ D = A_{lo} \times B_{lo}$ and $E = (A_{hi} + A_{lo}) $ and $F = B_ {hi} + B_{lo}$, $ (A_{hi} \times B_{lo} + A_{lo} \times B_{hi}) $ can be writen as $$ (A_{hi} \times B_{lo} + A_{lo} \times B_{hi}) = E \times F - C - D $$.

As we are working in $GF(2)[X]$, the operation $+$ is equivalent to $-$: both are binary $xor$.

We went from using 4 half multiplications in the standard algorithm to only 3 with Karatsuba (while appending a few additions). Additions are considered cheap compared to multiplications (and they generally are) so it should not be an issue.

Fast Reduction

[4] provides the following algorithm (Algorithm 3) to compute, $R=M_{hi} \times X^t (\mod G)$, assuming an input $M_{hi}$ and $G$ an irreducible polynomial of degree $t$, $s=max(degree( M_{hi}))+1$ (for GCM, $s=128$):

Preprocessing: we compute $g$ and $q$, $g$ is the polynomial built by the $t-1$ least coefficient of $G$ and $q$ is the quotient of the division of $X^{t+s}$ by $G$ where $s$ is the degree of the quotient of the division of $X^t \times M$ by $G$
Compute $A = M_{hi} \times q$,
Multiply the $s$ most significant terms of $A$ by $g$
Output the $t$ least significant terms of the Step 3.

Let us apply this algorithm to the case of GCM: $G = X^{128} + X^7 + X^2 + X + 1$, $g=X^7 + X^2 + X + 1$. We want to compute $M = A \times B $, where $A$ and $B$ are polynomials of degree 127.

Let us first compute $q$ such that $X^{256}=q \times G + r $ with $degree(r) < degree(G)$, it is easy to verify that $X^{256}=G^2 \oplus x^{14}+x^4+x^2+1 $, thus $q = G$.

We split M in 4 part of 64 term each: from most significant to least signifiant $M_3, M_2, M_1, M_0$, $M_{hi} =[ M_3 : M_2] = M_3 . X^{64} \oplus M_2 $.

$$ M [G] = M_1 . X^{64} \oplus M_0 \oplus ((M_3 X^{64} \oplus M_2) . X^{128} [G]) $$

To get $((M_3 X^{64} \oplus M_2) . X^{128} [G])$, we use the algorithm from [4]:
$$ A = (M_3 X^{64} \oplus M_2) \times q $$
As we only need the 128 most significan terms of $A$, $A_{hi}$ we can compute them as follows:

\begin{equation}
\begin{split}
A_{hi} & = [M_3 : M_2] \oplus [M_3 : M_2 ] >> (128 - 7) \oplus [M_3 : M_2 ] >> (128 - 2) \\
&\oplus [M_3 : M_2 ] >> (128 - 1) \oplus [M_3 : M_2 ] >> (128 - 0) \\
& = [M_3 : M_2] \oplus [0 : M_3 ] >> 57 \oplus [0 : M_3 ] >> 62 \oplus [0 : M_3] >> 63 \\
& = [M_3 : (M_2 \oplus M_3 >> 57 \oplus M_3 >> 62 \oplus M_3 >> 63)]
\end{split}
\end{equation}

We now compute $R_{lo}$ from $ R = A_{hi} \times g $
\begin{equation}
\begin{split}
R &= A_{hi} \times g \\
&= A_{hi} \times (X^7 + X ^2 + X + 1) \\
&= [A_{hi,1} : A_{hi,0}] << 7 \oplus [A_{hi,1} : A_{hi,0}] << 2 \oplus [A_{hi,1} : A_{hi,0}] << 1 \oplus [A_{hi,1} : A_{hi,0}]
\end{split}
\end{equation}

Computing $A_{hi}$ requires 3 64-bit right shifts and 3 64-bit exclusive-or operations.
Computing $R$ from $A_{hi}$ is a little more expensive: it requires 3 128-bit shifts and 3 128-bit exclusive or operations. Overall a hardware implementation of this reduction will require around 576 standard xor gates (static shifts being free).

Merging operation for GCM multiplication

We are now going to merge the operations described above and the Karatsuba algorithm. The idea is to suggest a "generic" operation that could be implemented as processor instruction to accelerate GCM multiplication with more efficiency than the expanded version described in [4].

Compute $O_1 = (A_{lo} \times B_{lo} \oplus (A_{lo} \times B_{lo} . X^{64})) [G] $
Compute $O_2 = (A_{hi} \times B_{hi} . X^{128} \oplus (A_{hi} \times B_{hi} . X^{64})) [G] \oplus O_1 $
Compute $(O_3 = ((A_{hi} \oplus A_{lo}) \times (B_{hi} \oplus B_{lo}) . X^{64}) [G] \oplus O_2 $
Output the final result $O_3 $

Each step consists in:

One $64-bit \times 64-bit$ carry-less multiplication
One 128-bit modulo $G$ reduction
A few exclusive-or operations

The product operands for the third step ($O_3$) may be computed in parallel to step 1 and 2, they consist in two 64-bit exclusive or operations.
We can factorize the previous steps into a new primitive: GCMRED.

$GCMRED(A, B, \alpha, \beta, C)$, with $A$ and $B$ 64-bit operand, $C$ a 128-bit operand and $\alpha, \beta$ coefficients in $\{0, 1, X^{64}, X^{128}\}$ implements the following operations:
$$ GCMRED(A, B, \alpha, \beta, C) = (A \times B) . (\alpha + \beta) [G] \oplus C $$
We can then compact a full GCM multiplication into 3 consecutives application of $GCMRED$:

\begin{equation*}
\begin{split}
GCM(A_{full}, B_{full}) &= GCMRED(A_{hi} \oplus A_{lo}, B_{hi} \oplus B_{lo}, X^{64}, 0, \\
& \ \ \ \ \ \ GCMRED(A_{hi} , B_{hi} , X^{64}, X^{128}, \\
& \ \ \ \ \ \ \ \ \ GCMRED(A_{lo} , B_{lo} , 1, X^{64}, 0 )))
\end{split}
\end{equation*}

As you can see we do not need the full range of $\alpha$ and $\beta$: only 3 variants are required, namely: $(X^{64}, 0)$, $(X^{64}, X^{128})$ and $(1, X^{64})$. Thus only those 3 may be encoded as possible instructions.

Complexity of GCMRED

A possible implementation of GCMRED is illustrated by the figure below. There are 4 64-bit inputs: $A$, $B$ ,$C_{hi}$ and $C_{lo}$ and 2 64-bit output corresponding to the high and low part of the output.

The desisgn consists in :

A $64 \times 64$ Carry-Less Multiplication
A 128-bit GCM modulo reducer
6 64-bit exclusive or operators
4 64-bit and operators (to zeroify signals)
2 64-bit 3-input multiplexers

A naive CLM implementations would require around 4096 $and$ gates (with two 1-bit inputs and one 1-bit output) and about the same number of $xor$ gates (give or take a few). The GCM modulo reducer as described in a previous section requires about 576 $xor$ gates.

If we consider the silicon area of a xor gates is about twice those of a nand gate (generally used as reference for silicon design comparison), the silicon area of an $and$ gate is $ \frac{3}{2}$ times a nand gate and the silicon area of a 3-input multiplexer (with one input to zero) to be 5 times a nan gate we can evaluate the overcost of GCMRED compare to a simple CLM multiplication to be roughly around $25\%$.

This delta may be under estimated as the area to implement a carry less multiply can be further reduced from the naive implementation by using the Karatsuba algorithm recursively to obtain a more efficient design.

Endianess and GCM

In the AES version of Galois Counter Mode, input are assumed to be received in reverse order (least significant bit correspond to the term $X^{127}$ in a 128-bit value) which implies both inputs and outputs must be permuted before and after standard polynomial computations to ensure correct results. Such a permutation is not performed by GCMRED. The cost of supporting it may be limited to introducing a level of inputs multiplexers and a level of output multiplexers to select between standard I/Os and permuted ones (allowing only two permutations).

Conclusion

This small technical report suggests a small addition to the PCLMULQDQ design with which a 25% area increase will bring down the cost of computing a GCM multiplication to five operations (with a critical path length of 3).

Future works include: synthesising the design and a standard CLM using a standard silicon technology node for better comparison (with critical path and latency analysis), comparing more accurately a full GCM implementation using the new GCMRED instruction operation with Intel's best implementation (including folding and factorized bit reversing).

Updated on January 3rd, 2018

Thank you to Julien LM and Hugues de LSG and Arnaud O for pointing many mistakes in the first versions.
Second update on August 27th, 2018, fixing typo in GCMRED description

Coding Illness

New instruction for Fast GCM multiplication