New Operation/Instruction for Fast GCM multiplication
In this article we will study how we could compute GCM multiplication with only 3 instructions (and two exclusive 64-bit or) by extending a very interesting white paper by Intel: Intel ® Carry-Less Multiplication Instruction and its Usage for computing the GCM Mode ([4]).
We will first introduce GCM, then survey the basic of the carry less multiplication and the Karatsuba algorithm for multiplication before detailing the modulo reduction method of [4]. Then we will introduce our new operation and evaluate its hardware implementation.
Galois Counter Mode
Galois-Counter Mode or GCM ([1]) is a block cipher mode of operation which uses a multiplication in a finite field (Galois Field or GF) as basis for the authentication procedure. The Galois Field used is GF(2)[X]/X128+X7+X2+1.
The fact that the irreducible polynomial used to build the field has a very low number of non-zero terms is very useful to implement a fast modulo reduction.
Carry Less Multiplication
In the following most binary values encode polynomial over GF(2)[X]. For example a 64-bit value A encodes A=∑63i=0ai.Xi where a_i is the i-th bit of A (and a coefficient in GF(2)). The addition of two polynomials A and B is simply their binary xor: A+B=∑63i=0(ai+bimod2)=∑63i=0(ai⊕bi)=A⊕B.
A multiplication of polynomials over $GF(2)[X] is called a Carry Less Multiplication because there are no carry propagation between monomial term of distinct degrees.
The following listing shows a possible implementation of a Carry-Less multiplication of two 64-bit polynomials, giving a 128-bit result. As we operate on polynomials, the accumulate operation is a XOR which contrary to an addition does not propagate a carry.
The following listing shows a possible implementation of a Carry-Less multiplication of two 64-bit polynomials, giving a 128-bit result. As we operate on polynomials, the accumulate operation is a XOR which contrary to an addition does not propagate a carry.
__int128 poly_mul_64x64_128(uint64_t a, uint64_t b) { int i; __int128 result = 0; for (i = 0; i < 64; ++i) { // result += b . X^i if a_i is non-zero result ^= ((a >> i) & 1) ? (b << i) : 0; } }
Recently, in x86 architecture, the carry-less multiplication operation is implemented by a specific instruction: PCLMULQDQ ([2]).
Assuming the availability of a 64−bit×64−bit→128−bit carry-less operation, a full GCM reduction can be performed by first computing the full 128×128 CLM using 4 64×64 multiply operations and then performing a modG reduction on the 128 most significant bits of the multiplication result.
Fast GCM Multiplication
A very interesting white paper by intel [4] sums up a lot of good ideas on how to implement a fast GCM multiplication.The proposal is two-fold:
- Use Karatsuba algorithm [5] to accelerate the multiplication
- Use the structure of the field to accelerate the reduction part
Karatsuba multiplication
We wish to compute M=A×B, where A and B are 128-bit polynomial (degree 127 in GF(2)[X]). Thus M is at of degree 255. Let us split A and B into high and low part:
A=Ahi.X64+Alo,B=Bhi.X64+Blo
M=Ahi×Bhi.X128+(Ahi×Blo+Alo×Bhi).X64+Alo×Blo
Assuming C=Ahi×Bhi, and D=Alo×Blo and E=(Ahi+Alo) and F=Bhi+Blo, (Ahi×Blo+Alo×Bhi) can be writen as (Ahi×Blo+Alo×Bhi)=E×F−C−D.
As we are working in GF(2)[X], the operation + is equivalent to −: both are binary xor.
We went from using 4 half multiplications in the standard algorithm to only 3 with Karatsuba (while appending a few additions). Additions are considered cheap compared to multiplications (and they generally are) so it should not be an issue.
Fast Reduction
[4] provides the following algorithm (Algorithm 3) to compute, R=Mhi×Xt(modG), assuming an input Mhi and G an irreducible polynomial of degree t, s=max(degree(Mhi))+1 (for GCM, s=128):- Preprocessing: we compute g and q, g is the polynomial built by the t−1 least coefficient of G and q is the quotient of the division of Xt+s by G where s is the degree of the quotient of the division of Xt×M by G
- Compute A=Mhi×q,
- Multiply the s most significant terms of A by g
- Output the t least significant terms of the Step 3.
Let us apply this algorithm to the case of GCM: G=X128+X7+X2+X+1, g=X7+X2+X+1. We want to compute M=A×B, where A and B are polynomials of degree 127.
Let us first compute q such that X256=q×G+r with degree(r)<degree(G), it is easy to verify that X256=G2⊕x14+x4+x2+1, thus q=G.
We split M in 4 part of 64 term each: from most significant to least signifiant M3,M2,M1,M0, Mhi=[M3:M2]=M3.X64⊕M2.
M[G]=M1.X64⊕M0⊕((M3X64⊕M2).X128[G])
To get ((M3X64⊕M2).X128[G]), we use the algorithm from [4]:
A=(M3X64⊕M2)×q
As we only need the 128 most significan terms of A, Ahi we can compute them as follows:
Ahi=[M3:M2]⊕[M3:M2]>>(128−7)⊕[M3:M2]>>(128−2)⊕[M3:M2]>>(128−1)⊕[M3:M2]>>(128−0)=[M3:M2]⊕[0:M3]>>57⊕[0:M3]>>62⊕[0:M3]>>63=[M3:(M2⊕M3>>57⊕M3>>62⊕M3>>63)]
We now compute Rlo from R=Ahi×g
R=Ahi×g=Ahi×(X7+X2+X+1)=[Ahi,1:Ahi,0]<<7⊕[Ahi,1:Ahi,0]<<2⊕[Ahi,1:Ahi,0]<<1⊕[Ahi,1:Ahi,0]
Computing Ahi requires 3 64-bit right shifts and 3 64-bit exclusive-or operations.
Computing R from Ahi is a little more expensive: it requires 3 128-bit shifts and 3 128-bit exclusive or operations. Overall a hardware implementation of this reduction will require around 576 standard xor gates (static shifts being free).
M[G]=M1.X64⊕M0⊕((M3X64⊕M2).X128[G])
To get ((M3X64⊕M2).X128[G]), we use the algorithm from [4]:
A=(M3X64⊕M2)×q
As we only need the 128 most significan terms of A, Ahi we can compute them as follows:
Ahi=[M3:M2]⊕[M3:M2]>>(128−7)⊕[M3:M2]>>(128−2)⊕[M3:M2]>>(128−1)⊕[M3:M2]>>(128−0)=[M3:M2]⊕[0:M3]>>57⊕[0:M3]>>62⊕[0:M3]>>63=[M3:(M2⊕M3>>57⊕M3>>62⊕M3>>63)]
We now compute Rlo from R=Ahi×g
R=Ahi×g=Ahi×(X7+X2+X+1)=[Ahi,1:Ahi,0]<<7⊕[Ahi,1:Ahi,0]<<2⊕[Ahi,1:Ahi,0]<<1⊕[Ahi,1:Ahi,0]
Computing Ahi requires 3 64-bit right shifts and 3 64-bit exclusive-or operations.
Computing R from Ahi is a little more expensive: it requires 3 128-bit shifts and 3 128-bit exclusive or operations. Overall a hardware implementation of this reduction will require around 576 standard xor gates (static shifts being free).
Merging operation for GCM multiplication
We are now going to merge the operations described above and the Karatsuba algorithm. The idea is to suggest a "generic" operation that could be implemented as processor instruction to accelerate GCM multiplication with more efficiency than the expanded version described in [4].
- Compute O1=(Alo×Blo⊕(Alo×Blo.X64))[G]
- Compute O2=(Ahi×Bhi.X128⊕(Ahi×Bhi.X64))[G]⊕O1
- Compute (O3=((Ahi⊕Alo)×(Bhi⊕Blo).X64)[G]⊕O2
- Output the final result O3
Each step consists in:
Updated on January 3rd, 2018
- One 64−bit×64−bit carry-less multiplication
- One 128-bit modulo G reduction
- A few exclusive-or operations
The product operands for the third step (O3) may be computed in parallel to step 1 and 2, they consist in two 64-bit exclusive or operations.
We can factorize the previous steps into a new primitive: GCMRED.
GCMRED(A,B,α,β,C), with A and B 64-bit operand, C a 128-bit operand and α,β coefficients in {0,1,X64,X128} implements the following operations:
GCMRED(A,B,α,β,C)=(A×B).(α+β)[G]⊕C
We can then compact a full GCM multiplication into 3 consecutives application of GCMRED:
GCM(Afull,Bfull)=GCMRED(Ahi⊕Alo,Bhi⊕Blo,X64,0, GCMRED(Ahi,Bhi,X64,X128, GCMRED(Alo,Blo,1,X64,0)))
As you can see we do not need the full range of α and β: only 3 variants are required, namely: (X64,0), (X64,X128) and (1,X64). Thus only those 3 may be encoded as possible instructions.
We can factorize the previous steps into a new primitive: GCMRED.
GCMRED(A,B,α,β,C), with A and B 64-bit operand, C a 128-bit operand and α,β coefficients in {0,1,X64,X128} implements the following operations:
GCMRED(A,B,α,β,C)=(A×B).(α+β)[G]⊕C
We can then compact a full GCM multiplication into 3 consecutives application of GCMRED:
GCM(Afull,Bfull)=GCMRED(Ahi⊕Alo,Bhi⊕Blo,X64,0, GCMRED(Ahi,Bhi,X64,X128, GCMRED(Alo,Blo,1,X64,0)))
As you can see we do not need the full range of α and β: only 3 variants are required, namely: (X64,0), (X64,X128) and (1,X64). Thus only those 3 may be encoded as possible instructions.
Complexity of GCMRED
A possible implementation of GCMRED is illustrated by the figure below. There are 4 64-bit inputs: A, B ,Chi and Clo and 2 64-bit output corresponding to the high and low part of the output.
The desisgn consists in :
- A 64×64 Carry-Less Multiplication
- A 128-bit GCM modulo reducer
- 6 64-bit exclusive or operators
- 4 64-bit and operators (to zeroify signals)
- 2 64-bit 3-input multiplexers
A naive CLM implementations would require around 4096 and gates (with two 1-bit inputs and one 1-bit output) and about the same number of xor gates (give or take a few). The GCM modulo reducer as described in a previous section requires about 576 xor gates.
If we consider the silicon area of a xor gates is about twice those of a nand gate (generally used as reference for silicon design comparison), the silicon area of an and gate is 32 times a nand gate and the silicon area of a 3-input multiplexer (with one input to zero) to be 5 times a nan gate we can evaluate the overcost of GCMRED compare to a simple CLM multiplication to be roughly around 25%.
This delta may be under estimated as the area to implement a carry less multiply can be further reduced from the naive implementation by using the Karatsuba algorithm recursively to obtain a more efficient design.
Endianess and GCM
In the AES version of Galois Counter Mode, input are assumed to be received in reverse order (least significant bit correspond to the term X127 in a 128-bit value) which implies both inputs and outputs must be permuted before and after standard polynomial computations to ensure correct results. Such a permutation is not performed by GCMRED. The cost of supporting it may be limited to introducing a level of inputs multiplexers and a level of output multiplexers to select between standard I/Os and permuted ones (allowing only two permutations).
Conclusion
This small technical report suggests a small addition to the PCLMULQDQ design with which a 25% area increase will bring down the cost of computing a GCM multiplication to five operations (with a critical path length of 3).
Future works include: synthesising the design and a standard CLM using a standard silicon technology node for better comparison (with critical path and latency analysis), comparing more accurately a full GCM implementation using the new GCMRED instruction operation with Intel's best implementation (including folding and factorized bit reversing).
Updated on January 3rd, 2018
Thank you to Julien LM and Hugues de LSG and Arnaud O for pointing many mistakes in the first versions.
Second update on August 27th, 2018, fixing typo in GCMRED description
Second update on August 27th, 2018, fixing typo in GCMRED description
References:
- [1] Wikipedia's page on GCM
- [2] Intel's Intrinsic guide page on PCLMULQDQ
- [3] Intel's whitepaper on efficient GCM implementation on Intel Architecture (pdf)
- [4] Intel's whitepaper: Intel ® Carry-Less Multiplication Instruction and its Usage for computing the GCM Mode
- [5] Wikipedia's page on Karatsuba Multiplication algorithm
No comments:
Post a Comment