Coding Illness: cryptography

Showing posts with label cryptography. Show all posts

New instruction for Fast GCM multiplication

New Operation/Instruction for Fast GCM multiplication

In this article we will study how we could compute GCM multiplication with only 3 instructions (and two exclusive 64-bit or) by extending a very interesting white paper by Intel: Intel ® Carry-Less Multiplication Instruction and its Usage for computing the GCM Mode ([4]).

We will first introduce GCM, then survey the basic of the carry less multiplication and the Karatsuba algorithm for multiplication before detailing the modulo reduction method of [4]. Then we will introduce our new operation and evaluate its hardware implementation.

Galois Counter Mode

Galois-Counter Mode or GCM ([1]) is a block cipher mode of operation which uses a multiplication in a finite field (Galois Field or $GF$) as basis for the authentication procedure. The Galois Field used is $ GF(2)[X] \!/\!_{X^{128} + X^7 + X ^2 + 1}$. The fact that the irreducible polynomial used to build the field has a very low number of non-zero terms is very useful to implement a fast modulo reduction.

Carry Less Multiplication

In the following most binary values encode polynomial over $GF(2)[X]$. For example a 64-bit value $A$ encodes $A=\sum_{i=0}^{63}a_i . X^i$ where a_i is the i-th bit of A (and a coefficient in $GF(2)$). The addition of two polynomials $A$ and $B$ is simply their binary xor: $A + B = \sum_{i=0}^{63}(a_i + b_i \mod 2) = \sum_{i=0}^{63}(a_i \oplus b_i ) = A \oplus B $.

A multiplication of polynomials over $GF(2)[X] is called a Carry Less Multiplication because there are no carry propagation between monomial term of distinct degrees.
The following listing shows a possible implementation of a Carry-Less multiplication of two 64-bit polynomials, giving a 128-bit result. As we operate on polynomials, the accumulate operation is a XOR which contrary to an addition does not propagate a carry.

__int128 poly_mul_64x64_128(uint64_t a, uint64_t b)
{
   int i;
   __int128 result = 0;
   for (i = 0; i < 64; ++i) {
       // result +=  b . X^i if a_i is non-zero
       result ^= ((a >> i) & 1) ? (b << i) : 0;
   }

}

Recently, in x86 architecture, the carry-less multiplication operation is implemented by a specific instruction: PCLMULQDQ ([2]).

Assuming the availability of a $64-bit \times 64-bit \rightarrow 128-bit $ carry-less operation, a full GCM reduction can be performed by first computing the full $128 \times 128$ CLM using 4 $64 \times 64$ multiply operations and then performing a $mod G$ reduction on the $128$ most significant bits of the multiplication result.

Fast GCM Multiplication

A very interesting white paper by intel [4] sums up a lot of good ideas on how to implement a fast GCM multiplication.
The proposal is two-fold:

Use Karatsuba algorithm [5] to accelerate the multiplication
Use the structure of the field to accelerate the reduction part

This white paper contains other interesting idea (folding ...) that will not be detailed in this article.

Karatsuba multiplication

We wish to compute $M = A \times B $, where $A$ and $B$ are 128-bit polynomial (degree 127 in $GF(2)[X]$). Thus M is at of degree 255. Let us split A and B into high and low part:

$$ A = A_{hi}. X^{64} + A_{lo}, B = B_{hi}. X^{64} + B_{lo}$$

$$ M = A_{hi} \times B_{hi} . X^{128} + (A_{hi} \times B_{lo} + A_{lo} \times B_{hi}) . X^{64} + A_{lo} \times B_{lo} $$

Assuming $ C = A_{hi} \times B_{hi}$, and $ D = A_{lo} \times B_{lo}$ and $E = (A_{hi} + A_{lo}) $ and $F = B_ {hi} + B_{lo}$, $ (A_{hi} \times B_{lo} + A_{lo} \times B_{hi}) $ can be writen as $$ (A_{hi} \times B_{lo} + A_{lo} \times B_{hi}) = E \times F - C - D $$.

As we are working in $GF(2)[X]$, the operation $+$ is equivalent to $-$: both are binary $xor$.

We went from using 4 half multiplications in the standard algorithm to only 3 with Karatsuba (while appending a few additions). Additions are considered cheap compared to multiplications (and they generally are) so it should not be an issue.

Fast Reduction

[4] provides the following algorithm (Algorithm 3) to compute, $R=M_{hi} \times X^t (\mod G)$, assuming an input $M_{hi}$ and $G$ an irreducible polynomial of degree $t$, $s=max(degree( M_{hi}))+1$ (for GCM, $s=128$):

Preprocessing: we compute $g$ and $q$, $g$ is the polynomial built by the $t-1$ least coefficient of $G$ and $q$ is the quotient of the division of $X^{t+s}$ by $G$ where $s$ is the degree of the quotient of the division of $X^t \times M$ by $G$
Compute $A = M_{hi} \times q$,
Multiply the $s$ most significant terms of $A$ by $g$
Output the $t$ least significant terms of the Step 3.

Let us apply this algorithm to the case of GCM: $G = X^{128} + X^7 + X^2 + X + 1$, $g=X^7 + X^2 + X + 1$. We want to compute $M = A \times B $, where $A$ and $B$ are polynomials of degree 127.

Let us first compute $q$ such that $X^{256}=q \times G + r $ with $degree(r) < degree(G)$, it is easy to verify that $X^{256}=G^2 \oplus x^{14}+x^4+x^2+1 $, thus $q = G$.

We split M in 4 part of 64 term each: from most significant to least signifiant $M_3, M_2, M_1, M_0$, $M_{hi} =[ M_3 : M_2] = M_3 . X^{64} \oplus M_2 $.

$$ M [G] = M_1 . X^{64} \oplus M_0 \oplus ((M_3 X^{64} \oplus M_2) . X^{128} [G]) $$

To get $((M_3 X^{64} \oplus M_2) . X^{128} [G])$, we use the algorithm from [4]:
$$ A = (M_3 X^{64} \oplus M_2) \times q $$
As we only need the 128 most significan terms of $A$, $A_{hi}$ we can compute them as follows:

\begin{equation}
\begin{split}
A_{hi} & = [M_3 : M_2] \oplus [M_3 : M_2 ] >> (128 - 7) \oplus [M_3 : M_2 ] >> (128 - 2) \\
&\oplus [M_3 : M_2 ] >> (128 - 1) \oplus [M_3 : M_2 ] >> (128 - 0) \\
& = [M_3 : M_2] \oplus [0 : M_3 ] >> 57 \oplus [0 : M_3 ] >> 62 \oplus [0 : M_3] >> 63 \\
& = [M_3 : (M_2 \oplus M_3 >> 57 \oplus M_3 >> 62 \oplus M_3 >> 63)]
\end{split}
\end{equation}

We now compute $R_{lo}$ from $ R = A_{hi} \times g $
\begin{equation}
\begin{split}
R &= A_{hi} \times g \\
&= A_{hi} \times (X^7 + X ^2 + X + 1) \\
&= [A_{hi,1} : A_{hi,0}] << 7 \oplus [A_{hi,1} : A_{hi,0}] << 2 \oplus [A_{hi,1} : A_{hi,0}] << 1 \oplus [A_{hi,1} : A_{hi,0}]
\end{split}
\end{equation}

Computing $A_{hi}$ requires 3 64-bit right shifts and 3 64-bit exclusive-or operations.
Computing $R$ from $A_{hi}$ is a little more expensive: it requires 3 128-bit shifts and 3 128-bit exclusive or operations. Overall a hardware implementation of this reduction will require around 576 standard xor gates (static shifts being free).

Merging operation for GCM multiplication

We are now going to merge the operations described above and the Karatsuba algorithm. The idea is to suggest a "generic" operation that could be implemented as processor instruction to accelerate GCM multiplication with more efficiency than the expanded version described in [4].

Compute $O_1 = (A_{lo} \times B_{lo} \oplus (A_{lo} \times B_{lo} . X^{64})) [G] $
Compute $O_2 = (A_{hi} \times B_{hi} . X^{128} \oplus (A_{hi} \times B_{hi} . X^{64})) [G] \oplus O_1 $
Compute $(O_3 = ((A_{hi} \oplus A_{lo}) \times (B_{hi} \oplus B_{lo}) . X^{64}) [G] \oplus O_2 $
Output the final result $O_3 $

Each step consists in:

One $64-bit \times 64-bit$ carry-less multiplication
One 128-bit modulo $G$ reduction
A few exclusive-or operations

The product operands for the third step ($O_3$) may be computed in parallel to step 1 and 2, they consist in two 64-bit exclusive or operations.
We can factorize the previous steps into a new primitive: GCMRED.

$GCMRED(A, B, \alpha, \beta, C)$, with $A$ and $B$ 64-bit operand, $C$ a 128-bit operand and $\alpha, \beta$ coefficients in $\{0, 1, X^{64}, X^{128}\}$ implements the following operations:
$$ GCMRED(A, B, \alpha, \beta, C) = (A \times B) . (\alpha + \beta) [G] \oplus C $$
We can then compact a full GCM multiplication into 3 consecutives application of $GCMRED$:

\begin{equation*}
\begin{split}
GCM(A_{full}, B_{full}) &= GCMRED(A_{hi} \oplus A_{lo}, B_{hi} \oplus B_{lo}, X^{64}, 0, \\
& \ \ \ \ \ \ GCMRED(A_{hi} , B_{hi} , X^{64}, X^{128}, \\
& \ \ \ \ \ \ \ \ \ GCMRED(A_{lo} , B_{lo} , 1, X^{64}, 0 )))
\end{split}
\end{equation*}

As you can see we do not need the full range of $\alpha$ and $\beta$: only 3 variants are required, namely: $(X^{64}, 0)$, $(X^{64}, X^{128})$ and $(1, X^{64})$. Thus only those 3 may be encoded as possible instructions.

Complexity of GCMRED

A possible implementation of GCMRED is illustrated by the figure below. There are 4 64-bit inputs: $A$, $B$ ,$C_{hi}$ and $C_{lo}$ and 2 64-bit output corresponding to the high and low part of the output.

The desisgn consists in :

A $64 \times 64$ Carry-Less Multiplication
A 128-bit GCM modulo reducer
6 64-bit exclusive or operators
4 64-bit and operators (to zeroify signals)
2 64-bit 3-input multiplexers

A naive CLM implementations would require around 4096 $and$ gates (with two 1-bit inputs and one 1-bit output) and about the same number of $xor$ gates (give or take a few). The GCM modulo reducer as described in a previous section requires about 576 $xor$ gates.

If we consider the silicon area of a xor gates is about twice those of a nand gate (generally used as reference for silicon design comparison), the silicon area of an $and$ gate is $ \frac{3}{2}$ times a nand gate and the silicon area of a 3-input multiplexer (with one input to zero) to be 5 times a nan gate we can evaluate the overcost of GCMRED compare to a simple CLM multiplication to be roughly around $25\%$.

This delta may be under estimated as the area to implement a carry less multiply can be further reduced from the naive implementation by using the Karatsuba algorithm recursively to obtain a more efficient design.

Endianess and GCM

In the AES version of Galois Counter Mode, input are assumed to be received in reverse order (least significant bit correspond to the term $X^{127}$ in a 128-bit value) which implies both inputs and outputs must be permuted before and after standard polynomial computations to ensure correct results. Such a permutation is not performed by GCMRED. The cost of supporting it may be limited to introducing a level of inputs multiplexers and a level of output multiplexers to select between standard I/Os and permuted ones (allowing only two permutations).

Conclusion

This small technical report suggests a small addition to the PCLMULQDQ design with which a 25% area increase will bring down the cost of computing a GCM multiplication to five operations (with a critical path length of 3).

Future works include: synthesising the design and a standard CLM using a standard silicon technology node for better comparison (with critical path and latency analysis), comparing more accurately a full GCM implementation using the new GCMRED instruction operation with Intel's best implementation (including folding and factorized bit reversing).

Updated on January 3rd, 2018

Thank you to Julien LM and Hugues de LSG and Arnaud O for pointing many mistakes in the first versions.
Second update on August 27th, 2018, fixing typo in GCMRED description

References:

Building an OpenSSL Engine: Part1

Let's start the engine

In Part 0, we learn how to developp a basic OpenSSL engine : a dynamic library which offloads cryptography computation from the crypto library.The second step is to use our engine.

OpenSSL is a complete solution, it provides:

libraries for cryptography (libcrypto), for secure connections management (libssl)
numerous command line utilities (e.g. certificate generations, encryption ..)

OpenSSL CLI (command line interface) is accessible through the "openssl" command. Three options to this command are of interest to us:

openssl engine which lists available engines or load a new engine
openssl dgst which computes a message digest from a given data source (stdin, input file ..)
openssl enc which encrypts or decrypts messages (which we will use in a later article)

We already saw in part 0 how to use openssl engine to check that our newly built engine could be loaded successfully. Let us now use it through openssl dgst .

$ echo abc | openssl dgst -sha1
(stdin)= 03cfd743661f07975fa2f1220c5194cbaff48451

This first example computes the SHA1 digest of the string "abc" using the standard implementation of the SHA1 algorithm provided by openssl. The string is forwarded through a pipe but it could have been read from a file (using -in <input filename> option ).
Let us now apply the same command with an extra -engine option to force openssl to rely on our own SHA1 implementation:

$ echo abc | openssl dgst -engine `pwd`/build/engine_ex.so -sha1

engine "engineX" set.

(stdin)= 03cfd743661f07975fa2f1220c5194cbaff48451

Happilly the digests match. You can also notice that our engine was set which means the openssl registered it as requested. (Note: To make sure OpenSSL in indeed using your engine you can introduce a small error in the digest computed by the engine (e.g. in the function sha1_final in "e_ex.c" ) to make sure it is visible in the result).

Accessing our engine in a C program

OpenSSL CLI is a powerful tools but if you want to developp your own efficient cryptography application you may be more inclined to access OpenSSL capabilities through its C API.
In the following sections we are going to look in details at the example tests/basic_digest.c, available on the article series github.
In those sections we will see several things: how to write an OpenSSL configuration file so OpenSSL can learn the existence of our engine and where to load it from, how to load this configuration file in a program using OpenSSL libraries (loading our engine with it), how to use explicit our engine as a digest implementation to compute a message digest and finally how to define our engine as the by default implementation of OpenSSL so each time a SHA1 digest is computed through the EVP API the computation is offloaded to our engine.

Writing an OpenSSL Configuration File

A clean way to load an engine from within a C program is to integrate the engine description to an OpenSSL configuration file and to load the configuration file in your application.
The following example is a configuration file which described how to load the engine built in the previous lesson:

openssl_conf            = openssl_def
[openssl_def]
engines = engine_section
[engine_section]
engine_x = engine_x_section
[engine_x_section]
engine_id = engineX
dynamic_path = ${ENV::PWD}/build/engine_ex.so 
init = 1

An interesting feature of OpenSSL configuration file is the possibility to include envrionemment (shell) variables with the ${ENV::<var name>} syntax.

Initializing OpenSSL and loading our engine

There are several calls necessary to initialize OpenSSL properly, a detailed explaination can be found in OpenSSL wiki. In this example we will simply make the basic calls to enable engine load and use:

// initializing OpenSSL library
OPENSSL_load_builtin_modules();
ERR_load_crypto_strings();
OpenSSL_add_all_algorithms();
ENGINE_load_dynamic();

Once the library is initialized we may load the configuration file:

// building OpenSSL's configuration file path
char openssl_cnf_path[] = "./openssl.cnf"; 

// loading configuration
if (CONF_modules_load_file(openssl_cnf_path, "openssl_conf", 0) != 1) {
  fprintf(stderr, "OpenSSL failed to load required configuration\n");
  ERR_print_errors_fp(stderr);
  return 1;
}

The configuration loading will make the library load the engine described in "openssl.cnf". Thereafter, it may be accessed as follows:

ENGINE* eng = ENGINE_by_id("engineX");
if(NULL == eng) {
  printf("failed to retrieve engine by id (mppa)\n");
  return 1;
}

Forcing the use of our engine in an explicit digest call

The handle of type Engine* may be used in a EVP_DigestInit_ex call to force the use of the engine.

unsigned char digest[20];
// message digest context declaration
EVP_MD_CTX sctx;
EVP_MD_CTX* ctx = &sctx;
EVP_MD_CTX_init(ctx);
// declaring the use of SHA-1 and specifying the
// engine as the implementation to use
EVP_DigestInit_ex(ctx, EVP_sha1(), eng);
// updating digest with input data
EVP_DigestUpdate(ctx, data, len);
unsigned int dlen = -1;
// finishing and retrieving digest
EVP_DigestFinal(ctx, digest, &dlen);

The previous examples (basic_digest.c) is available on the article serie github. It must be linked with libcrypto and libssl (e.g. using -lcrypto and -lopenssl link options).

Configuring our engine to be used as default

OpenSSL provides through the ENGINE API some functions to bind an engine as the default method for any algorithm including digest:

// defining our engine as the default implementations
// for digest algorithms
ENGINE_set_default_digests(eng);

// Initializing a SHA-1 digest context
// without specifying an implementation,
// thus falling back on the default method
EVP_DigestInit(ctx, EVP_sha1());

An other way to define our engine as the default method for algorithms it supports, is to add the line "default_algorithms = ALL" to our engine section in openssl configuration file :

openssl_conf            = openssl_def
[openssl_def]
engines = engine_section
[engine_section]
engine_x = engine_x_section
[engine_x_section]
engine_id = engineX
dynamic_path = ${ENV::PWD}/build/engine_ex.so 
default_algorithms = ALL
init = 1

The "ALL" option value associates our engine as the default implementation for all algorithms (restrain to all the algorithm the engine supports). The other possible values are: RSA, DSA, DH, EC, RAND, CIPHERS, DIGESTS, PKEY, PKEY_CRYPTO and PKEY_ASN1. As our engine only support a digest (SHA1), the values ALL and DIGESTS have the same outcome in our configuration : declaring our engine as the default implementation for SHA1.

Warning: some OpenSSL non standard methods (such as SHA256(...) ) bypass the EVP API and do not depend on the default implementation. Those methods only use the software implementation built-in OpenSSL and never the engines even if defined as default methods.

References

Wikipedia's page on SHA1 https://en.wikipedia.org/wiki/SHA-1
OpenSSL manual page on Engine API https://wiki.openssl.org/index.php/Manual:Engine%283%29
OpenSSL manual page on library initialization https://wiki.openssl.org/index.php/Library_Initialization
External documentation / example of OpenSSL configuration files : file https://www.nlnetlabs.nl/downloads/publications/hsm/hsm_node18.html
An other tutorial which talks about OpenSSL configuration files : https://www.sinodun.com/2009/02/developing-an-engine-for-openssl/

Building an OpenSSL Engine: Part0

Introduction

The OpenSSL library takes in charge most of the aspects of establishing and managing secure connections (certificate generation and storage, key exchange, asymmetric and symmetric cryptography ...). One of its component is the crypto library which contains implementations for the most current digest and cipher algorithms (at least those required by SSL and TLS).
Those algorithms are generally considered computationally intensive and some offloading solutions exist. They consist in using hardware acceleration to offload expensive cryptography computation from your server/workstation. OpenSSL supports such offloading through the engine API. This API permits the declaration and binding of external hardware to take care of digest / cipher and public key cryptography acceleration. The following blog posts will expose how to develop, configure and use your own OpenSSL engine.

Hands-on: Let's start coding a minimal engine

The following code snippets (widely inspired by Reference 2) is the minimal code you need to declare an engine. It just defines a engine name, id and a bind function.

/** file e_ex.c */

#include <openssl/engine.h>

/** Engine name and id */
static const char* engine_id   = "engineX";
static const char* engine_name = "Engine example";

static int bind(ENGINE* e, const char* id) 
{
  int ret = 0;
  if (!ENGINE_set_id(e, engine_id)) {
    fprintf(stderr, "ENGINE_set_id failed\n");
    goto end;
  }
  if (!ENGINE_set_name(e, engine_name)) {
    printf("ENGINE_set_name failed\n");
    goto end;
  }

  ret = 1;
end:
  return ret;
}

IMPLEMENT_DYNAMIC_BIND_FN(bind)
IMPLEMENT_DYNAMIC_CHECK_FN()

OpenSSL supports several build systems for engines. Some of them are built-in with the standard OpenSSL library and are available within the engine directory of OpenSSL Source. But the easiest way to add an engine is to use the dynamic loading capability of OpenSSL engine API and to build your engine as a shared library, for example using the following commands:

$ gcc -fPIC -o e_ex.o e_ex.c
$ gcc -shared -o e_ex.so -lcrypto e_ex.o

We can now try to load our engine with openssl to check that everything is OK:

$ openssl engine -t -c `pwd`/e_ex.so

(/<path_to_engine>/e_ex.so) Engine example
Loaded: (engineX) Engine example
     [ available ]

Functional engine

The previous step was sufficient to build a working engine, loadable by OpenSSL but it was not a very useful one. Now let us add a bit of functionality to it.
We will start by adding digest capability, that is we will declare our engine as capable of computing a hash function (namely SHA1).
To realize this we need to implement (at least) the following functions:

/* initialize hash computation */
int sha1_init(EVP_MD_CTX *ctx);
/* consume more data to hash */
int sha1_update(EVP_MD_CTX* ctx, const void* data, size_t count);
/* finish hashing and output digest */
int sha1_final(EVP_MD_CTX* ctx, unsigned char* md);

Those functions shall return 1 on success. In older versions of OpenSSL, those implementations needed to be inserted into a new EVP_MD structure :

static const EVP_MD digest_sha1 = {
  NID_sha1,   /* Name id for SHA1 digest */
  0,          /* pkey_type, NID with private key */
  20,         /* message digest size */
  0,          /* Flags */
  sha1_init,   /* initialization */
  sha1_update, /* update */
  sha1_final,  /* finalize */
  NULL,        /* copy */
  NULL,        /* cleanup */
  EVP_PKEY_NULL_method, /* require pkey type */
  64,                   /* internal block size */
  sizeof(SHA_CTX),      /* size of md_data structure */
  NULL                  /* control function */
};

The next step is to implement a digest selector, a function which will list the digest implementations supported by our engine and allow the program to select one.

static digest_nids[] = {NID_sha1, 0};

int digest_selector(ENGINE* e, const EVP_MD** digest,

                    const int** nids,

                    int nid)
{
  int ok = 1;
  if (!digest) {
    /* expected to return the list of supported NIDs */
    *nids = digest_nids;
    return (sizeof(digest_nids) - 1) / sizeof(digest_nids[0]);
  }

  /** Request for a specific digest */
  switch (nid) {
  case NID_sha1:
    *digest = &digest_sha1;
    break;
  default:
    ok = 0;
    *digest = NULL;
    break;
  }
  return ok;
}

At last we will register this selector for use by the engine.This is done by adding a call to ENGINE_set_digest within our bind function:

static int bind(ENGINE* e, const char* id) 
{
  int ret = 0;
  if (!ENGINE_set_id(e, engine_id)) {
    fprintf(stderr, "ENGINE_set_id failed\n");
    goto end;
  }
  if (!ENGINE_set_name(e, engine_name)) {
    printf("ENGINE_set_name failed\n");
    goto end;
  }
  if (!ENGINE_set_digests(e, digest_selector)) {
    printf("ENGINE_set_digest failed\n");
    goto end;
  }

  ret = 1;
end:
  return ret;
}

After building back our engine shared library we can load it again and see that the SHA1 functionality has been registered:

$ openssl engine -t -c `pwd`/e_ex.so

(/<path_to_engine>/e_ex.so) Engine example
Loaded: (engineX) Engine example

 [SHA1]
     [ available ]

Improvements

The way we declared our SHA1 digest method is deprecated in recent version of OpenSSL. It was not very future proof as we relied on an internal OpenSSL Structure susceptible to change (without notice !).
More recent version of OpenSSL ,offers ad-hoc functions (e.g. EVP_MD_meth_new EVP_CIPHER_meth_new) to build dynamically message digest and cipher implementations.

In the next articles, we will see

How to use your openssl engine using openssl CLI (e.g. openssl dgst)
How to use your openssl engine from a C file (use the configuration interface)
How to extend your openssl engine to support ciphers

Source Code:

The source code for the example engine can be downloaded here (branch part0):
https://github.com/nibrunie/OSSL_EngineX

References

OpenSSL website: https://www.openssl.org/
OpenSSL official blog posts on building your own engine

An other tutorial on a minimal engine (with configuration file description) https://www.sinodun.com/2009/02/developing-an-engine-for-openssl/

Updates:

Updated on August 1st following some remarks from my brother (many typos fixed)