# Compact Hardware Implementations of ChaCha, BLAKE, Threefish, and Skein on FPGA 

Nuray At $^{*}$, Jean-Luc Beuchat ${ }^{\dagger}$, Eiji Okamoto ${ }^{\ddagger}$, İsmail San*, and Teppei Yamazaki ${ }^{\ddagger}$<br>*Department of Electrical and Electronics Engineering, Anadolu University, Eskişehir, Turkey<br>Email: \{nat, isan\}@anadolu.edu.tr<br>${ }^{\dagger}$ ELCA Informatique SA, Av. de la Harpe 22-24, Case postale 519, 1001 Lausanne, Switzerland<br>Email: jeanluc.beuchat@gmail.com<br>${ }^{\ddagger}$ Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan<br>Email: okamoto@risk.tsukuba.ac.jp, yamazaki@cipher.risk.tsukuba.ac.jp


#### Abstract

The cryptographic hash functions BLAKE and Skein are built from the ChaCha stream cipher and the tweakable Threefish block cipher, respectively. Interestingly enough, they are based on the same arithmetic operations, and the same design philosophy allows one to design lightweight coprocessors for hashing and encryption. The key element of our approach is to take advantage of the parallelism of the algorithms to deeply pipeline our Arithmetic an Logic Units, and to avoid data dependencies by interleaving independent tasks. We show for instance that a fully autonomous implementation of BLAKE and ChaCha on a Xilinx Virtex-6 device occupies 144 slices and three memory blocks, and achieves competitive throughputs. In order to offer the same features, a coprocessor implementing Skein and Threefish requires a substantial higher slice count.


## I. Introduction

The cryptographic hash functions BLAKE [1] and Skein [2] are built from the ChaCha stream cipher [3] and the tweakable Threefish block cipher [2], respectively. It is therefore tempting to design compact unified hardware architectures able to hash and encrypt a message. Such processors are for instance valuable for constrained environments, where some security protocols mainly rely on cryptographic hash functions [4]. Furthermore, as emphasized by Kerckhof et al., "fully unrolled and pipelined architectures may sometimes hide a part of the algorithms' complexity that is better revealed in compact implementations" [5]. In order to have a deeper understanding of the computational efficiency of ChaCha, BLAKE, Threefish, and Skein, we extend here our work presented in [6], [7] and propose novel lightweight coprocessors. The key element of our approach is to take advantage of the parallelism of the algorithms to deeply pipeline our Arithmetic an Logic Units (ALUs), and to avoid data dependencies by interleaving independent tasks.

Throughout this article, all operands are $w$-bit unsigned integers and the following notation is adopted:

- $\boxplus$ : addition modulo $2^{w}$;
- $\boxminus$ : subtraction modulo $2^{w}$;
- $\wedge$ : bitwise AND;
- $V$ : bitwise OR;
- $\oplus$ : bitwise exclusive OR;
- $\gg k$ : rotation by $k$ bits to the right;
- $\ll k k$ : rotation by $k$ bits to the left.

The rest of the article is organized as follows: after a brief overview of Threefish (Section II), Skein (Section III), ChaCha (Section IV), and BLAKE (Section V), we describe our design philosophy and compact hardware implementations (Section VI). We discuss our implementation results on several Xilinx Field-Programmable Gate Arrays (FPGAs) in Section VII and conclude in Section VIII

## II. The Threefish Block Cipher

The design philosophy of Threefish is that "a larger number of simple rounds is more secure than fewer complex rounds" [2]. The key schedule can be computed in a few clock cycles, which is an important consideration in order to build a compression function from a block cipher.
Threefish operates entirely on unsigned 64-bit integers and involves only three operations: rotation of $k$ bits to the left, bitwise exclusive OR, and addition modulo $2^{64}$. Therefore, the plaintext $P$ and the cipher key $K$ are converted to $N_{w} 64$-bit words. Note that the number of words $N_{w}$ and the number of rounds $N_{r}$ depend on the key size (Table $I$ ). The size of a plaintext block is given by $N_{b}=8 \cdot N_{w}$ bytes.

Table I
Number of rounds of Threefish for different key sizes.

| Key size <br> [bits] | \# 64-bit words <br> $\boldsymbol{N}_{\boldsymbol{w}}$ | \# rounds <br> $\boldsymbol{N}_{\boldsymbol{r}}$ | Block size <br> $\boldsymbol{N}_{\boldsymbol{b}}$ [bytes] |
| :---: | :---: | :---: | :---: |
| 256 | 4 | 72 | 32 |
| 512 | 8 | 72 | 64 |
| 1024 | 16 | 80 | 128 |

The key schedule generates the subkeys from a block cipher key $K=\left(k_{0}, k_{1}, \ldots, k_{N_{w}-1}\right)$ and a 128 -bit tweak $T=\left(t_{0}, t_{1}\right) . K$ and $T$ are extended with one parity word (Algorithm 1. lines 1 and 2). Each subkey is a combination of $N_{w}$ words of the extended key, two words of the extended tweak, and a counter $s$ (Algorithm 1 , lines 5 to 9). Note that the extended key and the extended tweak are rotated by one word position between two consecutive subkeys.

Table II
Permutations used by the Skein functions (REPRINTED FRom [2]).

|  | $i$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| :---: | :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| $\pi(i)$ | $N_{w}=4$ | 0 | 3 | 2 | 1 |  |  |  |  |  |  |  |  |  |  |  |
|  | $N_{w}=8$ | 2 | 1 | 4 | 7 | 6 | 5 | 0 | 3 |  |  |  |  |  |  |  |
|  | $N_{w}=16$ | 0 | 9 | 2 | 13 | 6 | 11 | 4 | 15 | 10 | 7 | 12 | 3 | 14 | 5 | 8 |

```
Algorithm 1 Key schedule of Threefish.
Input: A block cipher key \(K=\left(k_{0}, k_{1}, \ldots, k_{N_{w}-1}\right)\);
    a tweak \(T=\left(t_{0}, t_{1}\right)\); the constant \(C_{240}=\)
    1BD11BDAA9FC1A22.
Output: \(N_{r} / 4+1\) subkeys \(k_{s, 0}, k_{s, 1}, \ldots, k_{s, N_{w}-1}\), where
    \(0 \leq s \leq N_{r} / 4\).
    \(k_{N_{w}} \leftarrow C_{240} \oplus \bigoplus_{i=0}^{N_{w}-1} k_{i} ;\)
    \(t_{2} \leftarrow t \oplus t_{1} ;\)
    for \(s \leftarrow 0\) to \(N_{r} / 4\) do
        for \(i \leftarrow 0\) to \(N_{w}-4\) do
            \(k_{s, i} \leftarrow k_{(s+i) \bmod \left(N_{w}+1\right)} ;\)
        end for
        \(k_{s, N_{w}-3} \leftarrow k_{\left(s+N_{w}-3\right) \bmod \left(N_{w}+1\right)} \boxplus t_{s \bmod 3} ;\)
        \(k_{s, N_{w}-2} \leftarrow k_{\left(s+N_{w}-2\right) \bmod \left(N_{w}+1\right)} \boxplus t_{(s+1) \bmod 3} ;\)
        \(k_{s, N_{w}-1} \leftarrow k_{\left(s+N_{w}-1\right) \bmod \left(N_{w}+1\right)} \boxplus s ;\)
    end for
    return \(k_{s, 0}, k_{s, 1}, \ldots, k_{s, N_{w}-1}\), where \(0 \leq s \leq N_{r} / 4\);
```

A series of $N_{r}$ rounds (Figure 1 and Algorithm 2, lines 4 to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core of a round is the simple non-linear mixing function $\mathrm{Mix}_{d, j}$ (Algorithm 2, lines 13 and 14. It consists of an addition, a rotation by a constant $R_{d \bmod 8, j}$ (repeated every eight rounds and defined in [2, Table 4]), and a bitwise exclusive OR. A word permutation $\pi(i)$ (defined in Table [I]) is then applied to obtain the output of the round (Algorithm 22 line 17). Furthermore, a subkey is injected every four rounds (Algorithm 2, line 7).

Figure 2 describes a decryption round of Threefish- 256. It consists of the inverse word permutation followed by the inverse MIX functions. Note that subkeys are injected in reverse order.

## III. The Skein Family of Hash Functions

The Unique Block Iteration (UBI) chaining mode allows one to build a compression function out of a tweakable encryption function. Let $M$ be a message of arbitrary length up to $2^{99}-8$ bits. If the number of bits in $M$ is not a multiple of 8 , we append a bit 1 followed by a (possibly empty) string of 0's. This step guarantees that $M$ contains $N_{M}$ bytes. Then, we pad $M$ with $p$ zero bytes so that $N_{M}+p$ is a multiple of the block size $N_{b}$. We can now split $M$ into $N_{b}$-byte blocks $M_{0}, \ldots$, $M_{k-1}$, where $k=\left(N_{M}+p\right) / N_{b}$. Each block $M_{i}$ is processed with a unique tweak value $T_{i}$ encoding how many bytes have been processed so far, a type field (see [2] for details), and


Figure 1. One of the 72 encryption rounds of Threefish- 256.


Figure 2. One of the 72 decryption rounds of Threefish- 256 .
two bits specifying whether it is the first and/or last block. The UBI chaining mode is computed as:

$$
\begin{aligned}
H_{0} & \leftarrow G, \\
H_{i+1} & \leftarrow M_{i} \oplus E\left(H_{i}, T_{i}, M_{i}\right),
\end{aligned}
$$

where $G$ is a starting value of $N_{b}$ bytes.
In this work, we consider the normal hashing mode and refer the reader to [2] for a description of Skein-MAC and tree hashing with Skein. Skein is built on three invocations of UBI (Figure 3):

- Define a 32-byte configuration string $C$ that contains the length of the digest size (in bits), a schema identifier,

```
Algorithm 2 Encryption with the Threefish block cipher.
Input: A plaintext block \(P=\left(p_{0}, p_{1}, \ldots, p_{N_{w}-1}\right) ; N_{r} / 4+1\)
    subkeys \(k_{s, 0}, k_{s, 1}, \ldots, k_{s, N_{w}-1}\), where \(0 \leq s \leq N_{r} / 4\);
    \(4 N_{w}\) rotation constants \(R_{i, j}\), where \(0 \leq i \leq 7\) and \(0 \leq\)
    \(j \leq N_{w} / 2\).
Output: A ciphertext block \(C=\left(c_{0}, c_{1}, \ldots, c_{N_{w}-1}\right)\).
    for \(i \leftarrow 0\) to \(N_{w}-1\) do
        \(v_{0, i} \leftarrow p_{i} ;\)
    end for
    for \(d \leftarrow 0\) to \(N_{r}-1\) do
        for \(i \leftarrow 0\) to \(N_{w}-1\) do
            if \(d \bmod 4=0\) then
                \(e_{d, i} \leftarrow v_{d, i} \boxplus k_{d / 4, i} ; \quad\) (Key injection)
            else
                    \(e_{d, i} \leftarrow v_{d, i} ;\)
            end if
        end for
        for \(j \leftarrow 0\) to \(N_{w} / 2-1\) do
            \(f_{d, 2 j} \leftarrow e_{d, 2 j} \boxplus e_{d, 2 j+1} ; \quad\left(\operatorname{Mix}_{d, j}\right)\)
            \(f_{d, 2 j+1} \leftarrow f_{d, 2 j} \oplus\left(e_{d, 2 j+1} \lll R_{d \bmod 8, j}\right) ;\)
        end for
        for \(i \leftarrow 0\) to \(N_{w}-1\) do
            \(v_{d+1, i} \leftarrow f_{d, \pi(i)} ;\)
        end for
    end for
    for \(i \leftarrow 0\) to \(N_{w}-1\) do
        \(c_{i} \leftarrow v_{N_{r}, i} \boxplus k_{N_{r} / 4, i} ;\)
        (Key injection)
    end for
    return \(C=\left(c_{0}, c_{1}, \ldots, c_{N_{w}-1}\right)\);
```

and a version number [2, Table 7]. Compute the $N_{b}$-byte block $G_{0}$ :

$$
G_{0} \leftarrow \mathrm{UBI}\left(0, C, T_{\mathrm{cfg}} 2^{120}\right)
$$

Note that $G_{0}$ only depends on the digest size and can easily be precomputed.

- The message is then processed as follows:

$$
G_{1} \leftarrow \mathrm{UBI}\left(G_{0}, M, T_{\mathrm{msg}} 2^{120}\right) .
$$

- A third call to UBI is required to achieve hashingappropriate randomness:

$$
H \leftarrow \operatorname{UBI}\left(G_{1}, 0, T_{\text {out }} 2^{120}\right)
$$

This transform allows one to produce arbitrary digest sizes (up to $2^{64}$ bits). If a single output block $H$ is not enough, one can use Threefish in counter mode to produce the digest.

## IV. The ChaCha Stream Cipher

The ChaCha family of stream ciphers was designed by Bernstein [3] to improve the diffusion per round of Salsa20 [8], while preserving the encryption rate. ChaCha operates on 32 bit words, and expands a 256 -bit key $\left(k_{0}, \ldots, k_{7}\right)$ and a 64 -bit nonce $\left(\mathrm{IV}_{0}, \mathrm{IV}_{1}\right)$ into a $2^{70}$-byte stream. A $b$-byte message is


Figure 3. Skein in normal hashing mode.
then encrypted (or decrypted) by XORing it with the first $b$ bytes of the stream.

ChaCha generates the stream by blocks of 64 bytes. In order to process the $i$ th block, ChaCha acts on a $4 \times 4$ matrix $M$ of 32-bit integers defined as follows:

$$
\left(\begin{array}{cccc}
m_{0} & m_{1} & m_{2} & m_{3} \\
m_{4} & m_{5} & m_{6} & m_{7} \\
m_{8} & m_{9} & m_{10} & m_{11} \\
m_{12} & m_{13} & m_{14} & m_{15}
\end{array}\right)=\left(\begin{array}{cccc}
c_{0} & c_{1} & c_{2} & c_{3} \\
k_{0} & k_{1} & k_{2} & k_{3} \\
k_{4} & k_{5} & k_{6} & k_{7} \\
t_{0} & t_{1} & \mathrm{IV}_{0} & \mathrm{IV}_{1}
\end{array}\right)
$$

where

- $c_{0}=61707865, c_{1}=3320646 \mathrm{E}, c_{2}=79622 \mathrm{D} 32$, and $c_{3}=6 \mathrm{~B} 206574$ are predefined constants;
- $t=\left(t_{0}, t_{1}\right)$ is a 64 -bit counter encoding the index $i$ (i.e. $\left.i=2^{32} t_{1}+t_{0}\right)$.
ChaCha transforms the matrix $M$ through a series of $N_{r}$ rounds (Algorithm 3). The algorithm is based on a nonlinear operation called quarter-round function and described by Algorithm 4. Matrix $M$ is copied into matrix $V$. Then, the evenand odd-numbered rounds of ChaCha apply the quarter-round function to each row and northwest-to-southeast diagonal of $V$, respectively. Eventually, a new block of the stream is generated by adding $V$ to the original matrix $M$ (Algorithm 3 line 15), and the block counter is incremented (Algorithm 33, lines 17 to 20.

Bernstein proposed 8-, 12 -, and 20 -round variants of ChaCha. Aumasson et al. introduced a novel method for differential cryptanalysis of ChaCha and broke the 7 -round variant [9]. Ishiguro et al. [10] improved the attack and concluded that "an adversary can distinguish keystream bits from random bits using a few input and output pairs of an initial key and initial vectors" of the 8 -round variant of ChaCha [10]. However, their attack cannot be directly applied to the 12 - and 20 -round variants.

## V. The BLAKE Family of Hash Functions

The BLAKE family combines three previously studied components, chosen by Aumasson et al. for their complementarity [1]: the iteration mode HAIFA, the internal structure of the hash function LAKE, and a modified version of Bernstein's stream cipher ChaCha as compression function. BLAKE is a family of four hash functions, namely BLAKE-224, BLAKE256, BLAKE-384, and BLAKE-512 (Table III). The main

```
Algorithm 3 Computation of a 64-byte block of the stream
of ChaCha.
Input: A key, a nonce, and a block counter stored in a matrix
    \(M\).
Output: A 64-byte block of the stream.
    for \(i \leftarrow 0\) to 15 do
        \(v[i] \leftarrow m[i] ;\)
    end for
    for \(i \leftarrow 0\) to \(N_{r} / 2-1\) do
        \(\operatorname{QUARTERROUND}\left(v_{0}, v_{4}, v_{8}, v_{12}\right)\);
        \(\operatorname{QUARTERROUND}\left(v_{1}, v_{5}, v_{9}, v_{13}\right)\);
        QUARTERROUND \(\left(v_{2}, v_{6}, v_{10}, v_{14}\right)\);
        QUARTERROUND \(\left(v_{3}, v_{7}, v_{11}, v_{15}\right)\);
        QUARTERROUND \(\left(v_{0}, v_{5}, v_{10}, v_{15}\right)\);
        QUARTERROUND \(\left(v_{1}, v_{6}, v_{11}, v_{12}\right)\);
        \(\operatorname{QUARTERROUND}\left(v_{2}, v_{7}, v_{8}, v_{13}\right)\);
        \(\operatorname{QUARTERROUND}\left(v_{3}, v_{4}, v_{9}, v_{14}\right)\);
    end for
    for \(i \leftarrow 0\) to 15 do
        \(v[i] \leftarrow v[i] \boxplus m[i] ;\)
    end for
    \(m_{12} \leftarrow m_{12} \boxplus 1 ;\)
    if \(m_{12}=0\) then
        \(m_{13} \leftarrow m_{13} \boxplus 1 ;\)
    end if
    Return \(M\) and \(V\);
```

```
Algorithm 4 The ChaCha quarter-round function.
Input: Four 32-bit integers \(a, b, c\), and \(d\).
Output: QUARTERROUND \((a, b, c, d)\).
    \(a \leftarrow a \boxplus b ;\)
    \(d \leftarrow(d \oplus a) \lll 16 ;\)
    \(c \leftarrow c \boxplus d ;\)
    \(b \leftarrow(b \oplus c) \lll 12 ;\)
    \(a \leftarrow a \boxplus b ;\)
    \(d \leftarrow(d \oplus a) \lll 8 ;\)
    \(c \leftarrow c \boxplus d ;\)
    \(b \leftarrow(b \oplus c) \lll 7 ;\)
    Return \(a, b, c\), and \(d\);
```

differences lie in the length of words $w$, the number of rounds $N_{r}$, and in some constants involved in the algorithm. In the following, we denote by BLAKE- $n$ the algorithm with an $n$-bit digest.

BLAKE- $n$ involves only two arithmetic operations: the addition modulo $2^{w}$ of two $w$-bit unsigned integers and the bitwise exclusive OR of two $w$-bit words. The latter is sometimes followed by a rotation of $\delta_{j}$ bits to the right. The four possible rotation distances depend on the digest size and are defined in Table III The compression function of BLAKE$n$ produces a new chain value $h^{\prime}=h_{0}^{\prime}, \ldots, h_{7}^{\prime}$ from a message block $m=m_{0}, \ldots, m_{15}$, a chain value $h=h_{0}, \ldots, h_{7}$, a salt $s=s_{0}, \ldots, s_{3}$, a counter $t=t_{0}, t_{1}$, and 16 constants $c_{i}$ defined in [1] p. 8]. This process consists of three steps. First, a

16 -word internal state $v=v_{0}, \ldots, v_{15}$ is initialized as follows:

$$
\begin{aligned}
&\left(\begin{array}{cccc}
v_{0} & v_{1} & v_{2} & v_{3} \\
v_{4} & v_{5} & v_{6} & v_{7} \\
v_{8} & v_{9} & v_{10} & v_{11} \\
v_{12} & v_{13} & v_{14} & v_{15}
\end{array}\right) \\
& \leftarrow\left(\begin{array}{cccc}
h_{0} & h_{1} & h_{2} & h_{3} \\
h_{4} & h_{5} & h_{6} & h_{7} \\
s_{0} \oplus c_{0} & s_{1} \oplus c_{1} & s_{2} \oplus c_{2} & s_{3} \oplus c_{3} \\
t_{0} \oplus c_{4} & t_{0} \oplus c_{5} & t_{1} \oplus c_{6} & t_{1} \oplus c_{7}
\end{array}\right)
\end{aligned}
$$

Then, a series of $N_{r}$ rounds is performed. Each of them consists of a transformation of the internal state $v$ based on the $G_{i}$ function described by Algorithm5 , where $\sigma_{r}$ denotes a permutation of $\{0, \ldots, 15\}$ parametrized by the round index $r$ (see Table IV. A column step updates the four columns of matrix $v$ as follows: $G_{0}\left(v_{0}, v_{4}, v_{8}, v_{12}\right), G_{1}\left(v_{1}, v_{5}, v_{9}, v_{13}\right)$, $G_{2}\left(v_{2}, v_{6}, v_{10}, v_{14}\right)$, and $G_{3}\left(v_{3}, v_{7}, v_{11}, v_{15}\right)$. Note that each call to $G_{i}$ updates a distinct column of matrix $v$. Since we focus on compact implementations of BLAKE in this work, we interleave the computation of $G_{0}, G_{1}, G_{2}$, and $G_{3}$. This approach allows us to design an ALU with four pipeline stages and to achieve high clock frequencies. Then, a diagonal step updates the four diagonals of $v$ : $G_{4}\left(v_{0}, v_{5}, v_{10}, v_{15}\right), G_{5}\left(v_{1}, v_{6}, v_{11}, v_{12}\right), G_{6}\left(v_{2}, v_{7}, v_{8}, v_{13}\right)$, and $G_{7}\left(v_{3}, v_{4}, v_{9}, v_{14}\right)$. Here again, each call to $G_{i}$ modifies a distinct diagonal of the matrix, allowing us to interleave the computation of $G_{4}, G_{5}, G_{6}$, and $G_{7}$.

```
Algorithm 5 The \(G_{i}\) function.
Input: A function index \(i\) and four \(w\)-bit integers \(a, b, c\), and
    \(d\).
Output: \(G_{i}(a, b, c, d)\).
    \(a \leftarrow a \boxplus b ;\)
    \(a \leftarrow a \boxplus\left(m_{\sigma_{r}(2 i)} \oplus c_{\sigma_{r}(2 i+1)}\right) ;\)
    \(d \leftarrow(d \oplus a) \ggg \delta_{0} ;\)
    \(c \leftarrow c \boxplus d ;\)
    \(b \leftarrow(b \oplus c) \ggg \delta_{1} ;\)
    \(a \leftarrow a \boxplus b ;\)
    \(a \leftarrow a \boxplus\left(m_{\sigma_{r}(2 i+1)} \oplus c_{\sigma_{r}(2 i)}\right) ;\)
    \(d \leftarrow(d \oplus a) \ggg \delta_{2} ;\)
    \(c \leftarrow c \boxplus d ;\)
    \(b \leftarrow(b \oplus c) \ggg \delta_{3} ;\)
```

At the end of the last round, a new chain value $h^{\prime}=$ $h_{0}^{\prime}, \ldots, h_{7}^{\prime}$ is computed from the internal state $v$ and the previous chain value $h$ (finalization step):

$$
\begin{array}{ll}
h_{0}^{\prime} \leftarrow h_{0} \oplus s_{0} \oplus v_{0} \oplus v_{8}, & h_{4}^{\prime} \leftarrow h_{4} \oplus s_{0} \oplus v_{4} \oplus v_{12}, \\
h_{1}^{\prime} \leftarrow h_{1} \oplus s_{1} \oplus v_{1} \oplus v_{9}, & h_{5}^{\prime} \leftarrow h_{5} \oplus s_{1} \oplus v_{5} \oplus v_{13}, \\
h_{2}^{\prime} \leftarrow h_{2} \oplus s_{2} \oplus v_{2} \oplus v_{10}, & h_{6}^{\prime} \leftarrow h_{6} \oplus s_{2} \oplus v_{6} \oplus v_{14}, \\
h_{3}^{\prime} \leftarrow h_{3} \oplus s_{3} \oplus v_{3} \oplus v_{11}, & h_{7}^{\prime} \leftarrow h_{7} \oplus s_{3} \oplus v_{7} \oplus v_{15} .
\end{array}
$$

In order to guarantee that the length $\ell$ of a message is a multiple of the block size $b$, Aumasson et al. define the following padding scheme [1]:

Table III
Properties of the BLAKE hash functions.

| Algorithm | Word size $\boldsymbol{w}$ <br> [bits] | Message <br> [bits] | Block <br> [bits] | Digest | Sbits] | Sbits] | \# rounds |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | ---: | ---: |
| $\boldsymbol{N}_{\boldsymbol{r}}$ | Rotation distances |  |  |  |  |  |  |  |  |  |
| $\boldsymbol{\delta}_{\mathbf{0}}$ | $\boldsymbol{\delta}_{\mathbf{1}}$ | $\boldsymbol{\delta}_{\mathbf{2}}$ | $\boldsymbol{\delta}_{\mathbf{3}}$ |  |  |  |  |  |  |  |
| BLAKE-224 | 32 | $<2^{64}$ | 512 | 224 | 128 | 14 | 16 | 12 | 8 | 7 |
| BLAKE-256 | 32 | $<2^{64}$ | 512 | 256 | 128 | 14 | 16 | 12 | 8 | 7 |
| BLAKE-384 | 64 | $<2^{128}$ | 1024 | 384 | 256 | 16 | 32 | 25 | 16 | 11 |
| BLAKE-512 | 64 | $<2^{128}$ | 1024 | 512 | 256 | 16 | 32 | 25 | 16 | 11 |

Table IV
Permutations of $\{0, \ldots, 15\}$ USED by the BLAKE Functions (REPRINTED FROM |1]).

| $i$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| $\sigma_{0}(i)$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| $\sigma_{1}(i)$ | 14 | 10 | 4 | 8 | 9 | 15 | 13 | 6 | 1 | 12 | 0 | 2 | 11 | 7 | 5 | 3 |
| $\sigma_{2}(i)$ | 11 | 8 | 12 | 0 | 5 | 2 | 15 | 13 | 10 | 14 | 3 | 6 | 7 | 1 | 9 | 4 |
| $\sigma_{3}(i)$ | 7 | 9 | 3 | 1 | 13 | 12 | 11 | 14 | 2 | 6 | 5 | 10 | 4 | 0 | 15 | 8 |
| $\sigma_{4}(i)$ | 9 | 0 | 5 | 7 | 2 | 4 | 10 | 15 | 14 | 1 | 11 | 12 | 6 | 8 | 3 | 13 |
| $\sigma_{5}(i)$ | 2 | 12 | 6 | 10 | 0 | 11 | 8 | 3 | 4 | 13 | 7 | 5 | 15 | 14 | 1 | 9 |
| $\sigma_{6}(i)$ | 12 | 5 | 1 | 15 | 14 | 13 | 4 | 10 | 0 | 7 | 6 | 3 | 9 | 2 | 8 | 11 |
| $\sigma_{7}(i)$ | 13 | 11 | 7 | 14 | 12 | 1 | 3 | 9 | 5 | 0 | 15 | 4 | 8 | 6 | 2 | 10 |
| $\sigma_{8}(i)$ | 6 | 15 | 14 | 9 | 11 | 3 | 0 | 8 | 12 | 2 | 13 | 7 | 1 | 4 | 10 | 5 |
| $\sigma_{9}(i)$ | 10 | 2 | 8 | 4 | 7 | 6 | 1 | 5 | 15 | 11 | 9 | 14 | 3 | 12 | 13 | 0 |

- append a bit 1 followed by a sufficient number of 0 bits such that the length is congruent to $b-2 w-1$ modulo $b$;
- a padding bit followed by the $2 w$-bit unsigned big-endian representation of $\ell$ is then added; in the case of BLAKE256 and BLAKE-512, the padding bit is equal to 1 ; otherwise, it is set to 0 .
The hash can now be computed iteratively (Algorithm 6): the padded message is divided into $N 16$-word blocks $m^{(0)}, \ldots, m^{(N-1)}$ and the chain value $h^{(0)}$ is set to the same initial value as SHA- $n$. The counter $t^{(i)}$ denotes the number of message bits in $m^{(0)}, \ldots, m^{(i)}$ (i.e. excluding padding bits). Note that, if the last block contains only padding bits, then $t^{(N-1)}$ is set to zero. The message digest consists of the $n$ least significant bits of the output $h^{(N)}$.

```
Algorithm 6 Iterated hash.
Input: A padded message split into \(N 16\)-word blocks and a
    salt \(s\).
Output: A \(n\)-bit digest.
```

```
    ( (ho (0)},\ldots,\mp@subsup{h}{7}{(0)})\leftarrow(\mp@subsup{\textrm{IV}}{0}{},\ldots,\mp@subsup{\textrm{IV}}{7}{})
```

    ( (ho (0)},\ldots,\mp@subsup{h}{7}{(0)})\leftarrow(\mp@subsup{\textrm{IV}}{0}{},\ldots,\mp@subsup{\textrm{IV}}{7}{})
    for }i\leftarrow0\mathrm{ to N-1 do
    for }i\leftarrow0\mathrm{ to N-1 do
        h(i+1)}\leftarrow\mathrm{ compress ( }\mp@subsup{h}{}{(i)},\mp@subsup{m}{}{(i)},s,\mp@subsup{t}{}{(i)})
        h(i+1)}\leftarrow\mathrm{ compress ( }\mp@subsup{h}{}{(i)},\mp@subsup{m}{}{(i)},s,\mp@subsup{t}{}{(i)})
    end for
    end for
    return }\mp@subsup{h}{}{(N)}\mathrm{ ;
    ```
    return }\mp@subsup{h}{}{(N)}\mathrm{ ;
```


## VI. Hardware Implementation

All of our architectures consist of a register file organized into $w$-bit words and implemented by means of dual-ported memory, an ALU, and a control unit (Figure 4). The user loads messages, plaintext blocks or ciphertext blocks into
port A. A few control bits allows her to select the algorithm and the desired level of security. When the coprocessors are hashing or encrypting a message, the intermediates results are always written to port $B$. In the following, we assume that our coprocessor is provided with padded messages. A hardware wrapper interface for BLAKE, Skein, and several other hash functions comprising communication and padding is described in [14].


Figure 4. General architecture of our coprocessors.
We follow here the design strategy outlined in [6], [7], [1]][13]. The first step consists in defining the minimal instruction set to implement a block cipher and a hash function. Then, an in-depth study of the scheduling allows us to build the ALU and organize the data in the register file. During this step, we

- try to minimize the number of control bits to keep the instruction memory as compact as possible;
- take advantage of FPGA specifics to optimize the slice count;
- identify the available parallelism and pipeline the ALU accordingly.
Eventually, we design the control unit. The instruction memory is automatically generated by a C program. In order to keep the instruction ROM as compact as possible, our C program is able to compress the code, and to generate the VHDL description of the decompression unit.


## A. Arithmetic and Logic Units for Threefish and Skein

Our first ALU implements Threefish encryption and Skein. In the following, $\mathrm{R} i$ denotes a 64 -bit register. Figure 5 illustrates our scheduling of the two mixing functions Mix ${ }_{4,0}$ and Mix $_{4,1}$ of the fifth round of Threefish-256:

- The operand $e_{4,1}$ is loaded in register R1; at the same time, we start the computation of $e_{4,1} \lll R_{4,0}$; this operation requires three clock cycles and intermediate results are stored in R4, R5, and R6.
- Then, $e_{4,0}$ is loaded in register R2; the content of R1 is not modified (i.e. R1 must be controlled by an enable signal).
- We execute the instruction $\mathrm{R} 3 \leftarrow \mathrm{R} 1 \boxplus \mathrm{R} 2$ and obtain $f_{4,0}$.
- R3 and R6 contain $f_{4,0}$ and $e_{4,1} \lll R_{4,0}$, respectively. The instruction $\mathrm{R} 3 \leftarrow \mathrm{R} 3 \oplus \mathrm{R} 6$ allows us to compute $f_{4,1}$.


Figure 5. Computation of $\operatorname{Mix}_{4,0}$ and $\operatorname{Mix}_{4,1}$ (Threefish-256).

We schedule $\mathrm{Mix}_{4,1}$ as soon as $e_{4,0}$ has been read, and manage to keep the pipeline continuously busy. In summary, our ALU must be able to carry out any rotation of a 64 -bit word and to perform the following operation (Figure 6):

$$
\mathrm{R} 3 \leftarrow \begin{cases}\mathrm{R} 1 \boxplus \mathrm{R} 2 \quad & \text { when } \mathrm{ctrl}_{10}=0  \tag{1}\\ \mathrm{R} 3 \oplus \mathrm{R} 6 & \text { otherwise }\end{cases}
$$

where $\operatorname{ctrl}_{10}$ denotes a control bit. Let us define two 64 -bit operands $a$ and $b$ such that:

$$
(a, b)= \begin{cases}(\mathrm{R} 1, \mathrm{R} 2) & \text { when } \mathrm{ctrl}_{10}=0 \\ (\mathrm{R} 3, \mathrm{R} 6) & \text { otherwise }\end{cases}
$$

It is well-known that $a \boxplus b=(a \vee b) \boxplus(a \wedge b)$ and $a \oplus b=$ $(a \vee b) \boxminus(a \wedge b)$ [15]. Thus, Equation (1) can be rewritten as follows:

$$
\begin{equation*}
\mathrm{R} 3 \leftarrow(a \vee b) \boxplus\left((a \wedge b) \oplus \operatorname{ctrl}_{10}\right) \boxplus \operatorname{ctrl}_{10} . \tag{2}
\end{equation*}
$$

Figure 7 describes the implementation of Equation (2) on a Virtex-6 device. Since there is a single control signal to choose the arithmetic operation and to select $a$ and $b$, Equation (2) involves only five variables, and is advantageously implemented by 64 LUT6_2 primitives and dedicated carry logic.


Figure 6. Arithmetic and logic unit for Threefish encryption.


Figure 7. Computation of $\mathrm{R} 3 \leftarrow \mathrm{R} 1 \boxplus \mathrm{R} 2$ or $\mathrm{R} 3 \leftarrow \mathrm{R} 3 \oplus \mathrm{R} 6$ on a Virtex- 6 device.

In order to reduce the number of operands stored in the register file, we interleave the key schedule (Algorithm 1) and the encryption process (Algorithm 22). This approach allows us to generate the subkeys on-the-fly. It is however necessary to compute $t_{2}$ and $k_{N_{w}}$ before the first key injection. The easiest way to compute $t_{2}$ would be to load $t_{0}$ and $t_{1}$ in registers R1 and R2, respectively, and to execute the instruction R3 $\leftarrow$ $\mathrm{R} 1 \oplus \mathrm{R} 2$. Unfortunately, this solution requires one more control bit to select the inputs of the arithmetic operator, and it is not possible to implement the multiplexers and the adder on the same LUT6_2 primitive anymore. Since the critical path of our coprocessor is located in the 64 -bit adder, an extra level of LUTs would decrease the clock frequency. However, we are able to compute $t_{2}$ using only the functionalities defined by Equation (1). Since $t_{2}=\left(t_{0} \boxplus 0\right) \oplus\left(t_{1} \lll 0\right)$, it suffices to execute the following instructions:

$$
\begin{array}{lll}
\mathrm{R} 4 \leftarrow t_{1} \lll 0, & \mathrm{R} 2 \leftarrow 0, & \mathrm{R} 5 \leftarrow \mathrm{R} 4 \lll 0, \\
\mathrm{R} 1 \leftarrow t_{0}, & \mathrm{R} 6 \leftarrow \mathrm{R} 5 \lll 0, & \\
\mathrm{R} 3 \leftarrow \mathrm{R} 1 \oplus \mathrm{R} 2, & \\
\mathrm{R} 3 \leftarrow \mathrm{R} 3 \oplus \mathrm{R} 6 . & &
\end{array}
$$

This approach assumes that we can read simultaneously two values from the register file. Thanks to the multiplexer controlled by $\operatorname{ctrl}_{7}$, we can load data from port A or port B into register $\mathrm{R}_{2}$ (Figure 6). A similar strategy allows us to compute $k_{N_{w}}$.

The implementation of the key injection is more straightforward. Note that the multiplexers controlled by $\operatorname{ctrl}_{6}$ and $\operatorname{ctrl}_{8}$ allow us to bypass the register file and to use the content of R3 as an input to the ALU. Let us consider for instance the first key injection of Threefish-256: $e_{0,2}$ is defined as $p_{2} \boxplus k_{0,2}=p_{2} \boxplus k_{2} \boxplus t_{1}$ and is computed as follows:

$$
\begin{array}{ll}
\mathrm{R} 1 \leftarrow k_{2}, & \mathrm{R} 2 \leftarrow t_{1}, \\
\mathrm{R} 3 \leftarrow \mathrm{R} 1 \oplus \mathrm{R} 2 & \\
\mathrm{R} 1 \leftarrow \mathrm{R} 3, & \mathrm{R} 2 \leftarrow p_{2}, \\
\mathrm{R} 3 \leftarrow \mathrm{R} 1 \oplus \mathrm{R} 2 . &
\end{array}
$$

Figures 8 and 9 describe how we schedule the instructions of Threefish-256.

The UBI chaining mode can be combined with the final key injection of Threefish encryption. It suffices to modify line 21 of Algorithm 2 as follows:

$$
\begin{aligned}
e_{N_{r}, i} & \leftarrow v_{N_{r}, i} \boxplus k_{N_{r} / 4, i} \\
c_{i} & \leftarrow e_{N_{r}, i} \oplus p_{i} .
\end{aligned}
$$

The only difference between this operation and the mixing function MIX ${ }_{d, j}$ is that no permutation is applied to the second operand of the bitwise exclusive OR.

The inverse of the MIX function being purely sequential, Threefish decryption has less parallelism than encryption. We suggest to modify our ALU as follows to fully support both encryption and decryption (Figure 10):

- The inverse of the Mix function and the inverse of the key injection require a subtraction modulo $2^{64}$. Our modified

ALU is therefore able to perform a new operation: R3 $\leftarrow$ R1日R2. Because of the additional control bit required to select the operation, it is not possible to implement our arithmetic operator by means of 64 LUT6_2 anymore. Thus, the slice count and the critical path are expected to increase.

- The output of the inverse Mix function is provided either by the arithmetic operator (e.g. $e_{4,0}$ on Figure 2) or the rotation unit (e.g. $e_{4,1}$ on Figure 2). The multiplexer controlled by $\operatorname{ctrl}_{17}$ allows us to select the word we store in the register file.
- Since the inverse of the Mix function is sequential, we have to perform the rotation in a single clock cycle. We suggest to take advantage of the SRL16E primitive available on Xilinx devices to implement a FIFO whose depth is dynamically adjusted according to the algorithm selected by the user: one and three stages for decryption and encryption, respectively.


Figure 10. Arithmetic and logic unit for Threefish encryption and decryption.

## B. Arithmetic and Logic Units for BLAKE and ChaCha

Let us consider the $G_{i}$ function of BLAKE- $n$ to define the instruction set of our coprocessors. Since we focus on compact coprocessors for the BLAKE family in this article, we perform a single step of Algorithm 55 at each clock cycle. We will show later that the input operand $b$ is already stored in an internal register of our ALU when we start the computation of $G_{i}(a, b, c, d)$. Therefore, each operation involves the result of the previous one, and our ALU will include a feedback mechanism to bypass the register file of the coprocessor.

Assume that the $w$-bit word computed by the ALU is stored in register R 5 , and denote by $\mathrm{RF}_{A}$ and $\mathrm{RF}_{B}$ the operands provided by the register file. From the data flow diagram of Algorithm 55 we easily identify three operations (Figure 11):

1) Save the content of R5 in the register file and compute $\mathrm{R} 5 \leftarrow \mathrm{R} 5 \boxplus \mathrm{RF}_{A}$.
2) Compute $\mathrm{R} 5 \leftarrow \mathrm{R} 5 \boxplus\left(\mathrm{RF}_{A} \oplus \mathrm{RF}_{B}\right)$.
3) Save the content of R 5 in the register file and compute $\mathrm{R} 5 \leftarrow\left(\mathrm{R} 5 \oplus \mathrm{RF}_{A}\right) \ggg \delta_{j}$.


Figure 8. Scheduling of Threefish-256 encryption. @ d denotes the address of the 64 -bit word $d$ in the register file. "Rename" and "Permute" refer to lines 9 and 17 of Algorithm 2 respectively.

Recall now that the four calls to $G_{i}$ in a column step or a diagonal step can be computed in parallel. In order to keep the critical path as short as possible, we suggest to design an ALU with four pipeline stages and to interleave the computation of four $G_{i}$ functions (Figure 12). The heart of the ALU is the arithmetic operator performing the addition or the bitwise XOR of two $w$-bit words described in Section VI-A. Our operator computes:

$$
\begin{aligned}
\mathrm{R} 3 & \leftarrow \begin{cases}\mathrm{R} 1 \boxplus \mathrm{R} 2 \quad \text { when } \mathrm{ctrl}_{2}=0 \\
\mathrm{R} 1 \oplus \mathrm{R} 2 & \text { otherwise }\end{cases} \\
& =(\mathrm{R} 1 \vee \mathrm{R} 2) \boxplus\left((\mathrm{R} 1 \wedge \mathrm{R} 2) \oplus \mathrm{ctr}_{2}\right) \boxplus \mathrm{ctrl}_{2},
\end{aligned}
$$

where

- R1 stores the data provided by the register file. Since a flip-flop is always associated with a LUT, we can perform some simple pre-processing without increasing the number of slices of the ALU: a control bit $\operatorname{ctrl}_{0}$ selects either $\mathrm{RF}_{A}$ or $\mathrm{RF}_{A} \oplus \mathrm{RF}_{B}$. This allows us to compute $m_{\sigma_{r}(2 i)} \oplus c_{\sigma_{r}(2 i+1)}$ and $m_{\sigma_{r}(2 i+1)} \oplus c_{\sigma_{r}(2 i)}$ for free (Algorithm 5 lines 2 and 7 ).
- R2 almost always stores the result of a previous operation. However, we have to disable the feedback mechanism during the initialization step: the computation of $v_{8} \leftarrow$ $s_{0} \oplus c_{0}$ involves for instance only two words stored in the register file. An array of AND gates controlled by $\operatorname{ctrl}_{1}$ allows us to force the second operand to zero in such cases.
If needed, the content of register R3 is then rotated to the left in two steps. Our implementation is based on the following observation:

$$
\mathrm{R} 3 \ggg \delta_{i}=\left(\mathrm{R} 3 \ggg\left(\delta_{i}-\delta_{3}\right)\right) \ggg \delta_{3},
$$

where $0 \leq i \leq 3$. At first glance, this design choice may look awkward. However, it will allow us to easily build a unified processor for the BLAKE family. The key point is that the content of R3 is copied into R5 when the three control bits $\mathrm{ctrl}_{5: 3}$ are equal to 0 (Figure (12).
Note that the pipeline has three possible configurations, denoted by (1), (2), and (3) in Figure 12.
(1) In order to minimize the area of our ALU, we can insert


Figure 9. Scheduling of Threefish-256 decryption.
a $w$-bit register after the first stage of the rotation. Since the latter involves $w$ LUTs, there is no hardware overhead on a Virtex-6 device.
(2) The addition modulo $2^{w}$ can be computed in two clock cycles. Let $a=a_{\text {low }}+2^{\frac{w}{2}} a_{\text {high }}$ and $b=b_{\text {low }}+2^{\frac{w}{2}} b_{\text {high }}$. We store $a_{\text {high }}$ and $b_{\text {high }}$ in two $\frac{w}{2}$-bit registers, and compute a sum word $s_{\text {low }}$ and a carry bit such that $2^{\frac{w}{2}} c+s_{\text {low }}=$ $a_{\text {low }}+b_{\text {low }}$. A flip-flop and a $\frac{w}{2}$-bit register store $c$ and $s_{\text {low }}$, respectively. The most significant bits of the sum are then given by $s_{\text {high }}=a_{\text {high }}+b_{\text {high }}+c$. This approach allows us to reduce the worst-case carry path at the price of three $\frac{w}{2}$-bit registers and a flip-flop.
(3) Routing a signal from a memory block to a slice is sometimes expensive in terms of wire delay. If the critical path is located between the register file and register R1, this pipeline configuration will help boosting the clock frequency. The output data path of a Virtex-6 memory block has an optional internal pipeline register. Therefore, the only hardware overhead is the $w$-bit register between R5 and the array of AND gates controlled by $\mathrm{ctrl}_{1}$.

In order to avoid pipeline bubbles between column and
diagonal steps, it suffices to process the four calls to $G_{i}$ of the diagonal step in the following order: $G_{7}, G_{4}, G_{5}$, and $G_{6}$. We check for instance that the ALU outputs the new value of $v_{4}$ (last instruction of $G_{0}$ ) at time $\tau+3$. If we load $v_{3}$ from the register file, we can start the computation of $G_{7}$ at time $\tau+4$ (Figure 13). We easily check that this scheduling also avoids pipeline bubbles between a diagonal step and a column step (Figure 14). Since each call to $G_{i}$ involves ten instructions, we need 80 clock cycles to perform a round of BLAKE- $n$.

Our first architecture can be modified to support the four algorithms of the BLAKE family (Figure 15). The 64 -bit datapath is built out of two 32-bit datapaths, thus allowing us to perform a single 64 -bit operation or two 32 -bit operations at each clock cycle. The mode of operation is selected according to an additional control bit $\operatorname{ctrl}_{6}$, the latter being provided by the user. The ALU includes two 32-bit adders. Let $a_{\text {low }}, a_{\text {high }}, b_{\text {low }}, b_{\text {high }}, s_{\text {low }}$, and $s_{\text {high }}$ denote unsigned 32 -bit integers. When the user chooses BLAKE-224 or BLAKE-256 $\left(\operatorname{ctrl}_{6}=0\right)$, two messages are processed in parallel and the


Figure 11. Implementation of the $G_{i}$ function of BLAKE- $n$ by means of three instructions. R 5 denotes an internal register of the ALU.


Figure 12. Arithmetic and Logic Unit for BLAKE-n.

ALU performs two 32-bit additions:

$$
\begin{aligned}
s_{\text {low }} & \leftarrow a_{\text {low }}+b_{\text {low }}+\operatorname{ctr}_{2} \\
s_{\text {high }} & \leftarrow a_{\text {high }}+b_{\text {high }}+\operatorname{ctrl}_{2}
\end{aligned}
$$

When the coprocessor executes BLAKE-384 or BLAKE-512 $\left(\operatorname{ctrl}_{6}=1\right)$, the ALU carries out a 64 -bit addition. The first adder generates the least significant bits of the sum and a carry bit $c$ such that:

$$
2^{32} \cdot c+s_{\text {low }}=a_{\text {low }}+b_{\text {low }}+\operatorname{ctrl}_{2}
$$

The second adder computes the most significant bits of the sum:

$$
s_{\text {high }} \leftarrow a_{\text {high }}+b_{\text {high }}+c
$$

We use the rotation unit of our first processor to deal with BLAKE-224 and BLAKE-256. Note that the content of R3 is always copied into R 6 when $\operatorname{ctrl}_{5: 3}=(000)_{2}$. Thus, we share this datapath between all algorithms of the BLAKE family, and need only 64 LUTs to implement the rotation unit of BLAKE-384 and BLAKE-512. When $\mathrm{ctrl}_{5}$ is equal to one, $\operatorname{ctrl}_{4: 3}$ encodes the index $i$ of the rotation distance $\delta_{i}$ (Table $V$ ). Consequently, we can use the same instruction flow for all algorithms and select the width of the datapath according to $\operatorname{ctrl}_{6}$. Note that the three pipeline configurations defined for our first coprocessor are also available here.

The QUARTERROUND function of ChaCha requires only two of the instructions we defined for the $G_{i}$ function. Thus,


Figure 13. Avoiding pipeline bubbles between a column step and a diagonal step.


Figure 14. Avoiding pipeline bubbles between a diagonal step and a column step.

Table V
Rotation distances of the unified BLAKE COprocessor.

| ctrl $_{5: 3}$ | Rot. dist. | BLAKE-224/256 | BLAKE-384/512 |
| :---: | :---: | :---: | :--- |
| $(000)_{2}$ | 0 | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3}($ common datapath $)$ |  |
| $(100)_{2}$ | $\delta_{0}$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 7$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 11$ |
| $(101)_{2}$ | $\delta_{1}$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 8$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 16$ |
| $(110)_{2}$ | $\delta_{2}$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 12$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 25$ |
| $(111)_{2}$ | $\delta_{3}$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 16$ | $\mathrm{R}_{6} \leftarrow \mathrm{R}_{3} \ggg 32$ |

the design of a ChaCha coprocessor is rather straightforward (Figure 17). Since it is not necessary to compute $\mathrm{RF}_{A} \oplus \mathrm{RF}_{B}$ anymore, the ALU has a single 32 -bit input. The only difficulty is to increment the 64 -bit counter $2^{32} m_{13}+m_{12}$ (Algorithm 3, lines 17 to 20. Assume that the constant 1 is stored in the register file. The control bit $\operatorname{ctrl}_{0}$ allows us to disable the feedback mechanism and to load the constant 0 in register R2. Execute the following instructions:

$$
\begin{array}{ll}
\mathrm{R} 1 \leftarrow 1 & \mathrm{R} 2 \leftarrow 0 \\
\mathrm{R} 3 \leftarrow \mathrm{R} 1 \oplus \mathrm{R} 2, & \\
\mathrm{R} 4 \leftarrow \mathrm{R} 3, & \\
\mathrm{R} 5 \leftarrow \mathrm{R} 4, & \\
\mathrm{R} 1 \leftarrow m_{12} & \mathrm{R} 2 \leftarrow \mathrm{R} 5 .
\end{array}
$$

Registers R1 and R2 store $m_{12}$ and the constant 1, respectively.

Note that the output carry of the 32-bit adder can now be stored in a flip-flop F. Furthermore, when $\mathrm{ctrl}_{2}$ is set to one, our ALU performs an "add with carry" instruction. We can now compute $m_{12}+1$, save the output carry in $F$, and increment $m_{13}$ if necessary:

$$
\begin{array}{rlrl}
\quad(\mathrm{F}, \mathrm{R} 3) & \leftarrow \mathrm{R} 1 \boxplus \mathrm{R} 2 & & \mathrm{R} 1 \leftarrow m_{13} \\
\mathrm{R} 3 & \leftarrow \mathrm{R} 2 \leftarrow 0, \\
\text { Register file } & \leftarrow \mathrm{R} 4\left(m_{12}\right) & & \mathrm{R} 4 \leftarrow \mathrm{~F} 4, \\
\text { Register file } & \leftarrow \mathrm{R} 4\left(m_{13}\right) . & &
\end{array}
$$

Three pipeline configurations are again available. The second one needs specific attention: since the adder is pipelined, the computation of $m_{12} \boxplus 1$ requires two clock cycles. It is therefore mandatory to introduce a NOP before loading $m_{13}$ into register R1.

It is of course possible to build a unified coprocessor for ChaCha and the BLAKE family (Figure 17). A new control bit $\operatorname{ctrl}_{7}$ allows the user to select the mode of operation of the ALU: encryption or hashing. Since the coprocessor has a 64-bit datapath to support BLAKE-384 and BLAKE-512, it is possible to encrypt two messages in parallel with ChaCha.


Figure 15. Unified Arithmetic and Logic Unit for the BLAKE family.


Figure 16. Arithmetic and Logic Unit for ChaCha.

## C. Register Files and Control Units

We will consider our unified coprocessor for the BLAKE and ChaCha algorithms to describe how we design our control units. The same approach can easily be applied to the other coprocessors considered in this work. Virtex-6 FPGAs embed several configurable memory blocks that can for instance store 102436 -bit words or 2048 18-bit words. Our control unit mainly consists of a program counter that addresses an instruction memory implemented by means of a memory block.
A straightforward way to deal with the permutations involved in the BLAKE family is to unroll the round loop.

Table VI
Number of instructions of the algorithms of the BLAKE and ChaCha families.

| Algorithm | \# instructions |
| :---: | :---: |
| BLAKE-224/256 | 1184 |
| BLAKE-384/512 | 1344 |
| 8-round ChaCha | 311 |
| 12-round ChaCha | 439 |
| 20-round ChaCha | 695 |

Table VI summarizes the number of instructions required by the algorithms supported by our coprocessor if we follow this


Figure 17. Unified Arithmetic and Logic Unit for the BLAKE and ChaCha families.
approach. Note that it suffices to store the code of BLAKE$384 / 512$ and 20 -round ChaCha (2039 instructions): a simple finite-state machine allows us to jump to the finalization step when the desired number of rounds has been performed ${ }^{1}$. The main challenge is therefore to define control words of at most 18 bits in order to implement our instruction memory by means of a single memory block. A clever organization of the register file (Figure 18) and a simple compression algorithm allows us to achieve this goal. Two blocks of dual-ported memory configured as 256 entries of 32 bits store the message, the chaining value, the constants, and all the intermediate variables of BLAKE and ChaCha. Thus, our coprocessor requires 26 control bits (Figure 19a):

- 8 address bits and a write enable signal for port A of the register file;
- 8 address bits and a write enable signal for port B of the register file;
- 8 control bits for the ALU.

Two control bits are provided by the user: ctrl $_{7}$ allows her to select between BLAKE and ChaCha, and $\operatorname{ctrl}_{6}$ specifies the configuration of the datapath ( $2 \times 32$ bits or 64 bits). Our

[^0]organization of the data in the register file enables us to define a 20-bit instruction:

- The most significant address bit depends on the algorithm being executed, and is therefore provided by the user.
- We use ports A and B to load new data (message, salt, and counter) and save the intermediate variables computed by the ALU, respectively. Consequently, the write enable signal of port A is also given by the user.
- Let us denote by $a_{7: 0}$ the eight address bits of ports A. Note that $a_{6}$ is equal to one only when we read an initial vector and assume that the digest size is selected according to an additional control bit ctrl $_{8}$. The address bit $a_{5}$ is computed as follows:

$$
a_{5} \leftarrow \begin{cases}a_{5} & \text { when } a_{6}=0 \\ \operatorname{ctrl}_{8} & \text { otherwise }\end{cases}
$$

Thanks to this simple mechanism, the instruction flow does not depend on the digest size. Initial vectors are always read from port A.

- Since the initial vectors are neither modified nor read from port B , the second most significant address bit is always equal to zero.
Consequently, we can store 20 -bit words in the instruction memory (Figure 19p). We designed a simple compression algorithm to encode the write enable signal of port B and the six control bits $\operatorname{ctrl}_{5: 0}$ by means of five bits. A C program


Figure 18. Register file of the unified coprocessor for the BLAKE and ChaCha families.

| Port A |  |  | Port B |  |  |  | Arithmetic and logic unit |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\mathrm{ctrr}_{6}$ | $\hat{A d d r a}_{A}(8$ bits) | $W_{\text {e }}$ A |  | 0 | $\hat{A d d r ~}_{B}(8$ bits) | $W_{e_{B}}$ | $\mathrm{ctrrl}_{7}$ | $\mathrm{ctrl}_{6}$ | $\mathrm{Ctrr}_{5}$ | $\mathrm{ctrr}_{4}$ | $\mathrm{Ctrl}_{3}$ | $c_{\text {ctrl }} \mid$ | $c_{\text {ctrl }}$ |  |

(a) Address and control bits of our unified coprocessor for BLAKE and ChaCha


Figure 19. From 26- to 18 -bit instructions. Shaded cells denote control bits provided by the user.
generates the content of the instruction memory and the VHDL description of the decompression circuit. The latter involves only seven 5 -input LUTs, and stores the control bits of the ALU and the write enable signal of port B in a register. Because of this pipeline stage, it is necessary to generate the write enable signal one clock cycle in advance when we have to store a word in the register file. Our C program takes this parameter into account and organizes the control bits in the instruction memory according to the pipeline configuration. Then, it generates the compressed instruction memory. Figure 20 describes the instruction flow for the first pipeline configuration of our coprocessor:

- As explained above, the write enable signal is generated one clock cycle in advance to take the internal pipeline stage of the decompression unit into account.
- All inputs of the register file are registered, and the two control bits $\operatorname{ctrl}_{0}$ and $\operatorname{ctrl}_{1}$ must therefore be generated one clock cycle after the addresses. We take advantage of the latency of our decompression unit to synchronize the control signals.
We followed the same approach to build our control units for Threefish and Skein. The register file is organized into 64bit words, and stores a plaintext block, an internal state ( $e_{d, i}$, where $0 \leq i \leq N_{w}-1$ ), an extended block cipher key, an extended tweak, the constant $C_{240}$, and all possible values of $s$ involved in the key schedule (Figure 21). Thanks to this
approach, the word permutation $\pi(i)$ and the word rotation of the key schedule are conveniently implemented by addressing the register file accordingly. Since the round constants repeat every eight rounds (Algorithm 2, line 14, we decided to unroll eight iterations of the main loop of Threefish (Algorithm 2, lines 4 to 19 . The rotation constants $R_{d, i}$ are included in the microcode executed by the control unit. Note that our register file is designed for Threefish-1024 (i.e. $N_{w}=16$ and $N_{r}=80$ ). It is therefore straightforward to implement the two other variants of the algorithm on our architecture. The number of clock cycles required for Threefish encryption and decryption according to the key size is summarized in Table VII. Because of the output transform, $k+1$ invocations of the tweakable encryption function are necessary to hash a $k$-block message with Skein. There is a latency of 5 clock cycles between two consecutive Threefish encryption. Thus, the throughput of Skein is given by:
throughput $=\frac{8 \cdot N_{b} \cdot k \cdot f}{(k+1) \cdot \text { latency of Threefish with UBI }+5 \cdot k}$,
where $f$ denotes the clock frequency.


## VII. Results and Comparisons

We captured our architecture in the VHDL language and prototyped our coprocessors on a Xilinx Virtex-6 FPGA with average speedgrade. Tables VIII and IX summarize our place-and-route results measured with ISE 14.2. Note that we


Figure 20. Generation of the compressed instruction memory.


Figure 21. Register file of our Threefish and Skein architectures.

Table VII
Number of instructions of the algorithms of the Threefish FAMILY.

| Algorithm | \# instructions |
| :---: | :---: |
| Threefish-256 encryption | 490 |
| Threefish-256 encryption with UBI | 501 |
| Threefish-256 decryption | 469 |
| Threefish-512 encryption | 860 |
| Threefish-512 encryption with UBI | 882 |
| Threefish-512 decryption | 1092 |
| Threefish-1024 encryption | 1874 |
| Threefish-1024 encryption with UBI | 1912 |
| Threefish-1024 decryption | 2351 |

considered the least favorable case, where the message consists of a single block, to compute the throughput of Skein.

Most of the architectures described in the open literature focus on a single level of security (Table X]. We took advantage of the intrinsic parallelism of BLAKE to interleave the computation of four instances of the $G_{i}$ function. Thanks to this approach, we designed an ALU with four pipeline stages and achieved higher clock speeds than the coprocessors listed in Table X A careful scheduling allowed us to totally avoid pipeline bubbles and memory collisions. We also addressed FPGA-specific issues and described how to share slices between addition and bitwise exclusive OR of two operands. We followed the same strategy to design our coprocessors for Threefish and Skein. As a consequence, our coprocessors provide the end-user with hashing and encryption at all levels
of security, while offering a better area-time trade-off.


Figure 22. Compact implementations of several cryptographic hash functions on Virtex-6 FPGAs (512-bit digests).

We report in Figure 22 the latest lightweight implementation results of several cryptographic hash functions. Besides our coprocessors for BLAKE-512 and Skein-512-512, we selected Grøstl [12], JH [5], SHA-2-512 [20], and SHA-3-512 (Keccak $[r=1024, c=576]$ ) [11]. In this context, BLAKE is obviously the best choice for lightweight implementations on FPGA. Since our unified architecture for the BLAKE family (Figure 15) requires less than 100 Virtex-6 slices, BLAKE is also an excellent candidate for cryptographic coprocessors supporting several levels of security.
We already proposed lightweight implementations of ECHO \& AES [13] and Grøstl \& AES ${ }^{2}$ [12] (Table XI). According to our results, the unified coprocessor for BLAKE and ChaCha offers the best area-time trade-off. However, given that all symmetric cryptographic functions (including authenticated encryption) can be efficiently implemented with Keccak, we would get the following figures with a unified architecture based on [11]:
${ }^{2}$ Note that Järvinen 21 proposed the first unified coprocessor for AES128 (encryption and key expansion) and Grøstl-256. Recently, Rogawski \& Gaj [22] designed a parallel coprocessor for Grøstl-based HMAC and AES in the counter mode. Both architectures are optimized for high-speed implementations, and it is therefore difficult to make a comparison with our lightweight coprocessors.

Table VIII
Place-and-route results for our Threefish and Skein coprocessors on a Virtex-6 FPGA (Xc6vlx75t-2). All designs require three MEMORY BLOCKS.

| Supported Algorithms | Area [Slices] | Freq. [MHz] | Throughput [Mbits/s] |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  | Skein-256-256 | Skein-512-512 | Threefish-256 |  | Threefish-512 |  | Threefish-1024 |  |
|  |  |  |  |  | Enc. | Dec. | Enc. | Dec. | Enc. | Dec. |
| Threefish encryption | 145 | 294 | - | - | 153 | - | 175 | - | 160 | - |
| Threefish | 277 | 267 | - | - | 139 | 145 | 158 | 125 | 145 | 116 |
| Skein and Threefish encryption | 150 | 295 | 75 | 85 | 154 | - | 175 | - | 161 | - |
| Skein and Threefish | 292 | 279 | 70 | 80 | 145 | 152 | 166 | 130 | 152 | 121 |

Table IX
Place-and-route results for our ChaCha and BLAKE coprocessors on a Virtex-6 FPGA (Xc6vlx75t-2).

| Supported algorithms | Pipeline config. | Area [slices] | \# block <br> RAMs | Freq. <br> [MHz] | Throughput [Mbits/s] |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  | $\begin{aligned} & \text { BLAKE-224 } \\ & \text { and } \\ & \text { BLAKE-256 } \end{aligned}$ | $\begin{aligned} & \text { BLAKE-384 } \\ & \text { and } \\ & \text { BLAKE-512 } \end{aligned}$ | 8-round ChaCha | 12-round ChaCha | 20-round ChaCha |
| ChaCha | (1) | 49 | 2 | 362 | - | - | 595 | 422 | 266 |
|  | (2) | 77 | 2 | 316 | - | - | 520 | 368 | 232 |
|  | (3) | 77 | 2 | 345 | - | - | 569 | 403 | 254 |
| BLAKE-224 and <br> BLAKE-256 | (1) | 47 | 2 | 338 | 146 | - | - | - | - |
|  | (2) | 49 | 2 | 341 | 147 | - | - | - | - |
|  | (3) | 50 | 2 | 349 | 150 | - | - | - | - |
| BLAKE-384 and BLAKE-512 | (1) | 79 | 3 | 331 | - | 252 | - | - | - |
|  | (2) | 91 | 3 | 331 | - | 252 | - | - | - |
|  | (3) | 91 | 3 | 329 | - | 250 | - | - | - |
| BLAKE <br> (all levels of security) | (1) | 94 | 3 | 312 | $2 \times 134$ | 237 | - | - | - |
|  | (2) | 126 | 3 | 332 | $2 \times 143$ | 252 | - | - | - |
|  | (3) | 129 | 3 | 343 | $2 \times 148$ | 261 | - | - | - |
| BLAKE and ChaCha (all levels of security) | (1) | 144 | 3 | 335 | $2 \times 144$ | 255 | $2 \times 551$ | $2 \times 390$ | $2 \times 246$ |
|  | (2) | 156 | 3 | 289 | $2 \times 124$ | 220 | $2 \times 475$ | $2 \times 337$ | $2 \times 212$ |
|  | (3) | 168 | 3 | 304 | $2 \times 131$ | 231 | $2 \times 500$ | $2 \times 354$ | $2 \times 223$ |

- hashing with arbitrary length at a security level of 256 bits: 501 Mbits/s;
- authenticated encryption at a security level of 256 bits: more than $501 \mathrm{Mbits} / \mathrm{s}$ (the generic security of keyed sponges allows one to use less capacity than for hashing, hence a larger rate and a proportionally larger throughput) [23].


## VIII. Conclusion

The stream cipher ChaCha, the block cipher Threefish, and the hash functions BLAKE and Skein are based on the same arithmetic operations. In this work, we showed that the same design philosophy allows one to design lightweight coprocessors for hashing and encryption. The key element of our approach is to take advantage of the parallelism of the algorithms to:

- deeply pipeline the ALU to achieve a high clock frequency;
- avoid data dependencies by interleaving independent tasks.

Furthermore, we described how to design compact control units thanks to a careful organization of the register file, loop
unrolling, and a simple compression algorithm. Our architectures are mainly designed for embedded systems. Thus, it would be interesting to conduct side-channel and fault injection attacks in future work.

Our results show that BLAKE and ChaCha are excellent candidates for lightweight coprocessors. However, since all symmetric cryptographic functions can be implemented by means of keyed sponges, we are planning to design hardware architectures based on the new SHA-3 algorithm. According to the preliminary results reported in [11], SHA-3 could outperform BLAKE and ChaCha.

## Acknowledgements

The authors would like to thank Ray Cheung for his valuable comments. This work was partially supported by the Japanese Society of Promotion of Science (JSPS) through the A3 Foresight Program (Research on Next Generation Internet and Network Security). Additionally the authors would like to acknowledge Xilinx and the Xilinx University Program for its generous donation of materials in terms of design tools.

## References

[1] J.-P. Aumasson, L. Henzen, W. Meier, and R. Phan, "SHA-3 proposal BLAKE (version 1.4)," Jan. 2011, available at http://www.131002.net/ blake

Table X
Compact implementations of BLAKE and Skein on Virtex- 5 and Virtex-6 FPGAs. The throughput is computed for a one-block MESSAGE.

|  | Supported algorithm(s) | FPGA | Area [slices] | 36k memory blocks | Frequency [MHz] | $\begin{gathered} \text { Throughput } \\ \text { [Mbits/s] } \\ \hline \end{gathered}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Latif et al. $\mathbf{1 6}^{\ddagger}$ | Skein-256-256 | xc5vlx110-3 | 821 | Not specified | 119 | 1610 |
| Jungk [17 ${ }^{\ddagger}$ | Skein-512-256 | xc5v | 555 | - | 271 | 237 |
| Jungk $\overline{18}^{\text {T}}$ | Skein-512-256 | xc6v | 406 | - | 318 | 277 |
| Kaps et al. 19] | Skein-512-256 | xc6vlx75t-1 | 207 | 1 | 166 | 17 |
| Kaps et al. 19 | Skein-512-256 | xc6vlx $75 \mathrm{t}-1$ | 193 | - | 193 | 21 |
| Kerckhof et al. $\|5\|^{\ddagger}$ | Skein-512-512 | xc6vlx $75 \mathrm{t}-1$ | 240 | - | 160 | 179 |
| Aumasson et al. [1] | BLAKE-256 | xc5vlx110 | 390 | - | 91 | 412 |
| Jungk 18 | BLAKE-256 | xc6v | 235 | - | 231 | 518 |
| Jungk [18] | BLAKE-256 | xc6v | 404 | - | 185 | 823 |
| Kaps et al. 19] | BLAKE-256 | xc6vlx75t-1 | 163 | 1 | 197 | 327 |
| Kaps et al. 19 | BLAKE-256 | xc6vlx75t-1 | 166 | - | 268 | 445 |
| Aumasson et al. [1] | BLAKE-512 | xc5vlx110 | 939 | - | 59 | 468 |
| Kerckhof et al. 5 ] | BLAKE-512 | xc6vlx75t-1 | 192 | - | 240 | 183 |

Table XI
Place-and-route results for hashing and AES encryption on a Virtex-6 FPGA (XC6VLX75T-2).

| FPGA | Area | Frequency |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | [MHz] |  | AES-128 | AES-192 | AES-256 | 256-bit digest | 512-bit digest |
| AES \& ECHO $[\mathbf{1 3}$ | 155 | 397 | 219 | 186 | 161 | 92 | 48 |
| AES \& Grøstl 12 | 169 | 393 | 217 | 184 | 159 | 92 | 69 |

[2] N. Ferguson, S. Lucks, B. Schneier, D. Whiting, M. Bellare, T. Kohno, J. Callas, and J. Walker, "The skein hash function family (version 1.3)," Oct. 2010, available at http://www.skein-hash.info
[3] D. Bernstein, "ChaCha, a variant of Salsa20," Jan. 2008, available at http://cr.yp.to/papers.html\#chacha
[4] J. Zhai, C. Park, and G.-N. Wang, "Hash-based RFID security protocol using randomly key-changed identification procedure," in Computational Science and Its Applications-ICCSA 2006, ser. Lecture Notes in Computer Science, M. Gavrilova, O. Gervasi, V. Kumar, C. K. Tan, D. Taniar, A. Laganà, Y. Mun, and H. Choo, Eds., no. 3983. Springer, 2006, pp. 296-305.
[5] S. Kerckhof, F. Durvaux, N. Veyrat-Charvillon, F. Regazzoni, G. Meurice de Dormale, and F.-X. Standaert, "Compact FPGA implementations of the five SHA-3 finalists," in Proceedings of the ECRYPT II Hash Workshop, 2011.
[6] N. At, J.-L. Beuchat, and İ. San, "Compact implementation of Threefish and Skein on FPGA," in Proceedings of the Fifth IFIP International Conference on New Technologies, Mobility and Security-NTMS 2012, A. Levi, M. Badra, M. Cesana, M. Ghassemian, O. Gürbüz, N. Jabeur, M. Klonowski, A. Maña, S. Sargento, and S.Zeadally, Eds. IEEE eXpress Conference Publishing, 2012.
[7] J.-L. Beuchat, E. Okamoto, and T. Yamazaki, "Compact implementations of BLAKE-32 and BLAKE-64 on FPGA," in Proceedings of the 2010 International Conference on Field-Programmable TechnologyFPT 2010, J. Bian, Q. Zhou, and K. Zhao, Eds. IEEE Press, 2010, pp. 170-177.
[8] D. Bernstein, "The Salsa20 family of stream ciphers," Dec. 2007, available at http://cr.yp.to/snuffle/salsafamily-20071225.pdf
[9] J.-P. Aumasson, S. Fischer, S. Khazaei, W. Meier, and C. Rechberger, "New features of Latin dances: Analysis of Salsa, ChaCha, and Rumba," in Fast Software Encryption-FSE 2008, ser. Lecture Notes in Computer Science, K. Nyberg, Ed., vol. 5086. Springer, 2008, pp. 470-488.
[10] T. Ishiguro, S. Kiyomoto, and Y. Miyake, "Latin dances revisited: New analytic results of Salsa20 and ChaCha," in Information and Communications Security-ICICS 2011, ser. Lecture Notes in Computer Science, S. Qing, W. Susilo, G. Wang, and D. Liu, Eds., vol. 7043. Springer, 2011, pp. 255-266.
[11] İ. San and N. At, "Compact Keccak hardware architecture for data
integrity and authentication on FPGAs," Information Security Journal: A Global Perspective, vol. 21, no. 5, pp. 231-242, 2012.
[12] N. At, J.-L. Beuchat, E. Okamoto, İ. San, and T. Yamazaki, "A low-area unified hardware architecture for the AES and the cryptographic hash function Grøstl," 2012, cryptology ePrint Archive, Report 2012/535.
[13] J.-L. Beuchat, E. Okamoto, and T. Yamazaki, "A low-area unified hardware architecture for the AES and the cryptographic hash function ECHO," Journal of Cryptographic Engineering, vol. 1, no. 2, pp. 101121, 2011.
[14] B. Baldwin, A. Byrne, L. Lu, M. Hamilton, N. Hanley, M. O'Neill, and W. Marnane, "A hardware wrapper for the SHA-3 hash algorithms," 2010, cryptology ePrint Archive, Report 2010/124.
[15] H. Warren, Hacker's Delight. Addison-Wesley, 2002.
[16] K. Latif, M. Tariq, A. Aziz, and A. Mahboob, "Efficient hardware implementation of secure hash algorithm (SHA-3) finalist - Skein," in Proceedings of the International Conference on Computer, Communication, Control and Automation-3CA2011, 2011.
[17] B. Jungk, "Compact implementations of Grøstl, JH and Skein for FPGAs," in Proceedings of the ECRYPT II Hash Workshop, 2011.
[18] -, "Evaluation of compact FPGA implementations for all SHA-3 finalists," in The Third SHA-3 Candidate Conference, Mar. 2012.
[19] J.-P. Kaps, P. Yalla, K. Surapathi, B. Habib, S. Vadlamudi, and S. Gurung, "Lightweight implementations of SHA-3 finalists on FPGAs," in The Third SHA-3 Candidate Conference, Mar. 2012.
[20] H. Technology, "FULL DATASHEET-Tiny hash core family for Xilinx FPGA," revision 2.0 (11/06/2010).
[21] K. Järvinen, "Sharing resources between AES and the SHA-3 second round candidates Fugue and Grøstl," in The Second SHA-3 Candidate Conference, Aug. 2010.
[22] M. Rogawski and K. Gaj, "A high-speed unified hardware architecture for AES and the SHA-3 candidate Grøstl," in Proceedings of the 15th Euromicro Conference on Digital System Design, Sep. 2012.
[23] The Keccak Team, Personal communication, Sep. 2012.


[^0]:    ${ }^{1}$ It is possible to reduce the size of the code by storing the table defining the permutation of $\{0 \ldots, 15\}$ parametrized by the round index $r$ (Table IV) and by generating the addresses of $m_{\sigma_{r}(2 i)}$ and $c_{\sigma_{r}(2 i+1)}$ on-the-fly. However, this approach would require a more complex control unit. As long as the micro-code fits into a single block of memory, there is no need to try to reduce the number of instructions.

