# Scrutinizing the Tower Field Implementation of the $\mathbb{F}_{2^{8}}$ Inverter - with Applications to AES, Camellia, and SM4 

Zihao Wei ${ }^{1,2,3}$ • Siwei Sun ${ }^{1,2,3 *}$. Lei $\mathbf{H u}^{1,2,3} \cdot \mathbf{M a n} \mathbf{W e i}{ }^{1,2,3}$. Joan Boyar ${ }^{4}$. René Peralta ${ }^{5}$

Received: date / Accepted: date


#### Abstract

The tower field implementation of the $\mathbb{F}_{2^{8}}$ inverter is not only the key technique for compact implementations of the S-boxes of several internationally standardized block ciphers such as AES, Camellia, and SM4, but also the underlying structure many side-channel attack resistant AES implementations rely on. In this work, we conduct an exhaustive study of the tower field representations of the $\mathbb{F}_{2^{8}}$ inverter with normal bases by applying several state-of-the-art combinatorial logic minimization techniques. As a result, we achieve improved implementations of the AES, Camellia and SM4 S-boxes in terms of area footprint. Surprisingly, we are still able to improve the currently known most compact implementation of the AES S-box from CHES 2018 by 5.5 GE, beating the record again (Excluding this work, the latest improvement was proposed at CHES 2018, which achieves 11.75 GE improvement over the optimal implementation at the time). For Camellia and SM4, the improvements are even more significant. The Verilog codes of our implementations of the AES, Camellia and SM4 S-boxes are openly available.


Keywords Tower field • Inverter • S-box • AES • Camellia • SM4

## 1 Introduction

For encrypting and authenticating the largest part of the workload of today's secure communication, symmetric-key primitives are regarded as the crypto workhorse (whereas public-key schemes are generally used for setting up the session keys). In many cases, the components of symmetric-key schemes are built on operations over finite fields. Since the symmetric-key cryptographic algorithms will eventually be implemented in software or hardware to play a role in the real world, it is of great importance to investigate how to implement their common operations

[^0]efficiently $[6,9,18,30]$. For instance, due to the rapid development of lightweight IoT devices, ongoing efforts have been made to obtain more compact ASIC implementations of symmetric-key ciphers [4, 5, 17]. Just recently, the most compact implementation of the MixColumns and the S-box of AES were reported at FSE 2018 [19] and CHES 2018 [26] respectively.

In this work, we focus on area-optimized implementations of the multiplicative inverse operation (and its affine equivalences) over $\mathbb{F}_{2^{8}}$. The AES S-box, which is affine equivalent to the $\mathbb{F}_{2^{8}}$ inverter, is the strongest $8 \times 8 \mathrm{~S}$-box known so far in terms of local security properties (i.e., non-linearity, differential uniformity, algebraic degree, etc.). Several internationally standardized block ciphers, such as Camellia and SM4, apply variants of the AES S-box in their designs, which are all affine equivalent to the $\mathbb{F}_{2^{8}}$ inverter. Despite its desirable local cryptographic properties, to implement the AES S-box in ASIC with small footprint is not a trivial task. The naive approach that encodes the AES S-box as a look-up table in a hardware description language and produces the actual circuit relying on open-source or commercial CAD tools will certainly lead to unsatisfactory results for many resource constrained applications. Today's most compact ASIC implementations of the AES S-box are based on the tower field architecture, where the operations over $\mathbb{F}_{2^{2 k}}$ are represented with operations over $\mathbb{F}_{2^{k}}$ recursively.

Moreover, several cost-effective threshold implementations of the AES S-box with resistance against side-channel attacks are built on top of the tower field architecture $[24,8,12,7]$. In threshold implementations, the most important design consideration includes the security level, number of fresh random bits required, and area consumption. Therefore, providing different implementations of the tower field structure without increasing the circuit footprint potentially offers more flexible area-randomness-security trade-off in threshold implementations.

Apart from these, breaking the AES S-box into several layers with the tower field architecture allows registers to be inserted into the middle of the computation such that the critical path can be reduced, and therefore the frequency of the system clock can be increased to boost the performance.

Related work. The tower field architecture was first proposed by Itoh and Tsujii for computing multiplicative inverse in finite fields of characteristic two [16]. At the beginning, it was applied in developing efficient implementations of publickey cryptographic algorithms involving inverse operations over $\mathbb{F}_{2^{n}}[15,25]$. Later, after the development of the Advanced Encryption Standard (AES) - a block cipher using an $S$-box affine equivalent to $\mathbb{F}_{28}$ multiplicative inverter [13], the tower field architecture found applications in compact hardware implementations of AES [27,29, 23, 32]. After a series of improvements, Canright [11,10] and Boyar et al. $[9,31]$ achieved the most compact implementations at the time, which has become the de facto standard for compact implementations of the AES S-box. Such tower field implementations of AES S-box were also intensively applied in side-channel resistant implementations of AES to reduce resource consumption. Recently Reyhani-Masoleh et al. [26] broke the record set by Canright and Boyar et al., presenting so far the most compact ASIC implementation of the AES S-box, which costs 182.25 GE under the STM 65 nm CMOS technology.

Due to the strong local cryptographic properties of the AES S-box, several well known block ciphers employ affine equivalences of the AES S-box as their S-boxes,
including Camellia and SM4. Therefore, the technique of tower field implementation naturally applies to these ciphers $[28,21,22,3,1]$.

In tower field implementations, a sequence of field extensions starting from $\mathbb{F}_{2}$ and ending at $\mathbb{F}_{2^{8}}$ of the type $\mathbb{F}_{2^{k}} \subseteq \mathbb{F}_{\left(2^{k}\right)^{2^{l}}}$ is considered. At each level of the field extension, an irreducible polynomial over $\mathbb{F}_{2^{k}}$ and a corresponding basis of $\mathbb{F}_{\left(2^{k}\right)^{2^{l}}}$ over $\mathbb{F}_{2^{k}}$ have to be specified. The irreducible polynomials and bases induce a basis of $\mathbb{F}_{2^{8}}$ over $\mathbb{F}_{2}$. The tower field architecture is implemented over this new basis with proper basis transformations to maintain the original field representation. Therefore, the choices of the field extensions, the corresponding irreducible polynomials and the bases determine the overall cost of the implementation. A summary of the choices of existing work is given in Table 1 for AES, Camellia and SM4 respectively.

Table 1: Previous tower field implementations of the AES, Camellia and SM4 Sboxes, where $\mathcal{P}$ means a polynomial basis is used, $\mathcal{N}$ means a normal basis is used, and \#Cases denotes the number of cases considered.

| Cipher | Source | Tower field architecture and basis | \#Cases |
| :---: | :---: | :---: | :---: |
| AES | [27][32] | $\mathbb{F}_{2} \xrightarrow{w^{4}+w+1} \mathbb{F}_{2^{4}} \xrightarrow{y^{2}+y+C_{1}} \mathbb{P}^{\text {a }}{ }^{8}$ | 1 |
|  | [29] | $\mathbb{F}_{2} \xrightarrow{\frac{w^{2}+w+1}{\mathcal{P}}} \mathbb{F}_{2^{2}} \stackrel{z^{2}+z+C_{2}}{\mathcal{P}} \mathbb{F}_{2^{4}} \xrightarrow{\frac{y^{2}+y+C_{3}}{\mathcal{P}}} \mathbb{F}_{2^{8}}$ | 1 |
|  | [23] | $\mathbb{F}_{2} \xrightarrow{w^{2}+w+1} \mathbb{P}_{2^{2}} \xrightarrow{z^{2}+z+C_{2}}$ P $\mathbb{F}_{2^{4}} \xrightarrow{y^{2}+y+\nu} \mathbb{P}^{\text {P }}$ ( ${ }^{8}$ | 64 |
|  | [11] | $\mathbb{F}_{2} \xrightarrow{w^{2}+w+1} \mathbb{F}_{2^{2}} \xrightarrow[\mathcal{N}]{ } \xrightarrow{z^{2}+z+N} \mathbb{P}^{(N / \mathcal{N}} \mathbb{F}_{2}{ }^{4} \xrightarrow{y^{2}+y+\nu} \mathbb{F}_{2}{ }^{\text {d }}$ | 432 |
|  | [9] |  | 1 |
|  | [26] | $\mathbb{F}_{2} \xrightarrow{w^{4}+w^{3}+w^{2}+w+1} \mathbb{N}_{2^{4}} \xrightarrow{y^{2}+y+\nu} \mathbb{N}^{(1)} \mathbb{F}_{2}{ }^{8}$ | 32 |
| Camellia | [28] | $\mathbb{F}_{2} \xrightarrow{w^{4}+w+1} \mathbb{F}_{2^{4}} \xrightarrow{y^{2}+y+C_{6}} \mathbb{P}_{2^{8}}$ | 1 |
|  | [21] | $\begin{gathered} \mathbb{F}_{2} \xrightarrow[w^{4}+w+1]{\mathcal{P} / \mathcal{N}} \mathbb{F}_{2^{4}} \frac{y^{2}+\tau y+\nu}{\mathcal{P} / \mathcal{N}} \mathbb{F}_{2^{8}} \\ \mathbb{F}_{2} \xrightarrow[\mathcal{P} / \mathcal{N}]{w^{2}+w+1} \mathbb{F}_{2^{2}} \xrightarrow[\mathcal{P} / \mathcal{N}]{z^{2}+T z+N} \mathbb{F}_{2^{4}} \xrightarrow[\mathcal{P} / \mathcal{N}]{y^{2}+\tau y+\nu} \mathbb{F}_{2^{8}} \end{gathered}$ | 13 |
| SM4 | [22] | $\mathbb{F}_{2} \xrightarrow{w^{4}+w+1} \mathbb{F}_{2}{ }^{4} \xrightarrow{y^{2}+C_{7} y+C_{8}} \xrightarrow[\mathcal{P} / \mathcal{N}]{ } \mathbb{F}_{2}{ }^{8}$ | 4 |
|  | [3] | $\mathbb{F}_{2} \xrightarrow{w^{2}+w+1} \underset{\mathcal{P}}{\longrightarrow} \mathbb{F}_{2^{2}} \frac{z^{2}+z+C_{9}}{\mathcal{P}} \mathbb{F}_{2^{4}} \xrightarrow{y^{2}+y+C_{10}} \mathbb{F}_{2^{8}}$ | 1 |
|  | [1] | $\mathbb{F}_{2} \xrightarrow{w^{2}+w+1} \underset{\mathcal{N}}{\longrightarrow} \mathbb{F}_{2^{2}} \xrightarrow[\mathcal{N}]{z^{2}+z+N} \mathbb{F}_{2^{4}} \xrightarrow[\mathcal{N}]{y^{2}+y+\nu} \mathbb{F}_{2^{8}}$ | 16 |

Contributions. As shown in Table 1, only a part of the design space of tower field implementation was explored by choosing irreducible polynomials of special forms in previous work. In particular, previous work preferred a class of parameter choices where the irreducible polynomials selected for the field extension $\mathbb{F}_{2^{2}} \subseteq$ $\mathbb{F}_{2^{4}} \subseteq \mathbb{F}_{2^{8}}$ are of the form $z^{2}+z+N$ and $y^{2}+y+\nu$, and indeed the most well known implementations of Canright's and Boyar et al.'s schemes are in this class $[11,9]$. Although some work considered other irreducible polynomials, no systematic investigation was conducted [21]. The preference for this special class
is reasonable, since with these choices of parameters, the implementations of some subcomponents of the circuit are free. Despite this heuristic, there is no concrete evidence that this configuration will result in optimal implementations. Therefore, we exhaustively examine all possible tower field representations under normal bases induced by irreducible polynomials ( 720 cases in total), and find several cases which are never considered previously enjoy the most compact implementations. Along the way, we do not only apply well-known logic minimization techniques from Canright and Boyar et al., but also resort to several state-of-the-art combinatorial logic minimization techniques $[14,30,18]$ developed in recent years. As a result, we beat the new record set by the work of Reyhani-Masoleh et al. [26] for compact implementations of the AES S-box. Moreover, the implementations of the Camellia and SM4 S-boxes are improved significantly, and we refer readers to Table 2 for a summary of the results. Naturally, these results serve to achieve more compact implementations of AES, Camellia and SM4, and potentially provide more flexible security-randomness-area trade-offs for threshold implementations of these block ciphers. The Verilog codes of our implementation of the AES, Camellia and SM4 S-boxes are provided in the Appendix.
Organization. In Section 2, we give a brief introduction of the mathematical background of the tower field representations under different bases, as well as the frequently-used logic gates for constructing digital circuits. Subsequently, we describe the details of the tower field implementation of the $\mathbb{F}_{2^{8}}$ inverter in Section 3. In Section 4, we apply state-of-the-art logic minimization techniques to a list of tower filed representations of the AES, Camellia and SM4 S-box under all possible normal bases. As a result, we obtain so far the most compact implementations of these S-boxes. We conclude the paper in Section 5 and propose possible future work. The source codes of the optimized implementations for the S-boxes of AES, Camellia and SM4 are provided in the Appendix.

## 2 Preliminaries

We first give a brief introduction of the tower field representation. Then we list a set of gates together with their functionalities and areas. These gates will be used to implement the circuits constructed in this paper, and the overall area of each circuit will be computed accordingly.

Tower field representation. Let $\mathbb{F}_{2}=\{0,1\}$ be the finite field of two elements. It is well known that the field $\mathbb{F}_{2^{k}}$ with $2^{k}$ elements can be induced by an irreducible polynomial $q(x) \in \mathbb{F}_{2}[x]$ with degree $k$, i.e., $\mathbb{F}_{2^{k}} \cong \mathbb{F}_{2}[x] /(q(x))$. Assuming that $X$ is a root of $q(x)$ over $\mathbb{F}_{2^{k}}$, then every element in $\mathbb{F}_{2^{k}}$ can be represented as an $\mathbb{F}_{2}$-linear combination $b_{k-1} X^{k-1}+\cdots+b_{1} X+b_{0}$ of $\left[X^{k-1}, \cdots, X, X^{0}\right]$, which is a polynomial basis of $\mathbb{F}_{2^{k}}$ over $\mathbb{F}_{2}$. To be concrete, we take $k=8$, and we call $\left(b_{7}, \cdots, b_{0}\right)$ the bit-vector representation of $b_{k-1} X^{k-1}+\cdots+b_{1} X+b_{0}$ under the basis $\left[X^{7}, \cdots, X, X^{0}\right]$.

Considering a sequence of field extensions $\mathbb{F}_{2} \subseteq \mathbb{F}_{2^{2}} \subseteq \mathbb{F}_{2^{4}} \subseteq \mathbb{F}_{2^{8}}$ shown in Fig. 1. Let $r(y) \in \mathbb{F}_{2^{4}}[y], s(z) \in \mathbb{F}_{2^{2}}[z]$ and $t(w) \in \mathbb{F}_{2}[w]$ be irreducible polynomials over their respective fields, and let $Y \in \mathbb{F}_{2^{8}}, Z \in \mathbb{F}_{2^{4}}$ and $W \in \mathbb{F}_{2^{2}}$ be roots of $r(y), s(z)$ and $t(w)$ over the corresponding fields respectively. Then we obtain a set of normal basis: $\left[Y^{16}, Y\right]$ is a basis of $\mathbb{F}_{2^{8}}$ over $\mathbb{F}_{2^{4}},\left[Z^{4}, Z\right]$ is a basis of

| 29．82I | 92．8LI | 9z＊06I | 29．961 |  |  |  | I | 6 |  |  | \％8 |  | 99 | s．nno |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 00．91E | $00^{\circ} \mathrm{E}$ I¢ | 09．998 | L90098 |  |  |  |  |  |  | 98 |  |  | モ\＆L | ［I］ |  |
| 00.868 | 9L Z 68 | 9L2LT | \＆\％\％ 0 ¢ |  |  |  |  |  |  | ¢9 |  |  | LSI | ［¢］ |  |
| L9＇78\％ | 92．82\％ | 00．818 | 29．918 |  |  |  | II |  |  | 89 |  |  | 66 | ［zz］ |  |
| 29．24I | 92．24I | 92．t6I | 88：007 |  |  |  | I | 8 |  |  | 88 |  | 89 | s．m， | еп！ршиР |
| 29．82\％ | 09．92\％ | $0 \mathrm{c}^{\circ} \mathrm{E}$ ¢ $¢$ | 88．918 |  |  |  | 6 |  |  | 98 |  |  | \＆II | ［tz］ |  |
| V／N | 92．92L | V／N | V／N |  |  |  |  | 8 |  |  | 88 | 6 | 19 | $\mathrm{s.m}_{\mathrm{O}}$ | SGV |
| 00．62I | $00^{\circ} 62 \mathrm{I}$ | 97．96I | 00\％z0z |  |  |  |  | 8 |  |  | ¢8 |  | 69 |  |  |
| V／N | 00＇9z＇z8L | V／N | $\mathrm{V} / \mathrm{N}$ | $\pm$ |  |  | $\dagger$ | 4 |  |  | 47 | $\varepsilon$ | ¢9 | ［97］ |  |
| 00．981 | $\mathrm{V} / \mathrm{N}$ | V／N | $\mathrm{V} / \mathrm{N}$ |  | I | $L$ | 9 | $\varepsilon$ |  |  | 18 |  | 69 |  |  |
| 00．88L | 00．88L | 97＇902 | $00^{\prime}$ LIz |  |  |  | $\pm$ | $\varepsilon$ | $\pm$ |  | 68 |  | 69 |  |  |
| 00．76I | $00^{\circ} \mathrm{*} 61$ | 9でもIを | 00＇t L \％ |  |  |  |  |  |  |  | 78 |  | 18 |  |  |
| L9＇ロ0z | 00 ＇zoz | 9\％＇08\％ | 29． 188 |  |  |  |  |  |  | 28 |  |  | 18 |  |  |
| L9．807 | $00 \cdot 90 z$ | 9L＇®Ez | 88．98\％ |  |  |  |  |  |  | 78 |  |  | 88 | ［6］ |  |
| $00 \cdot 007$ | $00 \cdot 00 z$ | $00.07 \%$ | L9．97\％ |  |  |  |  | 9 |  |  | £¢ |  | 08 | ［II］ |  |
| $00^{\circ} \mathrm{Oちz}$ | $00 \cdot 28 z$ | 00.027 | $00.72 \%$ |  |  |  |  |  |  | 98 |  |  | 96 | ［\＆7］ |  |
| 00．8ちて | 00 ¢ちを | $00.62 \%$ | 88＇188 |  |  |  |  |  |  | 98 |  |  | 001 | ［67］ |  |
| ¢\＆ 667 | 0¢＇t6z | 9298¢ | ¢8．98¢ |  |  |  |  |  |  | 89 |  |  | LII | ［L7］ |  |
| uugt ə7esurn | uuga 9 WLS | uug9 गIWS | uu0\＆L गIWS | 乙عIVo | İIOY | İIVO | LON | \％ол | عanvn | aNV | anvo | عч0x | yonx／yox | ${ }^{\text {ә．ınoS }}$ | ıә૫d！ 0 |
|  |  |  |  | pəsn səұャワ |  |  |  |  |  |  |  |  |  |  |  |



$$
\begin{gathered}
\left.\mathbb{F}_{2} \xrightarrow[{\left[W^{2}, W\right.}]\right]{t(w)=w^{2}+w+1} \mathbb{F}_{2^{2}} \xrightarrow{\stackrel{s(z)=z^{2}+T z+N}{ } N} \mathbb{F}_{\left.2^{4}, Z\right]} \xrightarrow{q(x) \in \mathbb{F}_{2}[x]} \\
{\left[Y^{16}, Y\right]}
\end{gathered}
$$

Fig. 1: The tower field structure
$\mathbb{F}_{2^{4}}$ over $\mathbb{F}_{2^{2}}$, and $\left[W^{2}, W\right]$ is a basis of $\mathbb{F}_{2^{2}}$ over $\mathbb{F}_{2}$. Therefore, for an element $b=b_{7} X^{7}+\cdots+b_{1} X+b_{0} \in \mathbb{F}_{2^{8}}$ we have

$$
\begin{gathered}
b=\gamma_{1} Y^{16}+\gamma_{0} Y, \gamma_{1}, \gamma_{0} \in \mathbb{F}_{2^{4}}, \\
\gamma_{1}=\Gamma_{3} Z^{4}+\Gamma_{2} Z, \gamma_{0}=\Gamma_{1} Z^{4}+\Gamma_{0} Z, \Gamma_{3}, \Gamma_{2}, \Gamma_{1}, \Gamma_{0} \in \mathbb{F}_{2^{2}} \\
\Gamma_{3}=g_{7} W^{2}+g_{6} W, \Gamma_{2}=g_{5} W^{2}+g_{4} W, \Gamma_{1}=g_{3} W^{2}+g_{2} W, \Gamma_{0}=g_{1} W^{2}+g_{0} W \\
g_{i} \in \mathbb{F}_{2}, 0 \leq i \leq 7
\end{gathered}
$$

which implies $b=b_{7} X^{7}+\cdots+b_{1} X+b_{0}=g_{7} W^{2} Z^{4} Y^{16}+g_{6} W Z^{4} Y^{16}+g_{5} W^{2} Z Y^{16}+$ $g_{4} W Z Y^{16}+g_{3} W^{2} Z^{4} Y+g_{2} W Z^{4} Y+g_{1} W^{2} Z Y+g_{0} W Z Y$. That is, $b$ can be represented as $\left(g_{7}, \cdots, g_{0}\right)$ under the tower basis

$$
\mathcal{T B}=\left[W^{2} Z^{4} Y^{16}, W Z^{4} Y^{16}, W^{2} Z Y^{16}, W Z Y^{16}, W^{2} Z^{4} Y, W Z^{4} Y, W^{2} Z Y, W Z Y\right]
$$

induced by $W, Z$ and $Y$. We call $\left(g_{7}, \cdots, g_{0}\right)$ the bit-vector representation of $b$ under the tower basis. Assuming that the tower basis $\mathcal{T B}$ can be represented by the original polynomial basis with a matrix $M_{t} \in \mathbb{F}_{2}^{8 \times 8}$ as

$$
\mathcal{T B}=\left(X^{7}, \cdots, X^{0}\right) M_{t}
$$

we have

$$
\left(b_{7}, \cdots, b_{0}\right)^{\top}=M_{t} \cdot\left(g_{7}, \cdots, g_{0}\right)^{\top} \text { or }\left(g_{7}, \cdots, g_{0}\right)^{\top}=M_{t}^{-1} \cdot\left(b_{7}, \cdots, b_{0}\right)^{\top} .
$$

Therefore, we can change the representations by multiplying $M_{t}$ or $M_{t}^{-1}$, and we call $M_{t}$ the basis transformation matrix.

Considering the example from AES shown in Fig. 1, where $q(x)$ is the Rijdael polynomial $x^{8}+x^{4}+x^{3}+x+1, \tau=X^{7}+X^{5}+X^{4}+X^{3}+X^{2}+1, \nu=X^{7}+X^{6}+X^{5}$, $T=X^{7}+X^{5}+X^{4}+X^{3}+X^{2}+1, N=1, W=X^{7}+X^{5}+X^{4}+X^{3}+X^{2}$, $Z=X^{6}+X^{4}$ and $Y=X^{6}+X^{3}$. Then we have $\mathcal{T B}=\left(X^{7}, \cdots, X^{0}\right) \cdot M_{t}$, where

$$
M_{t}=\left(\begin{array}{lllllllll}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 1
\end{array}\right) .
$$

Frequently-used gates. The circuits of this paper are eventually synthesized with the gates provided in common cell libraries. We list a set of frequently-used gates in Table 3, where the area is measured in gate equivalence (GE), corresponding to the area of a two-input drive-strength-one NAND gate.

Note that apart from those common gates (XOR, XNOR, AND, NAND, OR, NOR, NOT) which are available in almost all CMOS technology libraries, we also list some compound gates (XOR3, NAND3, OAI21, AOI21, OAI32).

The data of STM 65 nm library is collected from Reyhani-Masoleh et al.'s paper [26], while the others comes from library files and databooks. The cell in blank of STM 65 nm means the corresponding gate does not appear at [26], and the cell labeled as N/A means the library does not support this kind of gate.

Table 3: Frequently-used gates in common CMOS technology libraries.

| Gate | Area (GE) |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | SMIC 130nm | SMIC 65nm | STM 65nm | Nangate 45 nm |
| XOR: $(a, b) \mapsto a \oplus b$ | 2.33 | 2.25 | 2 | 2 |
| XNOR: $(a, b) \mapsto a \odot b$ | 2.33 | 2.25 | 2 | 2 |
| XOR3: $(a, b, c) \mapsto a \oplus b \oplus c$ | 5.67 | 4.75 | 3.75 | N/A |
| AND: $(a, b) \mapsto a \cdot b$ | 1.33 | 1.5 | 1.25 | 1.33 |
| NAND: $(a, b) \mapsto \overline{a \cdot b}$ | 1 | 1 | 1 | 1 |
| NAND3: $(a, b, c) \mapsto \overline{a \cdot b \cdot c}$ | 1.33 | 1.25 | 1.25 | 1.33 |
| OR: $(a, b) \mapsto a \mid b$ | 1.33 | 1.5 | 1.25 | 1.33 |
| NOR: $(a, b) \mapsto \overline{a \mid b}$ | 1 | 1 | 1 | 1 |
| NOT: $a \mapsto \bar{a}$ | 0.66 | 0.75 | 0.75 | 0.66 |
| OAI21: $(a, b, c) \mapsto \overline{(a \mid b) \cdot c}$ | 1.67 | 1.5 |  | 1.33 |
| AOI21: $(a, b, c) \mapsto \overline{(a \cdot b) \mid c}$ | 1.67 | 1.5 |  | 1.33 |
| OAI32: $(a, b, c, d, e) \mapsto \overline{(a\|b\| c) \cdot(d \mid e)}$ | 2.33 | N/A | 2 | N/A |

## 3 Tower Field Implementation of the $\mathbb{F}_{2^{8}}$ Inverter

In this subsection, we give an introduction to the tower field implementation of the $\mathbb{F}_{2^{8}}$ inverter. Please note that the derivation of these results can all be found in Canright's paper [11,10].

Consider the field extension $\mathbb{F}_{2^{4}} \subseteq \mathbb{F}_{2^{8}}$ with an irreducible polynomial $r(y)=$ $y^{2}+\tau y+\nu \in \mathbb{F}_{2^{4}}[y]$. Let $Y \in \mathbb{F}_{2^{8}}$ be a root of $r(y)$. Then $Y^{16}$ and $Y$ form a normal basis, and every element $G \in \mathbb{F}_{2^{8}}$ can be represented as $G=\gamma_{1} Y^{16}+\gamma_{0} Y$, where $\gamma_{1}, \gamma_{0} \in \mathbb{F}_{2^{4}}$. Let $G^{-1}=\delta_{1} Y^{16}+\delta_{0} Y$ with $\delta_{1}, \delta_{0} \in \mathbb{F}_{2^{4}}$ be the inverse of $G$. By Solving the equation

$$
G \cdot G^{-1}=\left(\gamma_{1} Y^{16}+\gamma_{0} Y\right)\left(\delta_{1} Y^{16}+\delta_{0} Y\right)=1
$$

for $\delta_{1}$ and $\delta_{0}$, we obtain

$$
\begin{aligned}
\delta_{1} & =\left[\gamma_{1} \gamma_{0} \tau^{2}+\left(\gamma_{1}+\gamma_{0}\right)^{2} \nu\right]^{-1} \gamma_{0} \\
\delta_{0} & =\left[\gamma_{1} \gamma_{0} \tau^{2}+\left(\gamma_{1}+\gamma_{0}\right)^{2} \nu\right]^{-1} \gamma_{1} .
\end{aligned}
$$

Therefore, given $r(y)$ and the basis $\left[Y^{16}, Y\right]$, we can compute $G^{-1}=\left(\delta_{1}, \delta_{0}\right)$ from $G=\left(\gamma_{1}, \gamma_{0}\right)$ using operations over $\mathbb{F}_{2^{4}}$, which is illustrated in Fig. 2, where $\phi=\gamma_{1} \gamma_{0} \tau^{2}+\left(\gamma_{1}+\gamma_{0}\right)^{2} \nu$ and $\lambda=\phi^{-1}$.


Fig. 2: The $\mathbb{F}_{2^{8}}$ inverter

Multiplication and Inverse over $\mathbb{F}_{2^{4}}$ Extend $\mathbb{F}_{2^{2}}$ to $\mathbb{F}_{2^{4}}$ with an irreducible polynomial $s(z)=z^{2}+T z+N \in \mathbb{F}_{2^{2}}[z]$, and let $Z \in \mathbb{F}_{2^{4}}$ be a root of $s(z)$. Then every element in $\mathbb{F}_{2^{4}}$ can be represented as an $\mathbb{F}_{2^{2}}$-linear combination of the normal basis $\left[Z^{4}, Z\right]$. Let $\gamma=\Gamma_{1} Z^{4}+\Gamma_{0} Z$, and $\lambda=\Lambda_{1} Z^{4}+\Lambda_{0} Z$, where $\Gamma_{i}, \Lambda_{j} \in \mathbb{F}_{2^{2}}$. Then the multiplication of $\gamma$ and $\lambda$ can be calculated as

$$
\begin{align*}
\gamma \lambda= & \left(\Gamma_{1} Z^{4}+\Gamma_{0} Z\right)\left(\Lambda_{1} Z^{4}+\Lambda_{0} Z\right) \\
= & {\left[\Gamma_{1} \Lambda_{1} T+\left(\Gamma_{1}+\Gamma_{0}\right)\left(\Lambda_{1}+\Lambda_{0}\right) N T^{2}\right] Z^{4}+}  \tag{1}\\
& {\left[\Gamma_{0} \Lambda_{0} T+\left(\Gamma_{1}+\Gamma_{0}\right)\left(\Lambda_{1}+\Lambda_{0}\right) N T^{2}\right] Z, }
\end{align*}
$$

which is illustrated in Figure 3.


Fig. 3: The $\mathbb{F}_{2^{4}}$ multiplier

Let $\phi=\Phi_{1} Z^{4}+\Phi_{0} Z$ with $\Phi_{i} \in \mathbb{F}_{2^{2}}$. It can be shown that the inverse $\phi^{-1}$ of $\phi$ is

$$
\begin{equation*}
\left[\Phi_{1} \Phi_{0} T^{2}+\left(\Phi_{1}+\Phi_{0}\right)^{2} N\right]^{-1} \Phi_{0} Z^{4}+\left[\Phi_{1} \Phi_{0} T^{2}+\left(\Phi_{1}+\Phi_{0}\right)^{2} N\right]^{-1} \Phi_{1} Z \tag{2}
\end{equation*}
$$

whose circuit is depicted in Figure 4.


Fig. 4: The $\mathbb{F}_{2^{4}}$ inverter

Multiplication and Inverse over $\mathbb{F}_{2^{2}}$. Consider the field extension $\mathbb{F}_{2} \subseteq \mathbb{F}_{2^{2}}$ with irreducible polynomial $t(w)=w^{2}+w+1 \in \mathbb{F}_{2}[w]$ (the only irreducible polynomial in $\mathbb{F}_{2}[w]$ ). Let $W$ be a root of $t(w)$ over $\mathbb{F}_{2^{2}}$. Then every element $\Gamma \in \mathbb{F}_{2^{2}}$ can be represented as an $\mathbb{F}_{2}$-linear combination $\Gamma=u_{1} W^{2}+u_{0} W$ of the normal basis $\left[W^{2}, W\right]$, with $u_{i} \in \mathbb{F}_{2}$. Let $\Delta=v_{1} W^{2}+v_{0} W$ with $v_{j} \in \mathbb{F}_{2}$ be another element in $\mathbb{F}_{2^{2}}$. The multiplication is given by

$$
\begin{align*}
\Gamma \Delta= & \left(u_{1} W^{2}+u_{0} W\right)\left(v_{1} W^{2}+v_{0} W\right) \\
= & {\left[u_{1} \cdot v_{1} \oplus\left(u_{1} \oplus u_{0}\right) \cdot\left(v_{1} \oplus v_{0}\right)\right] W^{2}+}  \tag{3}\\
& {\left[u_{0} \cdot v_{0} \oplus\left(u_{1} \oplus u_{0}\right) \cdot\left(v_{1} \oplus v_{0}\right)\right] W, }
\end{align*}
$$

whose implementation is shown in Figure 5. In addition, if $\Gamma \Delta=1$, it can be shown that $v_{1}=u_{0}$ and $v_{0}=u_{1}$. That is, the $\mathbb{F}_{2^{2}}$ inverter can be implemented by swapping the two 1 -bit input signals, which is free.


Fig. 5: The $\mathbb{F}_{2^{2}}$ multiplier

Remark. Finally, we would like to mention another two formulas which are useful later:

$$
\begin{align*}
\Gamma \Delta \cdot W & =\left(u_{1} \cdot v_{1} \oplus u_{0} \cdot v_{0}\right) W^{2}+\left[u_{1} \cdot v_{1} \oplus\left(u_{1} \oplus u_{0}\right) \cdot\left(v_{1} \oplus v_{0}\right)\right] W \\
\Gamma \Delta \cdot W^{2} & =\left[u_{0} \cdot v_{0} \oplus\left(u_{1} \oplus u_{0}\right) \cdot\left(v_{1} \oplus v_{0}\right)\right] W^{2}+\left(u_{1} \cdot v_{1} \oplus u_{0} \cdot v_{0}\right) W \tag{4}
\end{align*}
$$

According to Equation 4, the implementation cost of a multiplication followed with a $W$ (or $W^{2}$ ) scaler is the same as that of the multiplication $\Gamma \Delta$, which requires 4 XOR gates and 3 AND gates.

## 4 Applications to the S-boxes of AES, Camellia, and SM4

The S-boxes of AES, Camellia, and SM4 are all affine equivalent to the $\mathbb{F}_{2^{8}}$ inverter, which can be unified into the following form

$$
S(b)=M_{2} \cdot I_{\mathcal{P} \mathcal{B}}^{q}\left(M_{1} \cdot b \oplus C_{1}\right) \oplus C_{2}, b \in \mathbb{F}_{2^{8}}
$$

where $M_{1}, M_{2}$ are $8 \times 8$ matrices over $\mathbb{F}_{2}, C_{1}, C_{2}$ are constant column vectors in $\mathbb{F}_{2}^{8}$, and $I_{\mathcal{P B}}^{q}: \mathbb{F}_{2}^{8} \rightarrow \mathbb{F}_{2}^{8}$ is a function that maps the bit-vector representation of an element in $\mathbb{F}_{2^{8}}$ to the representation of its inverse in $\mathbb{F}_{2^{8}}$ under a polynomial basis of $\mathbb{F}_{2^{8}}$ over $\mathbb{F}_{2}$ induced by an irreducible polynomial $q(x) \in \mathbb{F}_{2}[x]$. We refer readers to Table 4 for the concrete values of these parameters for AES, Camellia and SM4.

Table 4: The parameters of the S-boxes of AES, Camellia, and SM4, where a hexadecimal number represents an irreducible polynomial in $\mathbb{F}_{2}[x]$ (e.g., $x^{8}+x^{4}+$ $x^{3}+x+1$ is represented by $\left.0 \times 11 \mathrm{~B}\right)$.

| Cipher | $M_{1}$ | $C_{1}$ | $M_{2}$ | $C_{2}$ | $q(x)$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| AES | $\left(\begin{array}{llllllll}1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\end{array}\right)$ | $\left(\begin{array}{l}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0\end{array}\right)$ | $\left(\begin{array}{llllllll}1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1\end{array}\right)$ | $\left(\begin{array}{l}0 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1\end{array}\right)$ | 0x11B |
| Camellia* | $\left(\begin{array}{llllllll}0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0\end{array}\right)$ | $\left(\begin{array}{l}1 \\ 1 \\ 1 \\ 0 \\ 1 \\ 1 \\ 0 \\ 1\end{array}\right)$ | $\left(\begin{array}{llllllll}0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\end{array}\right)$ | $\left(\begin{array}{l}0 \\ 1 \\ 1 \\ 0 \\ 1 \\ 1 \\ 1 \\ 0\end{array}\right)$ | 0x169 |
| SM4 | $\left(\begin{array}{llllllll}1 & 1 & 0 & 1 & 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1\end{array}\right)$ | $\left(\begin{array}{l}1 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 1\end{array}\right)$ | $\left(\begin{array}{llllllll}1 & 1 & 0 & 1 & 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1\end{array}\right)$ | $\left(\begin{array}{l} 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 1 \end{array}\right)$ | 0x1F5 |

* The description of the Camellia S-box in the original specification [2] is different from ours. Readers could check the substitution table to confirm the equivalence.

However, it is difficult to implement the function $I_{\mathcal{P} \mathcal{B}}^{q}$ directly with small circuit footprint. Therefore, we first implement the function $I_{\mathcal{T B}}: \mathbb{F}_{2}^{8} \rightarrow \mathbb{F}_{2}^{8}$ which maps the representation of an element in $\mathbb{F}_{2^{8}}$ to the representation of its inverse element under the tower basis $\mathcal{T B}$. According to the discussion of Section 2, we have

$$
I_{\mathcal{P B}}^{q}(b)=M_{t} \cdot I_{\mathcal{T B}}\left(M_{t}^{-1} \cdot b\right) .
$$

Therefore, $S(b)$ can be implemented in practice as

$$
\begin{equation*}
S(b)=M_{2} M_{t} \cdot I_{\mathcal{T B}}\left(M_{t}^{-1} M_{1} \cdot b \oplus M_{t}^{-1} C_{1}\right) \oplus C_{2}, b \in \mathbb{F}_{2}^{8} \tag{5}
\end{equation*}
$$

Our goal is to identify a proper tower basis such that the overall circuit footprint of the implementation of $S(b)$ is minimized. Recalling the tower field architecture shown in Figure 1, the tower basis is completely determined by the three irreducible polynomials $r(y)=y^{2}+\tau y+\nu \in \mathbb{F}_{2^{4}}[y]$, $s(z)=z^{2}+T z+N \in \mathbb{F}_{2^{2}}[z]$, $t(w)=w^{2}+w+1 \in \mathbb{F}_{2}[w]$, and their roots $Y, Z$ and $W$. Therefore, the $2^{15}$ possible choices of $\tau, \nu, T, N, Y, Z$, and $W$ form the overall design space ${ }^{1}$, in which there are only 720 valid cases (we discard equivalent classes and non-irreducible polynomials).

Concretely, there are six possibilities for $(T, N)$ making $s(z)$ irreducible, and they are $\left\{(1, W),\left(1, W^{2}\right),(W, 1),(W, W),\left(W^{2}, 1\right),\left(W^{2}, W^{2}\right)\right\}$. For each possible choice of $(T, N), 120$ cases of $(\tau, \nu)$ can be identified such that $r(y)$ is irreducible. We can choose either one of the two roots for $W, Z$ and $Y$ because the other roots are exactly $W^{2}, Z^{4}$ and $Y^{16}$. So altogether there are $6 \times 120 \times 1 \times 1 \times 1=720$ valid cases. We exhaust all these cases for AES, Camellia and SM4 and list the optimal parameter choices in terms of compactness in Table 5.

Table 5: Optimal choices of parameters for AES, Camellia and SM4 in terms of compactness. The parameters are given with their polynomial basis representations (e.g., $X^{7}+X^{5}+X^{4}+X^{3}+X^{2}$ is represented by $0 \times B C$ ).

| Cipher | $W$ | $T$ | $N$ | $Z$ | $\tau$ | $\nu$ | $Y$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AES | 0xBC | $0 \times B C$ | $0 \times 01$ | $0 \times B 0$ | $0 \times B D$ | $0 \times 5 C$ | $0 \times F 4$ |
| Camellia | $0 \times 7 \mathrm{E}$ | $0 \times 7 \mathrm{E}$ | $0 \times 01$ | $0 \times 15$ | $0 \times 01$ | $0 \times 06$ | $0 \times 02$ |
| SM4 | 0x5C | 0x5C | $0 \times 01$ | $0 \times 7 \mathrm{~A}$ | $0 \times 77$ | $0 \times 27$ | $0 \times 66$ |

According to Table 5, for the parameters of AES, we have the following relationship:

$$
\begin{equation*}
T=W, N=1, \tau=T \cdot Z^{4}+T \cdot Z, \nu=0 \cdot Z^{4}+T^{2} \cdot Z \tag{6}
\end{equation*}
$$

Similarly, for Camellia, we have $T=W, N=1, \tau=T^{2} \cdot Z^{4}+T^{2} \cdot Z, \nu=T \cdot Z^{4}+0 \cdot Z$, and for SM4, we have $T=W, N=1, \tau=T^{2} \cdot Z^{4}+1 \cdot Z, \nu=T \cdot Z^{4}+T^{2} \cdot Z$. In what follows, we focus on the optimization of the AES S-box with the optimal parameter we identified as an example. For Camellia, SM4 and other parameters, the same procedure is performed.
4.1 Optimized Implementation of the $\mathbb{F}_{2^{4}}$ Multiplier, $\tau^{2}$ Multiplier, and the Square- $\nu$-scaler as a Whole

Before giving the optimized implementation, we unfold the circuits of the $\mathbb{F}_{2^{4}}$ multiplier, $\tau^{2}$ multiplier, and square- $\nu$-scaler one by one without any optimization. Based on these unfolded circuits, we reduce their areas by applying several logic minimization techniques with necessary tweaks.

[^1]$\mathbb{F}_{2^{4}}$ Multiplier. By plugging Figure 5 into Figure 3, we obtain the gate-level circuit of the $\mathbb{F}_{2^{4}}$ multiplier shown in Figure 6, which can also be derived by substituting Equation 3 and 4 into Equation 1.


Fig. 6: $\mathbb{F}_{2^{4}}$ multiplier
$\tau^{2}$ multiplier. According to Table 5 and Equation 6, we have $\tau=T Z^{4}+T Z=$ $W Z^{4}+W Z$ and $\tau^{2}=Z^{4}+Z$. Let $\alpha=\left(a_{3} W^{2}+a_{2} W\right) Z^{4}+\left(a_{1} W^{2}+a_{0} W\right) Z$ be an arbitrary element in $\mathbb{F}_{2^{4}}$. We have

$$
\begin{equation*}
\alpha \tau^{2}=\left[\left(a_{3} \oplus a_{2}\right) W^{2}+a_{3} W\right] Z^{4}+\left[\left(a_{1} \oplus a_{0}\right) W^{2}+a_{1} W\right] Z, \tag{7}
\end{equation*}
$$

leading to the gate-level circuit of the $\tau^{2}$ multiplier shown in Figure 7.


Fig. 7: $\tau^{2}$ multiplier
$\mathbb{F}_{2^{4}}$ square- $\nu$-scaler. According to Table 5 and Equation $6, \nu=W^{2} Z$. Let $\alpha=$ $\left(a_{3} W^{2}+a_{2} W\right) Z^{4}+\left(a_{1} W^{2}+a_{0} W\right) Z$ be an arbitrary element in $\mathbb{F}_{2^{4}}$. Then $\alpha^{2} \nu$ can be computed as $\alpha^{2} \nu=\left[\left(a_{3}+a_{1}\right) W^{2}+\left(a_{3}+a_{2}+a_{1}+a_{0}\right) W\right] Z^{4}+\left[\left(a_{1}+a_{0}\right) W^{2}+\right.$ $\left.a_{0} W\right] Z$, whose gate-level circuit is shown in Figure 8.


Fig. 8: $\mathbb{F}_{2^{4}}$ square- $\nu$-scaler


Fig. 9: A part of the $\mathbb{F}_{2^{8}}$ inverter (unoptimized)

Now, we can assemble the $\mathbb{F}_{2^{4}}$ multiplier, $\tau^{2}$ multiplier, and square- $\nu$-scaler to obtain a part of the $\mathbb{F}_{2^{8}}$ inverter according to Figure 2, which gives the circuit shown in Figure 9. According to Equation 4, this circuit can be partially optimized with some tweaks on the eight XOR gates appearing at the lower part of Figure 9, leading to the circuit shown in Figure 10. Subsequently, by applying the formula

$$
\begin{equation*}
a \cdot b \oplus a \oplus b=a \mid b \tag{8}
\end{equation*}
$$

we can transform the circuit shown in Figure 10 into the circuit presented in Figure 11. According to Equation 8, at best, we can replace two XOR gates and


Fig. 10: A part of the $\mathbb{F}_{2^{8}}$ inverter (partially optimized)
one AND gate by a single OR gate. This happens for the gates marked by number 3 . However, when some intermediate value of the computation $a \cdot b \oplus a \oplus b$ is required, we still need to keep some intermediate gates. For example, we can only replace two XOR gates with one OR gate and keep the AND gate intact. Similarly, for the gates marked with number 1 and number 2 , we can only replace one XOR gates with one OR gate. Finally, by applying the formulas $a \cdot b \oplus c \cdot d=\overline{a \cdot b} \oplus \overline{c \cdot d}$ and $a \cdot b \oplus c \mid d=\overline{a \cdot b} \oplus \overline{c \mid d}$, the AND gates and OR gates can be substituted by NAND gates and NOR gates respectively.

### 4.2 Optimized Implementation of the $\mathbb{F}_{2^{4}}$ Inverter

Based on the selected parameters for $\operatorname{AES}(T=W$ and $N=1)$ given in Table 5 and Equation 6, Equation 2 can be simplified as

$$
\begin{equation*}
\phi^{-1}=\left[\Phi_{1} \Phi_{0} W^{2}+\left(\Phi_{1}+\Phi_{0}\right)^{2}\right]^{-1} \Phi_{0} Z^{4}+\left[\Phi_{1} \Phi_{0} W^{2}+\left(\Phi_{1}+\Phi_{0}\right)^{2}\right]^{-1} \Phi_{1} Z \tag{9}
\end{equation*}
$$

Deviating from previous implementations $[10,11,9]$, we regard the $\mathbb{F}_{2^{4}}$ inverter as a $4 \times 4 \mathrm{~S}$-box whose permutation table is determined by Equation 9:
$[0 \mathrm{x} 0,0 \mathrm{x} 8,0 \mathrm{x} 4,0 \mathrm{xC}, 0 \mathrm{x} 2,0 \mathrm{xF}, 0 \mathrm{x} 7,0 \mathrm{x} 6,0 \mathrm{x} 1,0 \mathrm{xD}, 0 \mathrm{xA}, 0 \mathrm{xE}, 0 \mathrm{x} 3,0 \mathrm{x} 9,0 \mathrm{xB}, 0 \mathrm{x} 5]$.


Fig. 11: A part of the $\mathbb{F}_{2^{8}}$ inverter (optimized)

To obtain optimized implementations of this S-box, we consider two recently proposed techniques. First, we employ Stoffelen's SAT-based technique [30] to produce a circuit of the $4 \times 4$ S-box: $\left(x_{3}, x_{2}, x_{1}, x_{0}\right) \mapsto\left(y_{3}, y_{2}, y_{1}, y_{0}\right)$ with minimized gate complexity, and the result is shown below:

$$
\left.\begin{array}{rlrl}
t_{1} & =\overline{x_{3} \cdot x_{0}} & t_{2} & =t_{1} \mid x_{2} \\
t_{5} & =\overline{x_{2} \mid t_{4}} & t_{3} & =\overline{x_{2} \cdot x_{0}} \\
t_{4} & =x_{1} \oplus t_{3} & t_{6} & =\overline{x_{1} \cdot t_{4}} \\
t_{7} & =x_{3} \mid t_{4} & =t_{7} \cdot t_{2} & t_{9}
\end{array}=t_{5} \oplus t_{7}\right] ~\left(t_{11}=\overline{t_{6} \cdot t_{8}} \quad\left(y_{2}\right) \quad t_{12}=\overline{t_{8} \cdot x_{1}}\right.
$$

which contains 4 XOR/XNOR gates, 1 AND gate, 7 NAND gates, 2 OR gates and 1 NOR gate. ${ }^{2}$. This circuit (referred as SAT) can be further optimized manually. Since $a \cdot b=\overline{\bar{a} \mid \bar{b}}$, we can change the AND gate in $t_{8}$ to NOR gate, and negate its input signals without changing the overall functionality of the circuit. This new circuit (referred as SAT $^{*}$ ) contains 4 XOR/XNOR gates, 7 NAND gates and 4 NOR gates (modified signals

[^2]are colored in red):
\[

$$
\begin{aligned}
& t_{1}=\overline{x_{3} \cdot x_{0}} \\
& t_{4}=x_{1} \oplus t_{3} \\
& t_{2}=\overline{t_{1} \mid x_{2}} \\
& t_{3}=\overline{x_{2} \cdot x_{0}} \\
& t_{7}=\overline{x_{3} \mid t_{4}} \\
& t_{10}=t_{9} \odot x_{3} \quad\left(y_{0}\right) \\
& t_{8}=\overline{t_{7} \mid t_{2}} \\
& t_{13}=x_{0} \odot t_{12} \quad\left(y_{3}\right) \quad t_{14}=\overline{t_{1} \cdot x_{2}} \quad t_{15}=\overline{t_{9} \cdot t_{14}} \quad\left(y_{1}\right) .
\end{aligned}
$$
\]

We also apply the LIGHTER [18] tool to the $4 \times 4$ S-box (the $\mathbb{F}_{2^{4}}$ inverter) for four different technology libraries, which leads to the same circuit (referred as LIGHTER) containing 7 XOR/XNOR gates, 4 NAND gates, 1 NAND3 gate, 1 NOR gate and 1 NOT gate shown in the following:

$$
\begin{array}{rlrl}
t_{1} & =x_{2} \oplus x_{3} & t_{2} & =\overline{t_{1} \cdot x_{0} \cdot x_{3}} \\
t_{4} & =\overline{t_{3} \cdot t_{1}} & t_{5} & =x_{0} \oplus t_{4} \\
t_{7} & =t_{1} \oplus t_{6} & t_{8} & =\overline{t_{5} \mid t_{7}} \\
t_{6} & =\overline{x_{3} \cdot t_{5}} \\
t_{10} & =\overline{t_{9} \cdot t_{3}} & t_{11} & =t_{5} \oplus t_{10}
\end{array} \quad\left(y_{3}\right) \quad t_{9}=x_{3} \oplus t_{8} \quad\left(y_{0}\right)
$$

A comparison of the above three circuits together with their synthesizing results is given in Table 6 and Table 7, from which we can see that SAT* is always the best, whose circuit is depicted in Figure 12.


Fig. 12: The optimized circuit for the $\mathbb{F}_{2^{4}}$ inverter (SAT*)
4.3 Optimized Implementation of the Two $\mathbb{F}_{2^{4}}$ Multipliers with 4-bit Common Input

Observing Figure 2, the $\mathbb{F}_{2^{8}}$ inverter contains three $\mathbb{F}_{2^{4}}$ multipliers, from which we can identify three pairs of $\mathbb{F}_{2^{4}}$ multipliers such that each pair shares a 4-bit

Table 6: Gate counts for different implementations of $\mathbb{F}_{2^{4}}$ inverter, where the circuit named Canright is the implementation of Equation 9 using the method in [10, 11], and the circuit named Boyar uses the method in [9].

| Circuit | Gates used |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | XOR/XNOR | AND | NAND | NAND3 | OR | NOR | NOT |
| Canright | 9 |  | 8 |  |  | 2 |  |
| Boyar | 9 | 6 |  |  |  |  |  |
| Boyar* | 9 |  | 6 |  |  |  |  |
| SAT | 4 | 1 | 7 |  | 2 | 1 |  |
| SAT* | 4 |  | 7 |  |  | 4 |  |
| LIGHTER | 7 |  | 4 | 1 |  | 1 | 1 |

Table 7: Synthesized results for different implementations of the $\mathbb{F}_{2^{4}}$ inverter

| Circuit | Synthesis results |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | SMIC 130nm | SMIC 65nm | STM 65nm | Nangate 45nm |
| Canright | 30.97 | 30.25 | 28.00 | 28.00 |
| Boyar | 28.95 | 29.25 | 25.50 | 25.98 |
| Boyar* | 26.97 | 26.25 | 24.00 | 24.00 |
| SAT | 21.31 | 21.50 | 20.25 | 19.99 |
| SAT* | $\mathbf{2 0 . 3 2}$ | $\mathbf{2 0 . 0 0}$ | $\mathbf{1 9 . 0 0}$ | $\mathbf{1 9 . 0 0}$ |
| LIGHTER | 23.30 | 22.75 | 21.00 | 20.99 |

input signal: the leftmost $\mathbb{F}_{2^{4}}$ multiplier and the rightmost upper $\mathbb{F}_{2^{4}}$ multiplier, the leftmost $\mathbb{F}_{2^{4}}$ multiplier and the rightmost lower $\mathbb{F}_{2^{4}}$ multiplier, and the two rightmost $\mathbb{F}_{2^{4}}$ multipliers. It is shown in $[11,10]$ that whenever two $\mathbb{F}_{2^{4}}$ multipliers share a common 4 -bit input signal, some XOR gates can be saved via signal reuse.

As an example, we unfold the two rightmost $\mathbb{F}_{2^{4}}$ multipliers in Figure 2 according to Figure 6, and the schematic is shown in Figure 13. By observing Figure 13 carefully, we can spot some outputs of XOR gates which are computed twice in the circuit [11,10] (labeled with same numbers in the figure). Therefore, for each pair of $\mathbb{F}_{2^{4}}$ multipliers sharing a 4-bit input signal, we can remove 5 XOR gates by signal reuse. Therefore, 3 pairs of $\mathbb{F}_{2^{4}}$ multipliers with shared input signals in total save $5 \times 3=15$ XOR gates.

### 4.4 Optimized Implementation of the Input and Output Affine Parts

According to Equation 5, before going into the $\mathbb{F}_{2^{8}}$ inverter $I_{\mathcal{T B}}(\cdot)$, the 8-bit input signal of the AES S-box first goes through an affine transformation

$$
b \mapsto g=M_{t}^{-1} M_{1} \cdot b \oplus M_{t}^{-1} C_{1},
$$

which then spawns 18 1-bit signals (see Figure 11) subsequently fed into some non-linear gates (NAND, NOR). The transformation from the 81 -bit input signals to


Fig. 13: The two rightmost $\mathbb{F}_{2^{4}}$ multipliers with a shared 4-bit input
the 18 1-bit signals is affine, and can be represented as an $18 \times 8$ matrix

$$
U=\left(\begin{array}{llllllllllllllllll}
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right)^{\top},
$$

with the constant

$$
M_{t}^{-1} C_{1}=\left(\begin{array}{lllllllllllll}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} 0^{\top}\right. \text {. }
$$

By applying the SAT-based method for solving SLP (Shortest Linear StraightLine Program) [14], we obtain the optimal implementation of $U$, which costs 19

XOR gates ${ }^{3}$ :

$$
\begin{aligned}
& y_{17}=x_{0} \quad\left(y_{17}\right) \quad t_{1}=x_{7} \oplus x_{4} \quad\left(y_{6}\right) \quad t_{2}=x_{3} \oplus x_{1} \\
& t_{3}=x_{2} \oplus t_{2} \quad t_{4}=x_{6} \oplus t_{3} \quad\left(y_{7}\right) \quad t_{5}=x_{0} \oplus t_{4} \quad\left(y_{15}\right) \\
& t_{6}=x_{5} \oplus t_{3} \quad\left(y_{1}\right) \quad t_{7}=t_{5} \oplus t_{6} \quad\left(y_{14}\right) \quad t_{8}=x_{4} \oplus t_{7} \quad\left(y_{13}\right) \\
& t_{9}=t_{1} \oplus t_{2} \quad\left(y_{9}\right) \quad t_{10}=t_{1} \oplus t_{8} \quad\left(y_{11}\right) \quad t_{11}=x_{0} \oplus t_{9} \quad\left(y_{16}\right) \\
& t_{12}=x_{4} \oplus x_{2} \quad\left(y_{2}\right) \quad t_{13}=x_{1} \oplus t_{7} \quad\left(y_{10}\right) \quad t_{14}=x_{7} \oplus x_{2} \quad\left(y_{4}\right) \\
& t_{15}=t_{13} \oplus t_{14} \quad\left(y_{12}\right) \quad t_{16}=x_{7} \oplus x_{1} \quad\left(y_{0}\right) \quad t_{17}=t_{12} \oplus t_{16} \quad\left(y_{8}\right) \\
& t_{18}=t_{6} \oplus t_{9} \quad\left(y_{3}\right) \quad t_{19}=t_{7} \oplus t_{11} \quad\left(y_{5}\right)
\end{aligned}
$$

where $x_{i}$ 's are the input signals, $t_{i}$ 's are intermediate signals, and $y_{i}$ 's are output signals.

Similarly, according to Equation 5, at the output end of the AES S-box, the 8 -bit output of the two rightmost $\mathbb{F}_{2^{4}}$ multipliers (see Figure 2 and Figure 14) is transformed by the affine mapping $M_{2} M_{t}(\cdot) \oplus C_{2}$ to recover the polynomial basis representation. Observing Figure 14, the 8 input bits of the affine mapping (also the output bits of the two $\mathbb{F}_{2^{4}}$ multipliers) are originated from the 18 output bits of the NAND gates. Moreover, only XOR gates are involved to generate the 8 input bits of the affine mapping from these 18 bits. Therefore, the mapping from the 18 output bits of the NAND gates to the 8 output bits of the S-box is affine, which can be described by the following matrix

$$
B=\left(\begin{array}{lllllllllllllllllll}
0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0
\end{array}\right) .
$$

Observing Figure 14, there are two layers of XOR gates following the 18 NAND gates. According these two layers of XOR gates, we decompose the whole output affine transformation (from the 18 output bits of the 18 NAND gates to the 8 output bits of the AES S-box) into two parts. The first part maps the 18 output bits of the 18 NAND gates to the 12 output bits of the first layer of 12 XOR gates, which can be implemented with in total 12 XOR gates as shown in Figure 14. The second part maps the 12 output bits of the 12 XOR gates to the 8 output bits of the S -box, and its matrix representation $B^{\prime}$ is given in the following:

$$
B^{\prime}=\left(\begin{array}{llllllllllll}
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1  \tag{10}\\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0
\end{array}\right)
$$

[^3]

Fig. 14: The circuit for bottom part

Again, by applying the SAT based method for SLP [14] and taking $C_{2}$ into account, the affine transformation involving $B^{\prime}$ and constant addition at the output end can be realized as follows, requiring $17 \mathrm{XOR} / \mathrm{XNOR}$ gates ${ }^{4}$ :

$$
\begin{aligned}
& t_{1}=x_{2} \oplus x_{0} \\
& t_{2}=x_{10} \oplus t_{1} \\
& t_{3}=x_{8} \oplus t_{2} \quad\left(y_{7}\right) \\
& t_{4}=x_{5} \oplus x_{1} \\
& t_{5}=x_{5} \oplus x_{3} \\
& t_{8}=x_{2} \oplus t_{3} \\
& t_{6}=x_{7} \oplus t_{5} \\
& t_{7}=x_{8} \oplus x_{6} \\
& t_{11}=x_{9} \odot t_{6} \quad\left(y_{5}\right) \\
& t_{9}=x_{4} \oplus t_{8} \quad\left(y_{4}\right) \\
& t_{10}=t_{1} \oplus t_{5} \\
& t_{14}=x_{11} \oplus t_{13} \quad\left(y_{2}\right) \\
& t_{12}=t_{4} \odot t_{7} \quad\left(y_{0}\right) \\
& t_{13}=t_{3} \oplus t_{6} \\
& \left(y_{1}\right) \\
& t_{17}=t_{4} \oplus t_{9} \quad\left(y_{3}\right)
\end{aligned}
$$

[^4]where $x_{i}$ 's are the input signals, $t_{i}$ 's are intermediate signals, and $y_{i}$ 's are output signals.

### 4.5 Overall Implementation Results and Comparison

We synthesis the optimized implementations of the S-boxes (AES, Camellia, SM4) using Synopsys Design Compiler 2014 (DC 2014) with four technology libraries, and the synthesized results ${ }^{5}$ together with their technology-independent gate counts are listed in Table 2.

To make the full use of the libraries to save the circuit area, Reyhani-Masoleh et. al.'s implementations [26] exploit certain compound gates in specific libraries (e.g., XOR3, NAND3, OAI21, AOI21, OAI32), which are not universally available in all technology libraries. For example, the optimal implementation offered by [26] employs XOR3 and OAI32 gates, reaching 182.25 GE under the STM 65 nm CMOS technology.

According to the results shown in Table 2, our implementation requires only 179 GE , which beats the record set by [26] even without using any compound gates. Moreover, when the area of one XOR3 gate is smaller than the area of two XOR gates in underlying technology library, the compound gates XOR3 can be applied in our design to take the place of some standard XOR gates. With this improvements, the area of our implementation of the AES S-box can be further reduced to 176.75 GE . For the S-boxes of Camellia and SM4, it can be seen from Table 2 that the improvements are even more obvious.

## 5 Conclusion

By applying state-of-the-art combinatorial logic minimization techniques to an exhaustive list of tower field representations of the AES, Camellia, and SM4 Sboxes with normal bases, we identify so far the most compact implementations of these S-boxes. The results obtained in this work can be used in compact and threshold implementations of AES, Camellia, and SM4. As a potential further work, it is interesting to see how to apply similar techniques to obtain compact implementations of combined S-box/inverse S-box designs for AES, Camellia, and SM4. Moreover, this work only focus on minimizing the circuit area, it is of equal importance to investigate how to reduce the depth of the circuit as in [26,20].

Acknowledgment. The work is supported by the National Key R\&D Program of China (Grant No. 2018YFB0804402), the Chinese Major Program of National Cryptography Development Foundation (Grant No. MMJJ20180102), the National Natural Science Foundation of China (61732021, 61802400, 61772519, 61802399), and the Youth Innovation Promotion Association of Chinese Academy of Sciences.

[^5]
## References

1. Abbasi, I., Afzal, M.: A compact S-Box design for SMS4 block cipher. IACR Cryptology ePrint Archive 2011, 522 (2011)
2. Aoki, K., Ichikawa, T., Kanda, M., Matsui, M., Moriai, S., Nakajima, J., Tokita, T.: Camellia: A 128-bit block cipher suitable for multiple platforms - design and analysis. Lecture Notes in Computer Science pp. 39-56 (2000)
3. Bai, X., Xu, Y., Guo, L.: Securing SMS4 cipher against differential power analysis and its VLSI implementation. In: IEEE Singapore International Conference on Communication Systems, pp. 167-172 (2009)
4. Banik, S., Bogdanov, A., Minematsu, K.: Low-area hardware implementations of CLOC, SILC and AES-OTR. In: 2016 IEEE International Symposium on Hardware Oriented Security and Trust, HOST 2016, McLean, VA, USA, May 3-5, 2016, pp. 71-74 (2016)
5. Banik, S., Bogdanov, A., Regazzoni, F.: Atomic-AES: A compact implementation of the AES encryption/decryption core. In: Progress in Cryptology - INDOCRYPT 2016-17th International Conference on Cryptology in India, Kolkata, India, December 11-14, 2016, Proceedings, pp. 173-190 (2016)
6. Beierle, C., Kranz, T., Leander, G.: Lightweight multiplication in $\operatorname{GF}\left(2^{n}\right)$ with applications to MDS matrices. In: Advances in Cryptology - CRYPTO 2016-36th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 14-18, 2016, Proceedings, Part I, pp. 625-653 (2016)
7. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: A more efficient AES threshold implementation. In: Progress in Cryptology - AFRICACRYPT 2014-7th International Conference on Cryptology in Africa, Marrakesh, Morocco, May 28-30, 2014. Proceedings, pp. 267-284 (2014)
8. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Trade-offs for threshold implementations illustrated on AES. IEEE Trans. on CAD of Integrated Circuits and Systems 34(7), 1188-1200 (2015)
9. Boyar, J., Matthews, P., Peralta, R.: Logic minimization techniques with applications to cryptology. J. Cryptology 26(2), 280-312 (2013)
10. Canright, D.: A very compact Rijndael S-box. Tech. rep., Naval Postgraduate School (2005). NPS-MA-05-001
11. Canright, D.: A very compact S-Box for AES. In: Cryptographic Hardware and Embedded Systems - CHES 2005, 7th International Workshop, Edinburgh, UK, August 29 September 1, 2005, Proceedings, pp. 441-455 (2005)
12. Cnudde, T.D., Reparaz, O., Bilgin, B., Nikova, S., Nikov, V., Rijmen, V.: Masking AES with $d+1$ shares in hardware. In: Cryptographic Hardware and Embedded Systems CHES 2016-18th International Conference, Santa Barbara, CA, USA, August 17-19, 2016, Proceedings, pp. 194-212 (2016)
13. Daemen, J., Rijmen, V.: The Design of Rijndael: AES - The Advanced Encryption Standard. Information Security and Cryptography. Springer (2002)
14. Fuhs, C., Schneider-Kamp, P.: Synthesizing shortest linear straight-line programs over GF(2) using SAT. In: Theory and Applications of Satisfiability Testing - SAT 2010, 13th International Conference, SAT 2010, Edinburgh, UK, July 11-14, 2010. Proceedings, pp. 71-84 (2010)
15. Guajardo, J., Paar, C.: Efficient algorithms for elliptic curve cryptosystems. In: Advances in Cryptology - CRYPTO '97, 17th Annual International Cryptology Conference, Santa Barbara, California, USA, August 17-21, 1997, Proceedings, pp. 342-356 (1997)
16. Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses in $\operatorname{GF}\left(2^{m}\right)$ ) using normal bases. Inf. Comput. 78(3), 171-177 (1988)
17. Jean, J., Moradi, A., Peyrin, T., Sasdrich, P.: Bit-sliding: A generic technique for bit-serial implementations of SPN-based primitives - applications to AES, PRESENT and SKINNY. In: Cryptographic Hardware and Embedded Systems - CHES 2017-19th International Conference, Taipei, Taiwan, September 25-28, 2017, Proceedings, pp. 687-707 (2017)
18. Jean, J., Peyrin, T., Sim, S.M., Tourteaux, J.: Optimizing implementations of lightweight building blocks. IACR Trans. Symmetric Cryptol. 2017(4), 130-168 (2017)
19. Kranz, T., Leander, G., Stoffelen, K., Wiemer, F.: Shorter linear straight-line programs for MDS matrices. IACR Trans. Symmetric Cryptol. 2017(4), 188-211 (2017)
20. Li, S., Sun, S., Li, C., Wei, Z., Hu, L.: Constructing low-latency involutory MDS matrices with lightweight circuits. IACR Trans. Symmetric Cryptol. 2019(1), 84-117 (2019)
21. Martínez-Herrera, A.F., Mex-Perera, J.C., Nolazco-Flores, J.A.: Some representations of the S-Box of Camellia in $\operatorname{GF}\left(\left(\left(2^{2}\right)^{2}\right)^{2}\right)$. In: Cryptology and Network Security, 11 th International Conference, CANS 2012, Darmstadt, Germany, December 12-14, 2012. Proceedings, pp. 296-309 (2012)
22. Martínez-Herrera, A.F., Mex-Perera, J.C., Nolazco-Flores, J.A.: Merging the Camellia, SMS4 and AES S-Boxes in a single S-Box with composite bases. In: Information Security, 16th International Conference, ISC 2013, Dallas, Texas, USA, November 13-15, 2013, Proceedings, pp. 209-217 (2013)
23. Mentens, N., Batina, L., Preneel, B., Verbauwhede, I.: A systematic evaluation of compact hardware implementations for the Rijndael S-Box. In: Topics in Cryptology - CT-RSA 2005, The Cryptographers' Track at the RSA Conference 2005, San Francisco, CA, USA, February 14-18, 2005, Proceedings, pp. 323-333 (2005)
24. Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the limits: A very compact and a threshold implementation of AES. In: Advances in Cryptology - EUROCRYPT 2011 - 30th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tallinn, Estonia, May 15-19, 2011. Proceedings, pp. 69-88 (2011)
25. Paar, C., Soria-Rodriguez, P.: Fast arithmetic architectures for public-key algorithms over Galois Fields GF (( $\left.\left.2^{n}\right)^{m}\right)$. In: Advances in Cryptology - EUROCRYPT '97, International Conference on the Theory and Application of Cryptographic Techniques, Konstanz, Germany, May 11-15, 1997, Proceeding, pp. 363-378 (1997)
26. Reyhani-Masoleh, A., Taha, M.M.I., Ashmawy, D.: Smashing the implementation records of AES s-box. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018(2), 298-336 (2018)
27. Rudra, A., Dubey, P.K., Jutla, C.S., Kumar, V., Rao, J.R., Rohatgi, P.: Efficient Rijndael encryption implementation with composite field arithmetic. In: Cryptographic Hardware and Embedded Systems - CHES 2001, Third International Workshop, Paris, France, May 14-16, 2001, Proceedings, Generators, pp. 171-184 (2001)
28. Satoh, A., Morioka, S.: Unified hardware architecture for 128-bit block ciphers AES and Camellia. In: Cryptographic Hardware and Embedded Systems - CHES 2003, 5th International Workshop, Cologne, Germany, September 8-10, 2003, Proceedings, pp. 304-318 (2003)
29. Satoh, A., Morioka, S., Takano, K., Munetoh, S.: A compact Rijndael hardware architecture with S-Box optimization. In: Advances in Cryptology - ASIACRYPT 2001, 7th International Conference on the Theory and Application of Cryptology and Information Security, Gold Coast, Australia, December 9-13, 2001, Proceedings, pp. 239-254 (2001)
30. Stoffelen, K.: Optimizing S-Box implementations for several criteria using SAT solvers. In: Fast Software Encryption - 23rd International Conference, FSE 2016, Bochum, Germany, March 20-23, 2016, Revised Selected Papers, pp. 140-160 (2016)
31. Team, C.M.: http://www.cs.yale.edu/homes/peralta/CircuitStuff/CMT.html
32. Wolkerstorfer, J., Oswald, E., Lamberger, M.: An ASIC implementation of the AES SBoxes. In: Topics in Cryptology - CT-RSA 2002, The Cryptographer's Track at the RSA Conference, 2002, San Jose, CA, USA, February 18-22, 2002, Proceedings, pp. 67-78 (2002)

## A Verilog Code for the AES S-box

```
/* The AES S-box */
module AES (b, Sb);
    input [7:0] b;
    output [7:0] Sb;
    wire [7:0] g;
    wire [9:0] m;
    wire [3:0] p, l;
    wire [17:0] e;
    Input M1(b, g, m);
    Top M2(g, m, p);
    Middle M3(p, l);
    Bottom M4(g, m, l, e);
```

```
    Output M5(e, Sb);
endmodule
/* Gates implemented as modules to prevent
    unintentional optimization of the DC */
module XOR (t, a, b);
            output t;
    input a, b;
    xor(t, a, b);
endmodule
module XNOR (t, a, b);
    output t;
    input a, b;
    xnor(t, a, b);
endmodule
module NAND (t, a, b);
        output t;
        input a, b;
        nand(t, a, b);
endmodule
module NOR (t, a, b);
    output t;
    input a, b;
    nor(t, a, b);
endmodule
/* input matrix */
module Input (b, g, m);
        input [7:0] b;
        output [7:0] g;
        output [9:0] m;
        wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
                t11, t12, t13, t14, t15, t16, t17, t18, t19;
        XOR m1(t1, b[7], b[4]);
        XOR m2(t2, b[3], b[1]);
        XOR m3(t3, b[2], t2);
        XOR m4(t4, b[6], t3);
        XOR m5(t5, b[0], t4);
        XOR m6(t6, b[5], t3);
        XOR m7(t7, t5, t6);
        XOR m8(t8, b[4], t7);
        XOR m9(t9, t1, t2);
        XOR m10(t10, t1, t8);
        XOR m11(t11, b[0], t9);
        XOR m12(t12, b[4], b[2]);
        XOR m13(t13, b[1], t7);
        XOR m14(t14, b[7], b[2]);
        XOR m15(t15, t13, t14);
        XOR m16(t16, b[7], b[1]);
        XOR m17(t17, t12, t16);
        XOR m18(t18, t6, t9);
        XOR m19(t19, t7, t11);
        assign g = {b[0], t11, t5, t7, t8, t15, t10, t13};
        assign m = {t9, t17, t4, t1, t19, t14, t18, t12, t6, t16};
endmodule
```

```
/* top part: GF(2^4) multiplier and GF(2^4) square-scaler */
module Top (g, m, p);
    input [7:0] g;
    input [9:0] m;
    output [3:0] p;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13,
        t14, t15, t16, t17, t18, t19, t20, t21, t22, t23, t24;
    NAND m1(t1, g[6], g[2]);
    NAND m2(t2, m[9], m[8]);
    NAND m3(t3, g[7], g[3]);
    XOR m4(t13, t1, t2);
    XOR m5(t14, t3, t1);
    NAND m6(t4, m[7], m[6]);
    NOR m7(t5, m[7], m[6]);
    NAND m8(t6, m[3], m[2]);
    NOR m9(t7, m[3], m[2]);
    NAND m10(t8, m[5], m[4]);
    XOR m11(t15, t5, t6);
    XOR m12(t16, t8, t7);
    XOR m13(t17, t4, t6);
    XOR m14(t18, t8, t6);
    NAND m15(t9, g[4], g[0]);
    NOR m16(t10, m[1], m[0]);
    NAND m17(t11, g[5], g[1]);
    NOR m18(t12, g[4], g[0]);
    XOR m19(t19, t9, t10);
    XOR m20(t20, t11, t12);
    XOR m21(t21, t13, t15);
    XOR m22(t22, t14, t16);
    XOR m23(t23, t19, t17);
    XOR m24(t24, t20, t18);
    assign p[3] = t21;
    assign p[2] = t22;
    assign p[1] = t23;
    assign p[0] = t24;
endmodule
/* middle part: GF(2^4) inverse */
module Middle (p, l);
    input [3:0] p;
    output [3:0] l;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
                t11, t12, t13, t14, t15;
    NAND m1(t1, p[3], p[0]);
    NOR m2(t2, t1, p[2]);
    NAND m3(t3, p[2], p[0]);
    XOR m4(t4, p[1], t3);
    NOR m5(t5, p[2], t4);
    NAND m6(t6, p[1], t4);
    NOR m7(t7, p[3], t4);
    NOR m8(t8, t7, t2);
    XNOR m9(t9, t5, t7);
    XNOR m10(t10, t9, p[3]);
```

```
    NAND m11(t11, t6, t8);
    NAND m12(t12, t8, p[1]);
    XNOR m13(t13, p[0], t12);
    NAND m14(t14, t1, p[2]);
    NAND m15(t15, t9, t14);
    assign l[3] = t13;
    assign l[2] = t11;
    assign l[1] = t15;
    assign l[0] = t10;
endmodule
/* bottom part: GF(2^4) multipliers */
module Bottom (g, m, l, e);
    input [7:0] g;
    input [9:0] m;
    input [3:0] l;
    output [17:0] e;
    wire k4, k3, k2, k1, k0;
    XOR m1(k4, l[3], l[2]);
    XOR m2(k3, l[3], l[1]);
    XOR m3(k2, l[2], l[0]);
    XOR m4(k1, k3, k2);
    XOR m5(k0, l[1], l[0]);
    NAND m6(e[17], g[2], l[2]);
    NAND m7(e[16], g[3], l[3]);
    NAND m8(e[15], m[8], k4);
    NAND m9(e[14], m[2], k1);
    NAND m10(e[13], m[4], k2);
    NAND m11(e[12], m[6], k3);
    NAND m12(e[11], g[0], l[0]);
    NAND m13(e[10], g[1], l[1]);
    NAND m14(e[9], m[0], k0);
    NAND m15(e[8], g[6], l[2]);
    NAND m16(e[7], g[7], l[3]);
    NAND m17(e[6], m[9], k4);
    NAND m18(e[5], m[3], k1);
    NAND m19(e[4], m[5], k2);
    NAND m20(e[3], m[7], k3)
    NAND m21(e[2], g[4], l[0]);
    NAND m22(e[1], g[5], l[1]);
    NAND m23(e[0], m[1], k0);
endmodule
/* output matrix */
module Output (e, Sb);
    input [17:0] e;
    output [7:0] Sb;
    wire E11, E10, E9, E8, E7, E6, E5, E4, E3, E2, E1, E0;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
        t11, t12, t13, t14, t15, t16, t17;
    XOR m1(E11, e[17], e[16]);
```

```
XOR m2(E10, e[15], e[16]);
XOR m3(E9, e[14], e[13]);
XOR m4(E8, e[12], e[13])
XOR m5(E7, e[11], e[10]);
XOR m6(E6, e[9], e[10]);
XOR m7(E5, e[8], e[7]);
XOR m8(E4, e[6], e[7]);
XOR m9(E3, e[5], e[4]);
XOR m10(E2, e[3], e[4]);
XOR m11(E1, e[2], e[1]);
XOR m12(EO, e[0], e[1]);
XOR m13(t1, E2, E0);
XOR m14(t2, E10, t1);
XOR m15(t3, E8, t2);
XOR m16(t4, E5, E1);
XOR m17(t5, E5, E3);
XOR m18(t6, E7, t5);
XOR m19(t7, E8, E6);
XOR m20(t8, E2, t3);
XOR m21(t9, E4, t8);
XOR m22(t10, t1, t5);
XNOR m23(t11, E9, t6)
XNOR m24(t12, t4, t7);
XOR m25(t13, t3, t6);
XOR m26(t14, E11, t13);
XNOR m27(t15, t1, t9);
XOR m28(t16, t10, t12);
XOR m29(t17, t4, t9);
assign Sb = {t3, t15, t11, t9, t17, t14, t16, t12};
```

endmodule

B Verilog Code for the Camellia S-box

```
/* The Camellia S-box */
module Camellia (b, Sb);
    input [7:0] b;
    output [7:0] Sb;
    wire [7:0] g;
    wire [9:0] m;
    wire [3:0] p, l;
    wire [17:0] e;
    Input M1(b, g, m);
    Top M2(g, m, p);
    Middle M3(p, l);
    Bottom M4(g, m, l, e);
    Output M5(e, Sb);
endmodule
/* Gates implemented as modules to prevent
    unintentional optimization of the DC */
module XOR (t, a, b);
    output t;
    input a, b;
    xor(t, a, b);
endmodule
module XNOR (t, a, b);
    output t;
    input a, b;
    xnor(t, a, b);
endmodule
module NAND (t, a, b);
    output t;
    input a, b;
    nand(t, a, b);
endmodule
module NOR (t, a, b);
    output t;
    input a, b;
    nor(t, a, b);
endmodule
/* input matrix */
module Input (b, g, m);
    input [7:0] b;
    output [7:0] g;
    output [9:0] m;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
                t11, t12, t13, t14, t15, t16, t17, t18, t19;
    XNOR m1(t1, b[6], b[3]);
    XNOR m2(t2, b[4], b[2]);
    XOR m3(t3, b[4], b[1]);
    XOR m4(t4, t1, t3);
    XOR m5(t5, t2, t4);
    XOR m6(t6, b[5], t2);
    XNOR m7(t7, b[0], t6);
    XOR m8(t8, b[7], b[0]);
```

```
    XOR m9(t9, t1, t8);
    XOR m10(t10, t5, t9);
    XOR m11(t11, b[1], t9);
    XNOR m12(t12, b[0], t11);
    XOR m13(t13, b[4], t11);
    XNOR m14(t14, b[2], t11);
    XOR m15(t15, b[6], b[5]);
    XNOR m16(t16, t9, t15);
    XOR m17(t17, t7, t16);
    XNOR m18(t18, b[1], t15);
    XOR m19(t19, t6, t18);
    assign g = {t5, t4, t10, t1, t16, t17, t11, t12};
    assign m = {t2, t7, t9, t18, t3, t19, t13, t6, t14, ~b[0]};
endmodule
/* top part: GF(2^4) multiplier and GF(2^4) square-scaler */
module Top (g, m, d);
    input [7:0] g;
    input [9:0] m;
    output [3:0] d;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13,
        t14, t15, t16, t17, t18, t19, t20, t21, t22, t23, t24;
    NAND m1(t1, g[7], g[3]);
    NOR m2(t2, g[6], g[2]);
    NOR m3(t3, g[7], g[3]);
    NAND m4(t4, m[9], m[8]);
    XOR m5(t13, t1, t2);
    XOR m6(t14, t3, t4);
    NAND m7(t5, m[5], m[4]);
    NAND m8(t6, m[3], m[2]);
    NAND m9(t7, m[7], m[6]);
    NOR m10(t8, m[3], m[2]);
    NOR m11(t9, m[5], m[4]);
    XOR m12(t15, t5, t6);
    XOR m13(t16, t7, t5);
    XOR m14(t17, t5, t8);
    XOR m15(t18, t7, t9);
    NAND m16(t10, g[5], g[1]);
    NAND m17(t11, g[4], g[0]);
    NAND m18(t12, m[1], m[0]);
    XOR m19(t19, t10, t11);
    XOR m20(t20, t10, t12);
    XOR m21(t21, t13, t15);
    XOR m22(t22, t14, t16);
    XOR m23(t23, t19, t17);
    XOR m24(t24, t20, t18);
    assign d[3] = t21;
    assign d[2] = t22;
    assign d[1] = t23;
    assign d[0] = t24;
endmodule
/* middle part: GF(2^4) inverse */
module Middle (p, l);
```

```
    input [3:0] p;
    output [3:0] l;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
    t11, t12, t13, t14, t15;
NAND m1(t1, p[3], p[0]);
NOR m2(t2, t1, p[2]);
NAND m3(t3, p[2], p[0]);
XOR m4(t4, p[1], t3);
NOR m5(t5, p[2], t4);
NAND m6(t6, p[1], t4);
NOR m7(t7, p[3], t4);
NOR m8(t8, t7, t2);
XNOR m9(t9, t5, t7);
XNOR m10(t10, t9, p[3]);
NAND m11(t11, t6, t8);
NAND m12(t12, t8, p[1]);
XNOR m13(t13, p[0], t12);
NAND m14(t14, t1, p[2]);
NAND m15(t15, t9, t14);
assign l[3] = t13;
assign l[2] = t11;
assign l[1] = t15;
assign l[0] = t10;
endmodule
/* bottom part: GF(2^4) multipliers */
module Bottom (g, m, l, e);
    input [7:0] g;
    input [9:0] m;
    input [3:0] 1;
    output [17:0] e;
    wire k4, k3, k2, k1, k0;
    XOR m1(k4, l[3], l[2]);
    XOR m2(k3, l[3], l[1]);
    XOR m3(k2, l[2], l[0]);
    XOR m4(k1, k3, k2);
    XOR m5(k0, l[1], l[0]);
    NAND m6(e[17], g[2], l[2]);
    NAND m7(e[16], g[3], l[3]);
    NAND m8(e[15], m[8], k4);
    NAND m9(e[14], m[2], k1);
    NAND m10(e[13], m[4], k2);
    NAND m11(e[12], m[6], k3);
    NAND m12(e[11], g[0], l[0]);
    NAND m13(e[10], g[1], l[1]);
    NAND m14(e[9], m[0], k0);
    NAND m15(e[8], g[6], l[2]);
    NAND m16(e[7], g[7], l[3]);
    NAND m17(e[6], m[9], k4);
    NAND m18(e[5], m[3], k1)
    NAND m19(e[4], m[5], k2);
    NAND m20(e[3], m[7], k3)
```

```
    NAND m21(e[2], g[4], l[0]);
    NAND m22(e[1], g[5], l[1]);
    NAND m23(e[0], m[1], k0);
endmodule
/* output matrix */
module Output (e, Sb);
    input [17:0] e;
    output [7:0] Sb;
    wire E11, E10, E9, E8, E7, E6, E5, E4, E3, E2, E1, E0;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
            t11, t12, t13, t14, t15, t16;
    XOR m1(E11, e[17], e[16]);
    XOR m2(E10, e[15], e[16]);
    XOR m3(E9, e[14], e[13]);
    XOR m4(E8, e[12], e[13])
    XOR m5(E7, e[11], e[10]);
    XOR m6(E6, e[9], e[10]);
    XOR m7(E5, e[8], e[7]);
    XOR m8(E4, e[6], e[7]);
    XOR m9(E3, e[5], e[4]);
    XOR m10(E2, e[3], e[4]);
    XOR m11(E1, e[2], e[1]);
    XOR m12(EO, e[0], e[1]);
    XNOR m13(t1, E2, EO);
    XNOR m14(t2, E10, t1);
    XOR m15(t3, E6, t2);
    XOR m16(t4, E11, t3);
    XOR m17(t5, E9, t4);
    XOR m18(t6, E11, E7);
    XOR m19(t7, E5, t6);
    XNOR m20(t8, E4, E0);
    XNOR m21(t9, t5, t8);
    XOR m22(t10, t2, t9);
    XNOR m23(t11, E1, t1)
    XOR m24(t12, t7, t9);
    XNOR m25(t13, E5, t11);
    XNOR m26(t14, E8, t10);
    XNOR m27(t15, E3, t12);
    XOR m28(t16, t7, t11);
    assign Sb = {t3, t1, t15, t5, t13, t14, t8, t16};
endmodule
```

C Verilog Code for the SM4 S-box

```
/* The SM4 S-box */
module SM4 (b, Sb);
    input [7:0] b;
    output [7:0] Sb;
    wire [7:0] g;
    wire [9:0] m;
    wire [3:0] p, l;
    wire [17:0] e;
    Input M1(b, g, m);
    Top M2(g, m, p);
    Middle M3(p, l);
    Bottom M4(g, m, l, e);
    Output M5(e, Sb);
endmodule
/* Gates implemented as modules to prevent
    unintentional optimization of the DC */
module XOR (t, a, b)
    output t;
    input a, b;
    xor(t, a, b);
endmodule
module XNOR (t, a, b);
    output t;
    input a, b;
    xnor(t, a, b);
endmodule
module NAND (t, a, b);
    output t;
    input a, b;
    nand(t, a, b);
endmodule
module NOR (t, a, b);
    output t;
    input a, b;
    nor(t, a, b);
endmodule
/* input matrix */
module Input (b, g, m);
    input [7:0] b;
    output [7:0] g;
    output [9:0] m;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
                t11, t12, t13, t14, t15, t16, t17;
    XOR m1(t1, b[7], b[5]);
    XNOR m2(t2, b[5], b[1]);
    XNOR m3(t3, b[0], t2);
    XOR m4(t4, b[6], b[2]);
    XOR m5(t5, b[3], t3);
    XOR m6(t6, b[4], t1);
    XOR m7(t7, b[1], t5);
    XOR m8(t8, b[1], t4);
```

```
XOR m9(t9, t6, t8);
XOR m10(t10, t6, t7);
XNOR m11(t11, b[3], t1);
XNOR m12(t12, b[6], t9);
XOR m13(t13, t4, t10);
XOR m14(t14, t2, t11);
XOR m15(t15, t12, t14);
XOR m16(t16, t3, t12);
XOR m17(t17, t11, t16);
    assign g = {t15, t14, ~b[0], t2, t5, t13, t7, t10};
    assign m = {t12, t9, t17, b[1], t11, t4, t16, t8, t3, t6};
endmodule
/* top part: GF(2^4) multiplier and GF(2^4) square-scaler */
module Top (g, m, d);
    input [7:0] g;
    input [9:0] m;
    output [3:0] d;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13,
                t14, t15, t16, t17, t18, t19, t20, t21, t22, t23, t24;
    NAND m1(t1, g[5], g[1]);
    NAND m2(t2, m[1], m[0]);
    NAND m3(t3, g[4], g[0]);
    NAND m4(t4, g[7], g[3]);
    NAND m5(t5, m[9], m[8]);
    NOR m6(t6, g[6], g[2]);
    NOR m7(t7, g[7], g[3]);
    NOR m8(t8, m[9], m[8]);
    NOR m9(t9, m[7], m[6]);
    NAND m10(t10, m[3], m[2]);
    NAND m11(t11, m[5], m[4]);
    NOR m12(t12, m[3], m[2]);
    XOR m13(t13, t1, t2);
    XOR m14(t14, t3, t2);
    XOR m15(t15, t4, t13);
    XOR m16(t16, t5, t14);
    XOR m17(t17, t9, t10);
    XOR m18(t18, t11, t12);
    XOR m19(t19, t6, t15);
    XOR m20(t20, t7, t16);
    XOR m21(t21, t19, t17);
    XOR m22(t22, t20, t18);
    XOR m23(t23, t8, t15);
    XOR m24(t24, t6, t16);
    assign d[3] = t21;
    assign d[2] = t22;
    assign d[1] = t23;
    assign d[0] = t24;
endmodule
/* middle part: GF(2^4) inverse */
module Middle (p, l);
    input [3:0] p;
    output [3:0] 1;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
```

t11, t12, t13, t14, t15;

```
NAND m1(t1, p[3], p[0]);
NOR m2(t2, t1, p[2]);
NAND m3(t3, p[2], p[0]);
XOR m4(t4, p[1], t3);
NOR m5(t5, p[2], t4);
NAND m6(t6, p[1], t4);
NOR m7(t7, p[3], t4);
NOR m8(t8, t7, t2);
XNOR m9(t9, t5, t7);
XNOR m10(t10, t9, p[3]);
NAND m11(t11, t6, t8);
NAND m12(t12, t8, p[1]);
XNOR m13(t13, p[0], t12);
NAND m14(t14, t1, p[2]);
NAND m15(t15, t9, t14);
assign l[3] = t13;
assign l[2] = t11;
assign l[1] = t15;
assign l[0] = t10;
```

endmodule
/* bottom part: GF (2^4) multipliers */
module Bottom (g, m, l, e);
input [7:0] g;
input [9:0] m;
input [3:0] 1;
output [17:0] e;
wire k4, k3, k2, k1, k0;
XOR m1 (k4, l[3], l[2]);
XOR m2 (k3, l[3], l[1]);
XOR m3(k2, l[2], l[0]);
XOR m4 (k1, k3, k2);
XOR m5(k0, l[1], l[0]);
NAND m6(e[17], g[2], l[2]);
NAND m7 (e[16], g[3], l[3]);
NAND m8(e[15], m[8], k4)
NAND m9(e[14], m[2], k1);
NAND m10(e[13], m[4], k2);
NAND m11(e[12], m[6], k3);
NAND m12(e[11], g[0], l[0]);
NAND m13(e[10], g[1], l[1]);
NAND m14 (e[9], m[0], k0);
NAND m15(e[8], g[6], l[2]);
NAND m16(e[7], g[7], l[3]);
NAND m17(e[6], m[9], k4);
NAND m18(e[5], m[3], k1);
NAND m19 (e[4], m[5], k2)
NAND m20(e[3], m[7], k3);
NAND m21(e[2], g[4], l[0]);
NAND m22(e[1], g[5], l[1]);

```
    NAND m23(e[0], m[1], k0);
endmodule
/* output matrix */
module Output (e, Sb);
    input [17:0] e;
    output [7:0] Sb;
    wire E11, E10, E9, E8, E7, E6, E5, E4, E3, E2, E1, E0;
    wire t1, t2, t3, t4, t5, t6, t7, t8, t9, t10,
        t11, t12, t13, t14, t15, t16;
    XOR m1(E11, e[17], e[16]);
    XOR m2(E10, e[15], e[16]);
    XOR m3(E9, e[14], e[13]);
    XOR m4(E8, e[12], e[13]);
    XOR m5(E7, e[11], e[10]);
    XOR m6(E6, e[9], e[10]);
    XOR m7(E5, e[8], e[7]);
    XOR m8(E4, e[6], e[7]);
    XOR m9(E3, e[5], e[4]);
    XOR m10(E2, e[3], e[4]);
    XOR m11(E1, e[2], e[1]);
    XOR m12(E0, e[0], e[1]);
    XOR m13(t1, E9, E7);
    XOR m14(t2, E1, t1);
    XOR m15(t3, E3, t2);
    XOR m16(t4, E5, E3);
    XOR m17(t5, E4, t4);
    XOR m18(t6, E4, E0);
    XOR m19(t7, E11, E7);
    XOR m20(t8, t1, t4);
    XOR m21(t9, t1, t6);
    XOR m22(t10, E2, t5);
    XOR m23(t11, E10, E8);
    XNOR m24(t12, t3, t11);
    XOR m25(t13, t10, t12);
    XNOR m26(t14, t3, t7);
    XNOR m27(t15, E10, E6);
    XOR m28(t16, t6, t14);
    assign Sb = {t15, t13, t8, t14, t11, t9, t12, t16};
endmodule
```


[^0]:    * The corresponding author

    1. State Key Laboratory of Information Security, Institute of Information Engineering, CAS
    2. Data Assurance and Communication Security Research Center, CAS
    3. School of Cyber Security, University of Chinese Academy of Sciences
    4. Department of Mathematics and Computer Science, University of Southern Denmark
    5. Information Technology Laboratory, NIST, USA
[^1]:    ${ }^{1}$ There are $2^{2}$ choices for $T, 2^{2}$ choices for $N, 2^{4}$ choices for $\tau, 2^{4}$ choices for $\nu$, and 2 choices for each of $W, Z$ and $Y$ whenever $T, N, \tau, \nu$ are fixed.

[^2]:    ${ }^{2}$ It costs about 38 minutes to obtain this 15 -gate circuit on a PC. But we cannot confirm whether there is a 14 -gate circuit, since we terminate the SAT solver after about two weeks' computation

[^3]:    ${ }^{3}$ It costs about 25 days on a PC to produce this result.

[^4]:    ${ }^{4}$ It costs about 30 days on a PC to produce this circuit.

[^5]:    ${ }^{5}$ We do not have access to the STM 65 nm technology library. However, the authors of [26] provide sufficient area information for the gates involved in this particular library, based on which we can extrapolate the results without any difficulty.

