# FPGA and ASIC Implementations of the $\eta_{T}$ Pairing in Characteristic Three 

Jean-Luc Beuchat, Hiroshi Doi, Kaoru Fujita, Atsuo Inomata, Piseth Ith, Akira Kanaoka, Masayoshi Katouno, Masahiro Mambo, Eiji Okamoto, Takeshi Okamoto, Takaaki Shiga, Masaaki Shirase, Ryuji Soga, Tsuyoshi<br>Takagi, Ananda Vithanage, and Hiroyasu Yamamoto


#### Abstract

Since their introduction in constructive cryptographic applications, pairings over (hyper)elliptic curves are at the heart of an ever increasing number of protocols. As they rely critically on efficient implementations of pairing primitives, the study of hardware accelerators has become an active research area.

In this paper, we propose two coprocessors for the reduced $\eta_{T}$ pairing introduced by Barreto et al. as an alternative means of computing the Tate pairing on supersingular elliptic curves. We prototyped our architectures on FPGAs. According to our place-and-route results, our coprocessors compare favorably with other solutions described in the open literature. We also present the first ASIC implementation of the reduced $\eta_{T}$ pairing.


Index Terms-Tate pairing, $\eta_{T}$ pairing, elliptic curve cryptography, finite field arithmetic, hardware accelerator.

## I. Introduction

In the mid-nineties, Menezes, Okamoto \& Vanstone [36] and Frey \& Rück [18] introduced the Weil and Tate pairings in cryptography as a tool to attack the discrete logarithm problem on some classes of elliptic curves defined over finite fields. A few years later, Mitsunari, Sakai \& Kasahara [39], Sakai, Oghishi \& Kasahara [44], and Joux [28] discovered constructive properties of pairings. Their respective works initiated an extensive study of pairing-based cryptography, and an ever increasing number of protocols based on the Weil or the Tate pairing have appeared in the literature: identity-based encryption [11], short signature [13], and efficient broadcast encryption [12] to mention but a few. As noticed by Dutta,

[^0]Barua \& Sarkar [14], such protocols rely critically on efficient algorithms and implementations of pairing primitives.

According to [22], [32], when dealing with general curves providing common levels of security, the Tate pairing seems to be more efficiently computable than the Weil pairing. In 1986, Miller described the first iterative algorithm to compute the Tate pairing [37], [38]. Significant improvements were independently proposed by Barreto et al. [4] and Galbraith et al. [19] in 2002. One year later, Duursma \& Lee gave a closed formula in the case of characteristic three [15]. In 2004, Barreto et al. [3] introduced the $\eta_{T}$ approach, which further shortens the loop of Miller's algorithm.
This paper describes the design of two hardware accelerators for the $\eta_{T}$ pairing in characteristic three. Section II provides the reader with a brief overview of pairing computation. As detailed in that section, the considered pairing algorithm relies heavily on arithmetic over $\mathbb{F}_{36 m}$, a degree6 extension of the base field of the curve. However, thanks to a tower field representation, all operations over $\mathbb{F}_{3}{ }^{6 m}$ can be replaced by arithmetic over $\mathbb{F}_{3^{m}}$. We describe hardware arithmetic operators over $\mathbb{F}_{3^{m}}$ and explain how to take advantage of the tower field in Section III. We then propose two hardware accelerators for the $\eta_{T}$ pairing (Section IV). We have prototyped our architectures on FPGA, and propose the first ASIC implementation of the $\eta_{T}$ pairing in characteristic three. Section V summarizes our implementation results on FPGA and ASIC, and provides the reader with a comprehensive comparison with previously published architectures.

## II. Computation of the Modified Tate Pairing in Characteristic Three

Given a positive integer $m$ coprime to 6 , we consider a supersingular ${ }^{1}$ elliptic curve $E$ over $\mathbb{F}_{3^{m}}$, defined by the equation $y^{2}=x^{3}-x+b$, with $b \in\{-1,1\}$. According to [3], there is no loss of generality from considering this case since these curves offer the same level of security for pairing applications as any supersingular elliptic curve over $\mathbb{F}_{3^{m}}$. The number $N$ of rational points of $E$ over the finite field $\mathbb{F}_{3^{m}}$ is given by $N=\# E\left(\mathbb{F}_{3^{m}}\right)=3^{m}+1+\mu b 3^{\frac{m+1}{2}}$, with

$$
\mu=\left\{\begin{array}{ll}
+1 & \text { if } m \equiv 1,11 \\
-1 & \text { if } m \equiv 5,7
\end{array} \quad(\bmod 12), \text { or } 12\right) . ~ \$
$$

[^1]
## A. Modified Tate Pairing

Let $\ell$ be the largest prime factor of $N . E\left(\mathbb{F}_{3^{m}}\right)[\ell]$ denotes the $\ell$-torsion subgroup of $E\left(\mathbb{F}_{3^{m}}\right)$, i.e. the set of points $P \in$ $E\left(\mathbb{F}_{3^{m}}\right)$ such that $[\ell] P=\mathcal{O}$, where $\mathcal{O}$ is the point at infinity of the elliptic curve $E$. The modified Tate pairing is a function that takes as input two points of $E\left(\mathbb{F}_{3^{m}}\right)[\ell]$ and outputs an element of the group of $\ell$ th roots of unity $\mu_{\ell}=\left\{R \in \mathbb{F}_{3^{k m}}^{*}\right.$ : $\left.R^{\ell}=1\right\}$.

The embedding degree or security multiplier is the least positive integer $k$ for which $\mu_{\ell}$ is contained in the multiplicative group $\mathbb{F}_{3^{k m}}^{*}$ (i.e. $k$ is the smallest integer such that $\ell$ divides $3^{k m}-1$ ). The considered curve has an embedding degree of $k=6$, which is the maximum value possible for supersingular elliptic curves, and hence seems to be an attractive choice for pairing implementation.

The modified Tate pairing of order $\ell$ is then the map

$$
\hat{e}(\cdot, \cdot): E\left(\mathbb{F}_{3^{m}}\right)[\ell] \times E\left(\mathbb{F}_{3^{m}}\right)[\ell] \rightarrow \mathbb{F}_{3^{6 m}}^{*}
$$

given by

$$
\hat{e}(P, Q)=f_{\ell, P}(\psi(Q))^{\left(3^{6 m}-1\right) / \ell}
$$

where

- $\psi$ is a distortion map (the concept of a distortion map was introduced in [48]) from $E\left(\mathbb{F}_{3^{m}}\right)[\ell]$ to $E\left(\mathbb{F}_{3^{6 m}}\right)[\ell] \backslash$ $E\left(\mathbb{F}_{3^{m}}\right)[\ell]$ defined as $\psi\left(x_{Q}, y_{Q}\right)=\left(\rho-x_{Q}, y \sigma\right)$ for all $Q=\left(x_{Q}, y_{Q}\right) \in E\left(\mathbb{F}_{3^{m}}\right)[\ell]$, where $\rho$ and $\sigma$ are elements of $\mathbb{F}_{3^{6 m}}$ satisfying the equations $\rho^{3}-\rho-b=0$ and $\sigma^{2}+1=0$ [4]. Note that $\left\{1, \sigma, \rho, \sigma \rho, \rho^{2}, \sigma \rho^{2}\right\}$ is a basis of $\mathbb{F}_{3^{6 m}}$ over $\mathbb{F}_{3^{m}}$. We will therefore represent an element $R \in \mathbb{F}_{36 m}$ as $R=r_{0}+r_{1} \sigma+r_{2} \rho+r_{3} \sigma \rho+r_{4} \rho^{2}+r_{5} \sigma \rho^{2}$, where the $r_{i}$ 's belong to $\mathbb{F}_{3^{m}}$.
- $f_{n, P}$, for $n \in \mathbb{N}$ and $P \in E\left(\mathbb{F}_{3^{m}}\right)[\ell]$ is a rational function defined over $E\left(\mathbb{F}_{3^{6 m}}\right)[\ell]$ with divisor $\left(f_{n, P}\right)=n(P)-$ $([n] P)-(n-1)(\mathcal{O})$ (see [46] or [49] for an account of divisors). We consider here the definition proposed by Barreto et al. [4], where $f_{n, P}$ is evaluated on a point rather than on a divisor.
- $f_{\ell, P}(\psi(Q))$ is only defined up to $\ell$ th powers, which is undesirable in most of the cryptographic applications. The powering by $\left(3^{6 m}-1\right) / \ell$, referred to as final exponentiation, allows one to obtain a unique value in a multiplicative subgroup of $\mathbb{F}_{36 m}^{*}$.
Choosing an order of low Hamming weight provides computational savings in Miller's algorithm. However, $\ell$ being a quotient of $N$ by a small cofactor, it does not have a small Hamming weight. Galbraith et al. [19] noted that one can compute the modified Tate pairing of order $\ell$ with respect to the group order $N$ (note that $N$ divides $3^{6 m}-1$ ):

$$
f_{\ell, P}(\psi(Q))^{\left(3^{6 m}-1\right) / \ell}=f_{N, P}(\psi(Q))^{\left(3^{6 m}-1\right) / N}
$$

In the following, $M$ denotes the final exponent of the modified Tate pairing of order $N$ :
$M=\frac{3^{6 m}-1}{N}=\left(3^{3 m}-1\right)\left(3^{m}+1\right)\left(3^{m}+1-\mu b 3^{\frac{m+1}{2}}\right)$.
The modified Tate pairing satisfies the following properties:

- Bilinearity. For all $A, B, C \in \mathbb{F}_{3^{m}}[\ell]$,

$$
\begin{aligned}
& \hat{e}(A+B, C)=\hat{e}(A, C) \hat{e}(B, C) \quad \text { and } \\
& \hat{e}(A, B+C)=\hat{e}(A, B) \hat{e}(A, C)
\end{aligned}
$$

- Non-degeneracy. $\hat{e}(P, P) \neq 1$, for all $P \neq \mathcal{O}$.
- Computability. $\hat{e}$ can be efficiently computed.


## B. The Duursma-Lee Approach

Duursma \& Lee [15] proposed to compute the order $3^{3 m}+1$ modified Tate pairing. This approach simplifies both Miller's algorithm and the final exponentiation ${ }^{2}$. Furthermore, Duursma \& Lee showed that the number of iterations of Miller's algorithm can be reduced from $3 m$ to $m$ iterations [15].

## C. The $\eta_{T}$ Approach

Barreto et al. [3] introduced the $\eta_{T}$ pairing as "an alternative means of computing the Tate pairing on certain supersingluar curves" [40, page 108]. They suggest to compute $\hat{e}(P, Q)$ using an order $T \in \mathbb{Z}$ that is smaller than $N$. Their main result is a lemma which gives a method to select $T$ such that $\eta_{T}(P, Q)^{M}$ is a non-degenerate bilinear pairing [3]. In characteristic three they choose $T=3^{m}-N=-\mu b 3^{\frac{m+1}{2}}-1$ and show that their method gives a further halving of the length of the loop compared to the Duursma \& Lee approach. The $\eta_{T}$ pairing is defined as follows:

$$
\eta_{T}(P, Q)=\left\{\begin{align*}
f_{T, P}(\psi(Q)) & \text { if } T>0, \text { or }  \tag{1}\\
f_{-T,-P}(\psi(Q)) & \text { if } T<0
\end{align*}\right.
$$

Defining $T^{\prime}=-\mu b T=3^{\frac{m+1}{2}}+\mu b$ and $P^{\prime}=[-\mu b] P$, we rewrite Equation (1) as $\eta_{T}(P, Q)^{M}=f_{T^{\prime}, P^{\prime}}(\psi(Q))^{M}$. Then, the techniques proposed by Duursma \& Lee [15] allow one to simplify the computation of $f_{n, P}$ in Miller's algorithm:

$$
f_{T^{\prime}, P^{\prime}}(\psi(Q))=\left(\prod_{i=0}^{\frac{m-1}{2}} g_{\left[3^{i}\right] P^{\prime}}(\psi(Q))^{\frac{m-1}{2}-i}\right) l_{P^{\prime}}(\psi(Q))
$$

where

- $g_{V}$ is the rational function introduced by Duursma \& Lee [15], defined over $E\left(\mathbb{F}_{3^{6 m}}\right)[\ell]$, and having divisor $\left(g_{V}\right)=3(V)+([-3] V)-4(\mathcal{O})$. For all $V=\left(x_{V}, y_{V}\right) \in$ $E\left(\mathbb{F}_{3^{m}}\right)[\ell]$ and $(x, y) \in E\left(\mathbb{F}_{3}{ }^{6 m}\right)[\ell]$, it is defined as:

$$
g_{V}(x, y)=y_{V}^{3} y-\left(x_{V}^{3}-x+b\right)^{2}
$$

- $l_{V}$, for all $V=\left(x_{V}, y_{V}\right) \in E\left(\mathbb{F}_{3^{m}}\right)[\ell]$, is the equation of the line corresponding to the addition of $\left[3^{\frac{m+1}{2}}\right] V$ with $[\mu b] V$. It is defined for all $(x, y) \in E\left(\mathbb{F}_{3^{6 m}}\right)[\ell]$ :

$$
l_{V}(x, y)=y-(-1)^{\frac{m+1}{2}} y_{V}\left(x-x_{V}\right)-\mu b y_{V}
$$

As pointed out by Barreto et al. [3], the computation of $f_{T^{\prime}, P^{\prime}}(\psi(Q))$ requires cubings over $\mathbb{F}_{3^{6 m}}$ because of the exponent $3^{\frac{m-1}{2}-i}$ inside the main product. They suggested to bring the powering into the formulae as a Frobenius action, or to compute the product in reverse. Both approaches allow one to replace two cubings over $\mathbb{F}_{3^{m}}$ and one cubing over

[^2]$\mathbb{F}_{3^{6 m}}$ by two cube roots over $\mathbb{F}_{3^{m}}$ at each iteration. However, the second one turns out to be slightly more effective since it also saves three multiplications over $\mathbb{F}_{3^{m}}$ when multiplying by $l_{P^{\prime}}(\psi(Q))$ (see [7] for further details). Note that the DuursmaLee algorithm also comes in two flavors: the original one involves cube roots and Kwon proposed a cube root-free version in [34].

```
Algorithm 1 Cube-root-free reversed-loop algorithm for com-
puting the \(\eta_{T}\) pairing [7].
Input: \(P, Q \in E\left(\mathbb{F}_{3^{m}}\right)[\ell]\). The algorithm involves a local
    variable \(t \in \mathbb{F}_{3^{m}}\), and two local variables \(R\) and \(S \in \mathbb{F}_{3^{6 m}}\).
Output: \(\eta_{T}(P, Q)^{M} \in \mathbb{F}_{3^{6 m}}^{*}\).
    \(x_{P} \leftarrow x_{P}+b ;\)
    \(y_{P} \leftarrow-\mu b y_{P} ;\)
    \(x_{Q} \leftarrow x_{Q}^{3} ; \quad y_{Q} \leftarrow y_{Q}^{3} ;\)
    \(t \leftarrow x_{P}+x_{Q} ;\)
    \(R \leftarrow\left(y_{P} t-y_{Q} \sigma-y_{P} \rho\right) \cdot\left(-t^{2}+y_{P} y_{Q} \sigma-t \rho-\rho^{2}\right) ;\)
    for \(j \leftarrow 1\) to \(\frac{m-1}{2}\) do
        \(R \leftarrow R^{3} ;\)
        \(x_{Q} \leftarrow x_{Q}^{9}-b ; \quad y_{Q} \leftarrow-y_{Q}^{9} ;\)
        \(t \leftarrow x_{P}+x_{Q} ; \quad u \leftarrow y_{P} y_{Q} ;\)
        \(S \leftarrow-t^{2}+u \sigma-t \rho-\rho^{2} ;\)
        \(R \leftarrow R \cdot S ;\)
    end for
    return \(R^{M}\);
```

Fong et al. showed that extracting a square root in $\mathbb{F}_{2^{m}}$ requires approximately the time of a field multiplication and proposed an improved scheme for trinomials [17]. Barreto extended this approach to cube root in characteristic three [2]: if $\mathbb{F}_{3^{m}}$ admits an irreducible trinomial $x^{m}+f_{n} x^{n}+f_{0}\left(f_{n}\right.$, $\left.f_{0} \in\{-1,1\}\right)$ with the property $n \equiv m(\bmod 3)$, then five shifts and five additions allow one to implement this operation. Nevertheless, even if computing a cube root is not a difficult operation, it requires specific hardware and a slightly more complex control and datapath. In this work, we decided to minimize the area of the Arithmetic and Logic Unit (ALU) and considered a cube root-free version of the reversed-loop approach described by Algorithm 1. Consider the operand $S \in \mathbb{F}_{3^{6 m}}$ (line 10) and note that it is sparse (i.e. some of its terms are trivial). This property will allow us to optimize the computation of $R \cdot S$ in Section III-B2.

The relationship between the modified Tate pairing and the reduced $\eta_{T}$ pairing is given by [6]:

$$
\hat{e}(P, Q)^{M}=\eta_{T}\left([-\mu b] P,\left[3^{\frac{3 m-1}{2}}\right] Q\right)^{M}
$$

where $[-\mu b] P=\left(x_{P},-\mu b y_{P}\right)$ and $\left[3^{\frac{3 m-1}{2}}\right] Q=$ $\left(\sqrt[3]{x_{Q}}-b,(-1)^{\frac{m+1}{2}} \sqrt[3]{y_{Q}}\right)$. We can modify Algorithm 1 as follows to obtain $\hat{e}(P, Q)^{M}$ :

- Since we compute the pairing with $\left(x_{p},-\mu b y_{P}\right)$, line 2 becomes $y_{p} \leftarrow-\mu b \cdot\left(-\mu b y_{P}\right)=y_{p}$ and can be discarded.
- It is no longer necessary to compute the cube of $x_{Q}$ and $y_{Q}$ (line 3). We have now $x_{Q} \leftarrow x_{Q}-b$.
- Let $x_{P}^{\prime}=x_{P}+b$ and $x_{Q}^{\prime}=x_{Q}-b$. Since $t=x_{P}^{\prime}+x_{Q}^{\prime}=$ $x_{P}+x_{Q}$ (line 4), we can actually remove lines 1 and 3 .
It is worth noticing that we obtain a cube root-free algorithm and that the modified Tate pairing requires less operations than the reduced $\eta_{T}$ pairing in this case.


## D. Final Exponentiation

Fermat's little theorem provides us with an effective way to perform the final exponentiation of the reduced $\eta_{T}$ pairing. As pointed out by Barreto et al., "the result of raising to $3^{3 m}-1$ produces an element of order $3^{3 m}+1$, so that any further inversion reduces to a simple conjugation" [3, page 248]. The main loop of Algorithm 1 returns $R=\eta_{T}(P, Q) \in \mathbb{F}_{3^{6 m}}^{*}$. Writing $R=R_{0}+R_{1} \sigma$, where $R_{0}$ and $R_{1} \in \mathbb{F}_{3^{3 m}}^{*}$, we obtain:

$$
V=R^{3^{3 m}-1}=\frac{\left(R_{0}^{2}-R_{1}^{2}\right)+R_{0} R_{1} \sigma}{R_{0}^{2}+R_{1}^{2}}
$$

Algorithm 2 summarizes the computation of the final exponentiation. When $\mu b=-1$, the computation of $W^{\prime}=W^{-\mu b}$ on line 4 is a dummy operation. Let us write $W=W_{0}+W_{1} \sigma$, where $W_{0}$ and $W_{1} \in \mathbb{F}_{33 m}^{*}$. Since $W$ is an element of order $3^{3 m}+1$ [3], the inversion is completely free when $\mu b=1$ :

$$
\begin{aligned}
W^{\prime} & =W^{-1}=W^{3^{3 m}}=\left(W_{0}+\sigma W_{1}\right)^{3^{3 m}} \\
& =W_{0}^{3^{3 m}}+\sigma^{3^{3 m}} W_{1}^{3^{3 m}}=W_{0}-\sigma W_{1} .
\end{aligned}
$$

It suffices to propagate the sign corrections in the product $V \cdot W^{\prime}$. Whereas the computation of $\eta_{T}(P, Q)$ involves only sparse multiplications over $\mathbb{F}_{36 m}$ (Algorithm 1, line 11), the final exponentiation requires a full multiplication over $\mathbb{F}_{3}{ }^{6 m}$ (Algorithm 2, line 6). Note that the computation of $V$ and $W$ involves only operations over $\mathbb{F}_{3}{ }^{3 m}$. Algorithms to compute $R^{3^{3 m}-1}$ and $V^{3^{m}+1}$ are for instance detailed in [7].

```
Algorithm 2 Final exponentiation of the reduced \(\eta_{T}\) pairing.
Input: \(R=\eta_{T}(P, Q) \in \mathbb{F}_{3^{6 m}}^{*}\).
Output: \(R^{M} \in \mathbb{F}_{3^{6 m}}^{*}\).
    \(V \leftarrow R^{3^{3 m}-1} ;\)
    \(V \leftarrow V^{3^{m}+1}\);
    \(W \leftarrow V^{3^{\frac{m+1}{2}}} ;\)
    \(W^{\prime} \leftarrow W^{-\mu b}\);
    \(V \leftarrow V^{3^{m}+1}\)
    return \(V \cdot W^{\prime}\);
```


## III. ARITHMETIC OVER $\mathbb{F}_{3^{m}}$ AND $\mathbb{F}_{3^{6 m}}$

Thanks to the tower field representation, all operations over $\mathbb{F}_{3^{6 m}}$ and $\mathbb{F}_{3^{3 m}}$ in Algorithms 1 and 2 can be replaced by arithmetic over $\mathbb{F}_{3^{m}}$. For instance, 12 multiplications, 11 additions, and a single inversion over $\mathbb{F}_{3^{m}}$ allow one to carry out the inversion over $\mathbb{F}_{3^{3 m}}$ involved in the computation of $V=R^{3^{3 m}-1}$. We describe here the hardware operators we designed for arithmetic over $\mathbb{F}_{3^{m}}$ (Section III-A) and the algorithms for sparse multiplication and cubing over $\mathbb{F}_{3^{6 m}}$ (Section III-B). We refer the reader to [7] for further details about other operations.

## A. Arithmetic over $\mathbb{F}_{3^{m}}$

In the following, elements of $\mathbb{F}_{3^{m}}$ are encoded using a polynomial basis. Given a degree- $m$ irreducible polynomial $f(x) \in \mathbb{F}_{3}[x]$, we have $\mathbb{F}_{3^{m}} \cong \mathbb{F}_{3}[x] /(f(x))$. Consequently, each element of $\mathbb{F}_{3^{m}}$ is represented as a polynomial of degree less than $m$ with coefficients in $\mathbb{F}_{3}$.

1) Addition and Subtraction over $\mathbb{F}_{3^{m}}$ : Since they are performed component-wise, addition and subtraction over $\mathbb{F}_{3^{m}}$ are rather straightforward operations. Each element of $\mathbb{F}_{3}$ being encoded by two bits, the addition of $a_{i}$ and $b_{i} \in \mathbb{F}_{3}$ on most of Altera or Xilinx FPGAs requires two 4 -input LUTs.
2) Multiplication over $\mathbb{F}_{3^{m}}$ : Among the many modular multipliers described in the open literature (see for instance [9], [16], [24]), we selected a Most Significant Element (MSE) first array multiplier based on Song \& Parhi's work [47] to carry out $a(x) b(x) \bmod f(x)$. At step $i$ we compute a degree- $(m+D-2)$ polynomial $t(x)$ which is the sum of $D$ partial products: $t(x)=\sum_{j=0}^{D-1} a_{D i+j} x^{j} b(x)$. A degree-$(m+D-1)$ polynomial $s(x)$, updated according to the celebrated Horner's rule, allows us to accumulate the partial products:

$$
s(x) \leftarrow t(x)+x^{D} \cdot(s(x) \bmod f(x))
$$

Thus, after $\lceil m / D\rceil$ steps, this algorithm returns a degree-$(m+D-1)$ polynomial $s(x)$ which is congruent to $a(x) b(x)$ modulo $f(x)$. The circuit described by Song \& Parhi requires dedicated hardware to compute $p(x)=s(x) \bmod f(x)$ [47]. We suggest to achieve the final modulo $f(x)$ reduction by performing an additional iteration with $a_{-j}=0,1 \leq j \leq D$. Since $t(x)$ is now equal to zero, we have: $s(x)=x^{D}$. $(a(x) b(x) \bmod f(x))$. Therefore, it suffices to consider the $m$ most significant coefficients of $s(x)$ to get the result (i.e. $\left.p(x)=s(x) / x^{D}\right)$. Algorithm 3 summarizes this multiplication scheme. Figure 1 describes the architecture of an array multiplier processing $D=3$ coefficients at each clock cycle.

```
Algorithm 3 MSE multiplication over \(\mathbb{F}_{3^{m}}\).
Input: A degree- \(m\) irreducible monic polynomial \(f(x)=\)
    \(x^{m}+f_{m-1} x^{m-1}+\ldots+f_{1} x+f_{0}\), two degree- \((m-1)\)
    polynomials \(a(x)\), and \(b(x)\). We assume that \(a_{-j}=0\),
    \(1 \leq j \leq D\). The algorithm requires a degree- \((m+D-1)\)
    polynomial \(s(x)\) as well as a degree- \((m+D-2)\) poly-
    nomial \(t(x)\) for intermediate computations.
Output: \(p(x)=a(x) b(x) \bmod f(x)\).
    \(s(x) \leftarrow 0 ;\)
    for \(i\) in \(\lceil m / D\rceil-1\) downto -1 do
        \(t(x) \leftarrow \sum_{j=0}^{D-1} a_{D i+j} x^{j} b(x) ;\)
        \(s(x) \leftarrow t(x)+x^{D} \cdot(s(x) \bmod f(x)) ;\)
    end for
    \(p(x) \leftarrow s(x) / x^{D} ;\)
```

The cost of the modular reduction (line 4) depends on $D$ and $f(x)$. Assume that $f(x)$ is an irreducible trinomial such


Fig. 1. MSE array multipliers processing $D=3$ coefficients at each clock cycle. Boxes with rounded corners involve only wiring.
that $f(x)=x^{m}+f_{n} x^{n}+f_{0}$, where $f_{0}$ and $f_{n} \in \mathbb{F}_{3}$, and $0<n<m$. We have:
$s(x) \bmod f(x)=\left(\sum_{i=0}^{D-1} s_{m+i} x^{m+i}+\sum_{i=0}^{m-1} s_{i} x^{i}\right) \bmod f(x)$.
Since $x^{m} \equiv-f_{n} x^{n}-f_{0}(\bmod f(x))$, we note that:

$$
s_{m+i} x^{m+i} \equiv s_{m+i}\left(-f_{n} x^{n}-f_{0}\right) x^{i} \quad(\bmod f(x))
$$

In the following, we assume that $D \leq m-n$ to ensure that the degree of $s_{m+i}\left(-f_{n} x^{n}-f_{0}\right) x^{i}, 0 \leq i \leq D-1$, is at most equal to $m-1$. Thus, we obtain:

$$
\begin{aligned}
s(x) \bmod f(x)= & \sum_{i=0}^{D-1} s_{m+i}\left(-f_{n} x^{n}-f_{0}\right) x^{i}+\sum_{i=0}^{m-1} s_{i} x^{i} \\
= & -\sum_{i=0}^{D-1} s_{m+i} f_{n} x^{n+i}-\sum_{i=0}^{D-1} s_{m+i} f_{0} x^{i} \\
& +\sum_{i=0}^{m-1} s_{i} x^{i}
\end{aligned}
$$

and the modular reduction involves $2 D$ additions (or subtractions) over $\mathbb{F}_{3}$. When $D \leq n$, the degree of $x^{i}, 0 \leq i \leq D-1$, is always smaller than the one of $x^{n+i}$ and the modular reduction requires a single stage of 2 -input adders (or subtracters) over $\mathbb{F}_{3}$. Thus, selecting the parameter $D$ such that $D \leq \min (n, m-n)$ allows one to achieve the shortest critical path in the case of an irreducible trinomial.

Let us consider for instance the irreducible trinomial $f(x)=$ $x^{97}+x^{12}+2$ (i.e. $m=97, n=12, f_{0}=2$, and $f_{12}=1$ ).

Since -2 is congruent to 1 modulo 3 , we have:
$s(x) \bmod f(x)=-\sum_{i=0}^{D-1} s_{97+i} x^{i+12}+\sum_{i=0}^{D-1} s_{97+i} x^{i}+\sum_{i=0}^{96} s_{i} x^{i}$. Figures 2 a and 2 b describe the circuits performing the modular reduction when $D=3$ and $D=13$, respectively. In the first case, a single stage of 2 -input adders allows one to carry out $s(x) \bmod f(x)$. However, in the second case, a 2 -input adder and a 2 -input subtracter are required to compute $s_{13}+s_{109}-$ $S_{97}$.

(a) $D=3$

(b) $D=13$

Fig. 2. Computation of $s(x) \bmod f(x)$ when $f(x)=x^{97}+x^{12}+2$ for (a) $D=3$ and (b) $D=13$.
3) Cubing over $\mathbb{F}_{3^{m}}$ : Let us now consider the computation of $b(x)=a(x)^{3}$ over $\mathbb{F}_{3^{m}}$. Cubing over $\mathbb{F}_{3^{m}}$ consists of reducing the following expression modulo $f(x)$ :

$$
b(x)=a(x)^{3}=\left(\sum_{i=0}^{m-1} a_{i} x^{3 i}\right) \bmod f(x) .
$$

A formal reduction allows us to express each coefficient $b_{i}$ of the result as a linear combination of the coefficient of $a(x)$. Therefore, a cubing operator mainly consists of a $D^{\prime}$-operand adder and some extra wiring to permute the coefficients of $a(x)$. The main challenge here is to find an irreducible polynomial minimizing $D^{\prime}$.

Let us consider again the irreducible trinomial $f(x)=x^{97}+$ $x^{12}+2$. Reducing $a(x)^{3}$ modulo $f(x)$, we obtain:

$$
\begin{array}{rlrl}
b_{0} & =a_{93}+a_{89}+a_{0}, & b_{2} & =a_{33}, \\
b_{1} & =a_{65}-a_{61}, & b_{3} & =a_{94}+a_{90}+a_{1}, \\
& b_{96} & =a_{32} .
\end{array}
$$

The most complex operation involved here is the addition of $D^{\prime}=3$ elements of $\mathbb{F}_{3}$. Since we consider a cube root-free $\eta_{T}$ pairing algorithm, $f(x)=x^{97}+x^{12}+2$ is a good candidate: it has a simple cubing formula and allows one to perform
the modulo $f(x)$ reduction involved in the multiplication over $\mathbb{F}_{3^{m}}$ by means of a single stage of 2 -input adders as long as $D \leq 12$. However, if one intends to implement a pairing algorithm with cube roots, one should consider a further constraint to select an irreducible trinomial. Barreto noticed that the cost of computing cube roots in $\mathbb{F}_{3^{m}}$ is only $O(m)$ if $m \equiv n(\bmod 3)[2]$. Despite of a slightly more complex cubing formula, $f(x)=x^{97}+x^{16}+2$ is for instance a better choice in this case.
4) Inversion over $\mathbb{F}_{3^{m}}$ : Since the computation of the reduced $\eta_{T}$ pairing involves a single inversion over $\mathbb{F}_{3^{m}}$ in the final exponentiation, we perform this operation according to Fermat's little theorem and Itoh \& Tsujii's algorithm [26]. Thus, inversion over $\mathbb{F}_{3^{m}}$ is carried out by means of cubings and multiplications over $\mathbb{F}_{3^{m}}$ and does not require specific hardware resources.

## B. Arithmetic over $\mathbb{F}_{3}{ }^{6 m}$

1) Cubing over $\mathbb{F}_{36 m}$ : When we compute the $\eta_{T}$ pairing according to Algorithm 1, we raise $R=r_{0}+r_{1} \sigma+r_{2} \rho+$ $r_{3} \sigma \rho+r_{4} \rho^{2}+r_{5} \sigma \rho^{2} \in \mathbb{F}_{3^{6 m}}$ to the cube at each iteration of the main loop. Since $\rho^{3}=\rho+b$ and $\sigma^{3}=-\sigma$, we obtain:

$$
\begin{aligned}
R^{3}= & \left(r_{0}^{3}+b r_{2}^{3}+r_{4}^{3}\right)+\left(-r_{1}^{3}-b r_{3}^{3}-r_{5}^{3}\right) \sigma \\
& +\left(r_{2}^{3}-b r_{4}^{3}\right) \rho+\left(-r_{3}^{3}+b r_{5}^{3}\right) \sigma \rho+r_{4}^{3} \rho^{2}-r_{5}^{3} \sigma \rho^{2}
\end{aligned}
$$

This operation involves six cubings and six additions (or subtractions) over $\mathbb{F}_{3^{m}}$.
2) Multiplication over $\mathbb{F}_{3^{6 m}}$ :
a) Full Multiplication over $\mathbb{F}_{3^{6 m} .:}$ Karatsuba-Ofman's algorithm allows one to compute the product of two polynomials belonging to $\mathbb{F}_{3^{6 m}}$ by means of 18 multiplications and 58 additions (or subtractions) over $\mathbb{F}_{3^{m}}$ (see for instance [31]). An improvement was recently proposed by Gorla et al. [20]: they represented elements of $\mathbb{F}_{36 m}$ as degree- 2 polynomials with coefficients in $\mathbb{F}_{3^{2 m}}$ and took advantage of Lagrange interpolation to compute a product over $\mathbb{F}_{3^{6 m}}$ by means of 5 multiplications over $\mathbb{F}_{3^{2 m}}$. Each of these multiplications is then carried out according to Karatsuba-Ofman's scheme, and the total cost of a multiplication over $\mathbb{F}_{3^{6 m}}$ is equal to 15 multiplications and 67 additions (or subtractions) over $\mathbb{F}_{3^{m}}$.
b) Sparse Multiplication over $\mathbb{F}_{3^{6 m} .:}$ Consider now the computation of the reduced $\eta_{T}$ pairing (Algorithm 1), where each iteration of the loop requires a sparse multiplication over $\mathbb{F}_{3^{6 m}}$. As pointed out by Bertoni et al. [5] and Granger et al. [23], the product $R \cdot S$ (line 11) can be computed by means of 13 multiplications and 50 additions (or subtractions) over $\mathbb{F}_{3^{m}}$ according to Karatsuba-Ofman's scheme. Again, the approach introduced by Gorla et al. allows one to further reduce the cost of this operation to 12 multiplications and 51 additions (or subtractions) over $\mathbb{F}_{3^{m}}$ (see [7] for details). Two further multiplications are needed to compute $y_{P} y_{Q}$ as well as $t^{2}$.

In this paper, we focus on parallel architectures featuring several multipliers. In this context, it seems more interesting to find a good trade-off between the number of multiplications and additions, to share registers between multipliers, and to reduce the number of accesses to memory. Let $R=r_{0}+$
$r_{1} \sigma+r_{2} \rho+r_{3} \sigma \rho+r_{4} \rho^{2}+r_{5} \sigma \rho^{2}$ and $C=c_{0}+c_{1} \sigma+c_{2} \rho+$ $c_{3} \sigma \rho+c_{4} \rho^{2}+c_{5} \sigma \rho^{2}$ be two elements of $\mathbb{F}_{3^{6 m}}$. We write each coefficient $c_{i}$ as the sum of two elements $c_{i}^{(0)}$ and $c_{i}^{(1)} \in \mathbb{F}_{3^{m}}$. Thanks to this notation we define the product $C=R \cdot\left(-t^{2}+\right.$ $\left.y_{P} y_{Q} \sigma-t \rho-\rho^{2}\right)$ as follows, where $b \in\{-1,1\}$ is a parameter of the elliptic curve:

$$
\begin{array}{ll}
c_{0}^{(0)}=-b r_{4} t-b r_{2}, & c_{0}^{(1)}=-r_{0} t^{2}-r_{1} y_{P} y_{Q} \\
c_{1}^{(0)}=-b r_{5} t-b r_{3}, & c_{1}^{(1)}=r_{0} y_{P} y_{Q}-r_{1} t^{2} \\
c_{2}^{(0)}=-r_{0} t-b r_{4}+b c_{0}^{(0)}, & c_{2}^{(1)}=-r_{2} t^{2}-r_{3} y_{P} y_{Q} \\
c_{3}^{(0)}=-r_{1} t-b r_{5}+b c_{1}^{(0)}, & c_{3}^{(1)}=r_{2} y_{P} y_{Q}-r_{3} t^{2} \\
c_{4}^{(0)}=-r_{2} t-r_{0}-r_{4}, & c_{4}^{(1)}=-r_{4} t^{2}-r_{5} y_{P} y_{Q} \\
c_{5}^{(0)}=-r_{3} t-r_{1}-r_{5}, & c_{5}^{(1)}=r_{4} y_{P} y_{Q}-r_{5} t^{2}
\end{array}
$$

Note that the computation of the $c_{i}^{(0)}$,s, $0 \leq i \leq 5$, requires six multiplications over $\mathbb{F}_{3^{m}}$ and depends neither on $t^{2}$ nor on $y_{P} y_{Q}$. Thus, we can perform eight multiplications over $\mathbb{F}_{3^{m}}$ in parallel $\left(t^{2}, y_{P} y_{Q}\right.$, and $\left.r_{i} t, 0 \leq i \leq 5\right)$. Consider now $c_{0}^{(1)}$ and $c_{1}^{(1)}$ and assume that $\left(r_{0}+r_{1}\right)$ and $\left(y_{P} y_{Q}-t^{2}\right)$ are stored in registers. Karatsuba-Ofman's algorithm allows one to compute $c_{0}^{(1)}$ and $c_{1}^{(1)}$ by means of three multiplications and three additions over $\mathbb{F}_{3^{m}}$ :

$$
\begin{aligned}
c_{0}^{(1)} & =-r_{0} t^{2}-r_{1} y_{P} y_{Q} \\
c_{1}^{(1)} & =\left(r_{0}+r_{1}\right)\left(y_{P} y_{Q}-t^{2}\right)+r_{0} t^{2}-r_{1} y_{P} y_{Q} \\
& =r_{0} y_{P} y_{Q}-r_{1} t^{2}
\end{aligned}
$$

Therefore, the computation of the $c_{i}^{(1)}$, s involves nine multiplications over $\mathbb{F}_{3^{m}}$, which can be carried out in parallel. Algorithm 4 summarizes this multiplication scheme involving 17 multiplications and 29 additions (or subtractions) over $\mathbb{F}_{3^{m}}$.

Since the computation of the nine products $p_{i}, 8 \leq i \leq 16$, depends on $p_{6}$ and $p_{7}$, we can not perform the 17 multiplications over $\mathbb{F}_{3^{m}}$ in parallel and have to proceed in two steps (Algorithm 4, lines 1 and 3). Therefore, we suggest to design a coprocessor embedding nine multipliers over $\mathbb{F}_{3^{m}}$, denoted by $M_{i}, 0 \leq i \leq 8$, in the following. A control unit will contain the instructions required to implement the sparse multiplication over $\mathbb{F}_{3^{6 m}}$ on such an architecture.

A careful scheduling allows one to share operands between up to three multipliers, thus saving hardware resources (Table I): during the first step ( 9 multiplications over $\mathbb{F}_{3^{m}}$ ), $M_{0}$, $M_{1}$, and $M_{2}$ respectively compute $r_{0} t, r_{2} t$, and $r_{4} t$. The MSE multiplier described in Section III-A2 stores its first operand in a shift register, and its second operand in a standard register. Since a shift register is more complex (an operand is loaded in parallel, and then shifted), we load the common operand $t$ in this component. At the end of these multiplications, the three registers still contain $r_{0}, r_{2}$, and $r_{4}$. Therefore it suffices to load $t^{2}$ in the shift register before starting the second step ( 9 multiplications over $\mathbb{F}_{3^{m}}$ ). Figure 3a describes the operator we designed to perform three multiplications with a common operand. The same architecture allows for computing $r_{1} t$, $r_{3} t, r_{5} t, r_{1} y_{P} y_{Q}, r_{3} y_{P} y_{Q}$, and $r_{5} y_{P} y_{Q}$. The five remaining multiplications involve a slightly more complex component (Figure 3b): two shift registers are required to compute $t^{2}$
and $y_{P} y_{Q}$ since there is no common operand. At the end of the first multiplication cycle, a dedicated subtracter computes $y_{P} y_{Q}-t^{2}$ and stores the result in the shift registers.

TABLE I
Sparse multiplication over $\mathbb{F}_{3}$ 苼: SChEDULING.

|  | 1st step: 8 multiplications <br> over $\mathbb{F}_{\mathbf{3}^{\mathbf{m}}}$ | 2nd step: 9 multiplications <br> over $\mathbb{F}_{\mathbf{3}^{\mathbf{m}}}$ |
| :--- | :---: | :---: |
| $\mathrm{M}_{0}$ | $p_{0}=r_{0} \cdot t$ | $p_{8}=r_{0} \cdot t^{2}$ |
| $\mathbf{M}_{1}$ | $p_{2}=r_{2} \cdot t$ | $p_{11}=r_{2} \cdot t^{2}$ |
| $\mathbf{M}_{2}$ | $p_{4}=r_{4} \cdot t$ | $p_{14}=r_{4} \cdot t^{2}$ |
| $\mathrm{M}_{3}$ | $p_{1}=r_{1} \cdot t$ | $p_{9}=r_{1} \cdot y_{P} y_{Q}$ |
| $\mathbf{M}_{4}$ | $p_{3}=r_{3} \cdot t$ | $p_{12}=r_{3} \cdot y_{P} y_{Q}$ |
| $\mathbf{M}_{5}$ | $p_{5}=r_{5} \cdot t$ | $p_{15}=r_{5} \cdot y_{P} y_{Q}$ |
| $\mathbf{M}_{6}$ | $p_{6}=t \cdot t$ | $p_{10}=\left(r_{0}+r_{1}\right) \cdot\left(y_{P} y_{Q}-t^{2}\right)$ |
| $\mathbf{M}_{7}$ | $p_{7}=y_{P} \cdot y_{Q}$ | $p_{13}=\left(r_{2}+r_{3}\right) \cdot\left(y_{P} y_{Q}-t^{2}\right)$ |
| $\mathbf{M}_{8}$ | - | $p_{16}=\left(r_{4}+r_{5}\right) \cdot\left(y_{P} y_{Q}-t^{2}\right)$ |

Consider the additions occurring in the fourth step of Algorithm 4. Interestingly enough, they involve at most one result of each block of three multipliers (Figure 3). Instead of a large multiplexer selecting the output of one multiplier among nine, we include a multiplexer in each block and connect a 3 -operand adder to the outputs of our multiplication units. In order to also take advantage of these adders while performing a multiplication, each block of three multipliers has an additional input D1 that allows for bypassing the multipliers.

## IV. Hardware Implementation

In this section, we propose two architectures to compute the reduced $\eta_{T}$ pairing for the field $\mathbb{F}_{3}[x] /\left(x^{97}+x^{12}+2\right)$ and the curve $y^{2}=x^{3}-x+1$ (i.e. $b=1$ ). This choice of parameters allows us to easily compare our work against the many pairing accelerators for $m=97$ described in the open literature. It is nonetheless important to note that the architectures and algorithms presented here can be easily adapted to different parameters.

## A. Hardware Accelerator for the Reduced $\eta_{T}$ Pairing

Figure 4 describes the architecture of our hardware accelerator for the $\eta_{T}$ pairing calculation (Algorithm 1). The ALU and the datapath are strongly related to the pairing algorithm and our sparse multiplication over $\mathbb{F}_{3^{m}}$ scheme. Nine multipliers over $\mathbb{F}_{3^{m}}$ sharing shift registers allow us to carry out the products $p_{i}, 0 \leq i \leq 16$, of our sparse multiplication scheme (Algorithm 4) in two steps, according to the scheduling summarized in Table I. The 3 -operand adder/subtracter allows for computing the $c_{i}$ 's. Recall that we raise the result of a sparse multiplication to the cube at the beginning of each iteration of the considered $\eta_{T}$ pairing algorithm. This operation consists of six cubings and six additions over $\mathbb{F}_{3^{m}}$ (Section III-B1). Therefore, we connected the output of the 3 -operand adder/subtracter to a cubing operator. This approach allows us to bypass the register file and to save clock cycles when raising to the cube over $\mathbb{F}_{3^{6 m}}$. Inputs and outputs, as well as intermediate results, are stored in a dual-ported RAM (DPRAM) implemented using embedded memory blocks available in the FPGA. The control unit mainly consists of a ROM containing the microcode of Algorithms 1

```
Algorithm 4 Sparse multiplication over \(\mathbb{F}_{3^{6 m}}\).
Input: \(R=r_{0}+r_{1} \sigma+r_{2} \rho+r_{3} \sigma \rho+r_{4} \rho^{2}+r_{5} \sigma \rho^{2} \in \mathbb{F}_{3^{6 m}} ; t, y_{P}\), and \(y_{Q} \in \mathbb{F}_{3^{m}}\); the parameter \(b \in\{-1,1\}\) of the
    supersingular elliptic curve.
Output: \(C=R \cdot\left(-t^{2}+y_{P} y_{Q} \sigma-r_{0} \rho-\rho^{2}\right)\).
```

1. Compute in parallel ( 8 multiplications and 3 additions over $\mathbb{F}_{3^{m}}$ ):

$$
\begin{array}{lll}
p_{i} \leftarrow r_{i} \cdot t, 0 \leq i \leq 5 ; & p_{6} \leftarrow t \cdot t ; & p_{7} \leftarrow y_{P} \cdot y_{Q} \\
s_{0} \leftarrow r_{0}+r_{1} ; & s_{1} \leftarrow r_{2}+r_{3} ; & s_{2} \leftarrow r_{4}+r_{5}
\end{array}
$$

2. Compute in parallel ( 7 additions over $\mathbb{F}_{3^{m}}$ ):

$$
\begin{array}{lllll}
s_{3} \leftarrow p_{7}-p_{6} ; \quad / / y_{P} y_{Q}-t^{2} & c_{2} \leftarrow b r_{4}+p_{0} ; & / / b r_{4}+r_{0} t & c_{4} \leftarrow r_{0}+p_{2} ; & / / r_{0}+r_{2} t \\
c_{0} \leftarrow b r_{2}+b p_{4} ; \quad / / b r_{2}+b r_{4} t & c_{3} \leftarrow b r_{5}+p_{1} ; & / / b r_{5}+r_{1} t & c_{5} \leftarrow r_{1}+p_{3} ; & / / r_{1}+r_{3} t \\
c_{1} \leftarrow b r_{3}+b p_{5} ; \quad / / b r_{3}+b r_{5} t & & &
\end{array}
$$

3. Compute in parallel ( 9 multiplications and 4 additions over $\mathbb{F}_{3^{m}}$ ):

$$
\begin{array}{rlll}
p_{8} \leftarrow r_{0} \cdot p_{6} ; & / / r_{0} t^{2} & p_{13} \leftarrow s_{1} \cdot s_{3} ; \quad / /\left(r_{2}+r_{3}\right)\left(y_{P} y_{Q}-t^{2}\right) & c_{2} \leftarrow c_{2}+b c_{0} ; \\
p_{9} \leftarrow r_{1} \cdot p_{7} ; \quad / / r_{1} y_{P} y_{Q} & p_{14} \leftarrow r_{4} \cdot p_{6} ; / / r_{4} t^{2} & c_{3} \leftarrow c_{3}+b c_{1} ; \\
p_{10} \leftarrow s_{0} \cdot s_{3} ; \quad / /\left(r_{0}+r_{1}\right)\left(y_{P} y_{Q}-t^{2}\right) & p_{15} \leftarrow r_{5} \cdot p_{7} ; / / r_{5} y_{P} y_{Q} & c_{4} \leftarrow c_{4}+r_{4} ; \\
p_{11} \leftarrow r_{2} \cdot p_{6} ; \quad / / r_{2} t^{2} & p_{16} \leftarrow s_{2} \cdot s_{3} ; \quad / /\left(r_{4}+r_{5}\right)\left(y_{P} y_{Q}-t^{2}\right) & c_{5} \leftarrow c_{5}+r_{5} ;
\end{array}
$$

$$
p_{12} \leftarrow r_{3} \cdot p_{7} ; \quad / / r_{3} y_{P} y_{Q}
$$

4. Compute in parallel ( 15 additions over $\mathbb{F}_{3^{m}}$ ):

$$
\begin{array}{rlr}
c_{0} \leftarrow-c_{0}-p_{8}-p_{9} ; & c_{2} \leftarrow-c_{2}-p_{11}-p_{12} ; & c_{4} \leftarrow-c_{4}-p_{14}-p_{15} ; \\
c_{1} \leftarrow-c_{1}+p_{10}+p_{8}-p_{9} ; & c_{3} \leftarrow-c_{3}+p_{13}+p_{11}-p_{12} ; & c_{5} \leftarrow-c_{5}+p_{16}+p_{14}-p_{15},
\end{array}
$$

and 4 . When $m=97$ and $D=3$, we need 4849 clock cycles to compute $\left.\eta_{T}(P, Q)\right)$ according to Algorithm 1.

Since algorithms for multiplication over $\mathbb{F}_{3^{3 m}}$ and $\mathbb{F}_{3^{6 m}}$ do not share operands between several multipliers, it turns out to be impossible to take advantage of the full parallelism of our architecture when performing the final exponentiation (Algorithm 2). Thus, it seems attractive to supplement the $\eta_{T}$ pairing accelerator with dedicated hardware to raise $\eta_{T}(P, Q)$ to the $M$ th power. Beuchat et al. [8] proposed a unified arithmetic operator performing addition, subtraction, accumulation, cubing, and multiplication over $\mathbb{F}_{3^{m}}$. When $m=97$ and $D=3$, this coprocessor performs the final exponentiation in 4082 clock cycles. We can therefore pipeline the computation of the $\eta_{T}$ pairing and the final exponentiation. In the following, we assume that we keep the pipeline busy and that we obtain a new result after 4849 clock cycles (i.e. we neglect the overhead introduced by our approach to get the first result). This coprocessor for the final exponentiation requires 64 registers to store elements of $\mathbb{F}_{3^{m}}$. On FPGA, they are efficiently implemented using the embedded memory blocks.

## B. A Coprocessor for Arithmetic over $\mathbb{F}_{3^{m}}$

We also investigated a second architecture based on a coprocessor for arithmetic over $\mathbb{F}_{3^{m}}$ embedding nine multipliers, an addition unit (able to carry out addition, subtraction, and accumulation), and a cubing unit (Figure 5). Since we implement the main loop of the $\eta_{T}$ pairing (Algorithm 1) and the final exponentiation (Algorithm 2) on the same hardware,
each multiplier must have two input registers and we cannot share shift registers between up to three multipliers over $\mathbb{F}_{3^{m}}$ anymore.

The sparse multiplications over $\mathbb{F}_{3^{6 m}}$ are carried out according to Algorithm 4. Since performing 15 or 18 multiplications over $\mathbb{F}_{3^{m}}$ requires the same number of clock cycles on our coprocessor, we implemented the multiplication over $\mathbb{F}_{3^{6 m}}$ of the final exponentiation according to Karatsuba-Ofman's scheme in order to minimize the number of additions over $\mathbb{F}_{3^{m}}$. When $m=97$ and $D=3$, the computation of $\eta_{T}(P, Q)$ and the final exponentiation require 6560 clock cycles and 2527 clock cycles, respectively.

This coprocessor for arithmetic over $\mathbb{F}_{3^{m}}$ is of course slower than the architecture described in the previous section when considering the computation of the $\eta_{T}$ pairing (Algorithm 1). However, it is much more versatile and allows for the implementation of a wider range of algorithms: besides pairing computation, it is for instance possible to perform a scalar multiplication, which is a crucial operation in pairing-based cryptography.

## V. Results and Comparisons

## A. FPGA Implementation

Our reduced $\eta_{T}$ pairing accelerator and the coprocessor for arithmetic over $\mathbb{F}_{3^{m}}$ were captured in the VHDL language and prototyped on Altera Cyclone II and Xilinx Virtex-II Pro FPGAs. Table II summarizes our place-and-route results.

Several processors for the reduced $\eta_{T}$ pairing (Table II) and the modified Tate pairing (Table III) have already been


Fig. 3. Building blocks for sparse multiplication over $\mathbb{F}_{36 m}$. (a) Three multipliers with a common operand. (b) Two multipliers with a common operand.


Fig. 4. Architecture of the coprocessor for the $\eta_{T}$ pairing calculation. The ALU embeds the building blocks for sparse multiplication over $\mathbb{F}_{3^{6 m}}$ described by Figure 3.
published. Since $\hat{e}(P, Q)$ can be computed from $\eta_{T}(P, Q)^{M}$ at almost no extra cost (Section II-C), we can compare our architectures against all these results. Note that the hardware accelerators proposed by other researchers are always implemented on Xilinx FPGAs. Therefore, we decided to compute the Area-Time (AT) product in terms of slices to provide the reader with a fair comparison (each slice of a Virtex-II, VirtexII Pro, or Virtex-4 embeds two 4-input function generators and


Fig. 5. Coprocessor for arithmetic over $\mathbb{F}_{3^{m}}$ amenable for pairing computation.
two storage elements). Note that register files implemented in memory blocks are not included in the AT product.

To our best knowledge, Jiang [27] designed the fastest $\eta_{T}$ pairing core (Table II). However, our processors achieves a better area-time trade-off. Additionally, our approach allows for reaching higher levels of security without risking to exhaust the FPGA resources. Jiang's coprocessor already requires one of the largest FPGAs available now.

In order to easily study the trade-off between calculation time and circuit area, Ronan et al. [41] wrote a C program which automatically generates a VHDL description of a coprocessor and its control according to the number of multipliers to be included and $D$. The ALU also embeds an adder, a subtracter, a cubing unit, and an inversion unit. Their fastest architecture embeds 8 multipliers $(D=4)$ and is very similar to the hardware accelerator for the reduced $\eta_{T}$ pairing proposed in Section IV-B. However, since our multipliers process $D=3$ coefficients at each clock cycles and the inversion over $\mathbb{F}_{3^{m}}$ is performed according to Fermat's

TABLE II
Hardware accelerators for the reduced $\eta_{T}$ Pairing (post-place-and-route figures). The parameter $D$ refers to the number of COEFFICIENTS PROCESSED AT EACH CLOCK CYCLE BY A MULTIPLIER.

|  | Curve | Technology | \# mult. | Area | Freq. <br> [ MHz ] | $\begin{gathered} \text { Calc. } \\ \text { time }[\mu \mathrm{s}] \end{gathered}$ | AT product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ronan et al. [43] | $C\left(\mathbb{F}_{2}{ }^{103}\right)$ | Virtex-II Pro 100 | $20(D=4)$ | 21021 slices | 51 | 206 | 4.33 |
|  |  |  | $20(D=8)$ | 24290 slices | 46 | 152 | 3.79 |
|  |  |  | $20(D=16)$ | 30464 slices | 41 | 132 | 4.02 |
| Ronan et al. [41] | $E\left(\mathbb{F}_{397}\right)$ | Virtex-II Pro 100 | $5(D=4)$ | 10540 slices | 84.8 | 187 | 1.97 |
|  |  |  | $8(D=4)$ | 15401 slices | 84.8 | 183 | 2.81 |
| Beuchat et al. [6] | $E\left(\mathbb{F}_{3}{ }^{97}\right)$ | Virtex-II Pro 20 | $1(D=3)$ | 1896 slices | 156 | 178 | 0.34 |
|  |  |  | $1(D=7)$ | 2711 slices | 128 | 117 | 0.32 |
|  |  |  | $1(D=15)$ | 4455 slices | 105 | 92 | 0.41 |
|  | $E\left(\mathbb{F}_{2^{239}}\right)$ | Virtex-II Pro 20 | $1(D=7)$ | 2366 slices | 199 | 196 | 0.46 |
|  |  |  | $1(D=15)$ | 2736 slices | 165 | 127 | 0.35 |
|  |  |  | $1(D=31)$ | 4557 slices | 123 | 107 | 0.49 |
| Jiang [27] | $E\left(\mathbb{F}_{397}\right)$ | Virtex-4 LX 200 | Not specified | 74105 slices | 77.7 | 20.9 | 1.55 |
| Coprocessor for the $\boldsymbol{\eta}_{\boldsymbol{T}}$ pairing \& coprocessor for the final exponentiation |  |  |  |  |  |  |  |
|  | $E\left(\mathbb{F}_{3}{ }^{97}\right)$ | Cyclone II EP2C35 | $9(D=3)$ | 18000 LEs | $149$ | $33$ | $-$ |
|  | $E\left(\mathbb{F}_{397}\right)$ | Virtex-II Pro 30 | $9(D=3)$ | 10897 slices | 147 | 33 | 0.36 |
| Coprocessor for arithmetic over $\mathbb{F}_{3}{ }^{m}$ - PairingLite |  |  |  |  |  |  |  |
| FPGA | $E\left(\mathbb{F}_{397}\right)$ | Virtex-II Pro 30 | $9(D=3)$ | 10262 slices | 142 | 64 | 0.66 |
|  |  | Cyclone II EP2C70 | $9(D=3)$ | 15293 LEs | 240 | 39.6 | - |
| ASIC | $E\left(\mathbb{F}_{397}\right)$ | $0.18 \mu \mathrm{~m}$ CMOS | $9(D=3)$ | 193765 NAND | 200 | 46.7 | - |

TABLE III
Hardware accelerators for the Tate pairing (post-place-and-route figures). The parameter $D$ refers to the number of COEFFICIENTS PROCESSED AT EACH CLOCK CYCLE BY A MULTIPLIER. THE ARCHITECTURE PROPOSED BY KÖMÜRCÜ \& SAVAS [33] DOES NOT implement the final exponentiation. Barenghi et al. [1] compute the Tate pairing over $\mathbb{F}_{p}$, Where $p$ is a 512 -bit prime number.

little theorem, we achieve a smaller area. Furthermore, thanks to our sparse multiplication algorithm, we compute the $\eta_{T}$ pairing in 6560 clock cycles, whereas Ronan et al. need 10089 clock cycles to complete the same task. They unrolled the exponent $M$ and grouped the inversions together. Their final exponentiation is therefore much more expensive than ours: 5440 clock cycles against 2527.

Grabher and Page designed a coprocessor dealing with $\mathbb{F}_{3^{m}}$ arithmetic, which is controlled by a general purpose processor [21]. Their hardware accelerator embeds a single multiplier over $\mathbb{F}_{3^{m}}$. Our architectures requires roughly 2.5 times as
many slices, while performing up to nine multiplications in parallel.

## B. ASIC Implementation

We designed the first ASIC implementation of the reduced $\eta_{T}$ pairing ( $0.18 \mu \mathrm{~m}$ CMOS technology). Our two hardware accelerators require roughly the same number of slices on Xilinx FPGAs. However, the architecture based on a coprocessor for the $\eta_{T}$ pairing and a coprocessor for the final exponentiation involves two register files. Since they are
implemented using the numerous memory blocks available in modern FPGAs, they are not taken into account in our area measurement. We decided to minimize the area of the chip and selected the coprocessor for arithmetic over $\mathbb{F}_{3^{m}}$ with $D=3$. Furthermore, this architecture is more versatile than the $\eta_{T}$ pairing accelerator described in Section IV-A. A simple modification of the control unit would allow us to support scalar multiplication in a new version of the ASIC. Table IV summarizes our place-and-route results. The PairingLite chip computes the reduced $\eta_{T}$ pairing (Algorithms 1 and 2) in $46.7 \mu \mathrm{~s}$. This timing includes the 52 and 78 clock cycles required to write the coordinates $P$ and $Q$ in the register file and to read the result, respectively.

TABLE IV
ASIC implementation of the reduced $\eta_{T}$ PAIRING (PLACE-AND-ROUTE FIGURES).

| Process | TSMC CL018G $(0.18 \mu \mathrm{~m}$ CMOS $)$ |
| :--- | :--- |
| Area | 193765 2NAND gates |
| Frequency | 200 MHz |
| Calculation time | $46.7 \mu \mathrm{~s}$ |
| Core size | $3849.6 \mu \mathrm{~m} \times 3849.6 \mu \mathrm{~m}$ |
| Package | TSMC CQFP 100 pin |
| Operating voltage | VDD CORE: $1.8 \mathrm{~V}, \mathrm{VDD} \mathrm{IO:} 3.3 \mathrm{~V}$ |
| Power consumption | Total power: 671.739 mW |
| Consumption current | Total current: 373.188 mA |
| Temperature conditions | $25^{\circ} \mathrm{C}$ |
| Output terminal | Drive capability 4 mA |

Figures 6 and 7 describe the evaluation board we designed to test the PairingLite ASIC (on the left on Figure 6). We also included a Cyclone II device (on the right on Figure 6) to test our FPGA architectures, and a true random number generator manufactured by FDK corporation to produce secret keys. A USB port allows one to connect the board to a computer. The figures reported in Table IV were measured using this board.

In order to check that it is possible to correctly compute the reduced $\eta_{T}$ pairing, we implemented the BLS short signature scheme [13]. The map-to-point function is computed in software. Then, the two pairings involved in the verification are performed in hardware on our evaluation board and in software on a desktop computer. We compare the results returned by the ASIC, the FPGA and the software. It takes 0.8 ms to send the coordinates of points $P$ and $Q$, compute the pairing on the ASIC, and read the result. Communications are clearly a bottleneck here, however, recall that the only purpose of our board is to serve as a prototype.

## VI. CONCLUSION

We proposed two parallel architectures to compute the reduced $\eta_{T}$ pairing in characteristic three and reported the first ASIC implementation of a pairing accelerator. Our coprocessors take advantage of a novel sparse multiplication algorithm over $\mathbb{F}_{3^{6 m}}$. Instead of minimizing the number of multiplications over $\mathbb{F}_{3^{m}}$, we tried to find a good trade-off between the number of multiplications and additions over $\mathbb{F}_{3^{m}}$. Our method also allows for sharing operands between up to three multipliers and reduces the number of accesses to memory compared to other algorithms.

Our next challenge is to design a pairing accelerator providing the level of security of AES-128. We plan to make a thorough comparison between supersingular curves over $\mathbb{F}_{2^{m}}$ and $\mathbb{F}_{3^{m}}$. We will consider several architectures: small processors based on a single unified operator [7], accelerators embedding several parallel-serial multipliers, and massively parallel architectures based on a Karatsuba-Ofman multiplier. The study of the Ate pairing [25] would also be of interest, for it presents a large speedup when compared to the Tate pairing and also supports non-supersingular curves. Once the best curve and architecture will be defined, we'd like to design a coprocessor for pairing-based cryptography supporting the most widely used primitives (e.g. pairing, random number generation, scalar multiplication, hashing, etc).

## ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their valuable comments. This work was supported by the New Energy and Industrial Technology Development Organization (NEDO), Japan.

## References

[1] A. Barenghi, G. Bertoni, L. Breveglieri, and G. Pelosi. A FPGA coprocessor for the cryptographic Tate pairing over $\mathbb{F}_{p}$. In Proceedings of the Fourth International Conference on Information Technology: New Generations (ITNG'08). IEEE Computer Society, 2008.
[2] P.S.L.M. Barreto. A note on efficient computation of cube roots in characteristic 3. Cryptology ePrint Archive, Report 2004/305, 2004.
[3] P.S.L.M. Barreto, S.D. Galbraith, C. Ó hÉigeartaigh, and M. Scott. Efficient pairing computation on supersingular Abelian varieties. Designs, Codes and Cryptography, 42:239-271, 2007.
[4] P.S.L.M. Barreto, H.Y. Kim, B. Lynn, and M. Scott. Efficient algorithms for pairing-based cryptosystems. In M. Yung, editor, Advances in Cryptology - CRYPTO 2002, number 2442 in Lecture Notes in Computer Science, pages 354-368. Springer, 2002.
[5] G. Bertoni, L. Breveglieri, P. Fragneto, and G. Pelosi. Parallel hardware architectures for the cryptographic Tate pairing. In Proceedings of the Third International Conference on Information Technology: New Generations (ITNG'06). IEEE Computer Society, 2006.
[6] J.-L. Beuchat, N. Brisebarre, J. Detrey, E. Okamoto, and F. RodríguezHenríquez. A comparison between hardware accelerators for the modified Tate pairing over $\mathbb{F}_{2^{m}}$ and $\mathbb{F}_{3^{m}}$. In Proceedings of Pairing 2008, Lecture Notes in Computer Science. Springer, 2008. To appear. An extended version is available as Report 2008/115 of the Cryptology ePrint Archive.
[7] J.-L. Beuchat, N. Brisebarre, J. Detrey, E. Okamoto, M. Shirase, and T. Takagi. Algorithms and arithmetic operators for computing the $\eta_{T}$ pairing in characteristic three. IEEE Transactions on Computers, 57(11):1454-1468, November 2008.
[8] J.-L. Beuchat, N. Brisebarre, M. Shirase, T. Takagi, and E. Okamoto. A coprocessor for the final exponentiation of the $\eta_{T}$ pairing in characteristic three. In C. Carlet and B. Sunar, editors, Proceedings of Waifi 2007, number 4547 in Lecture Notes in Computer Science, pages 25-39. Springer, 2007.
[9] J.-L. Beuchat, T. Miyoshi, J.-M. Muller, and E. Okamoto. Horner's rulebased multiplication over $\mathrm{GF}(p)$ and $\mathrm{GF}\left(p^{n}\right)$ : A survey. International Journal of Electronics, 2008. To appear.
[10] J.-L. Beuchat, M. Shirase, T. Takagi, and E. Okamoto. An algorithm for the $\eta_{T}$ pairing calculation in characteristic three and its hardware implementation. In P. Kornerup and J.-M. Muller, editors, Proceedings of the 18th IEEE Symposium on Computer Arithmetic, pages 97-104. IEEE Computer Society, 2007.
[11] D. Boneh and M. Franklin. Identity-based encryption from the Weil pairing. In J. Kilian, editor, Advances in Cryptology - CRYPTO 2001, number 2139 in Lecture Notes in Computer Science, pages 213-229. Springer, 2001.
[12] D. Boneh, C. Gentry, and B. Waters. Collusion resistant broadcast encryption with short ciphertexts and private keys. In V. Shoup, editor, Advances in Cryptology - CRYPTO 2005, number 3621 in Lecture Notes in Computer Science, pages 258-275. Springer, 2005.


Fig. 6. Evaluation board for the PairingLite chip.

13] D. Boneh, B. Lynn, and H. Shacham. Short signatures from the Weil pairing. In C. Boyd, editor, Advances in Cryptology - ASIACRYPT 2001, number 2248 in Lecture Notes in Computer Science, pages 514-532. Springer, 2001.
[14] R. Dutta, R. Barua, and P. Sarkar. Pairing-based cryptographic protocols: A survey. Cryptology ePrint Archive, Report 2004/64, 2004.
[15] I. Duursma and H.S. Lee. Tate pairing implementation for hyperelliptic curves $y^{2}=x^{p}-x+d$. In C. S. Laih, editor, Advances in Cryptology ASIACRYPT 2003, number 2894 in Lecture Notes in Computer Science, pages 111-123. Springer, 2003.
[16] S.E. Erdem, T. Yamk, and Ç.K. Koç. Polynomial basis multiplication over GF $\left(2^{m}\right)$. Acta Applicandae Mathematicae, 93(1-3):33-55, September 2006.
[17] K. Fong, D. Hankerson, J. López, and A. Menezes. Field inversion and point halving revisited. IEEE Transactions on Computers, 53(8):10471059, August 2004.
[18] G. Frey and H.-G. Rück. A remark concerning $m$-divisibility and the discrete logarithm in the divisor class group of curves. Mathematics of Computation, 62(206):865-874, April 1994.
[19] S.D. Galbraith, K. Harrison, and D. Soldera. Implementing the Tate pairing. In C. Fieker and D.R. Kohel, editors, Algorithmic Number Theory - ANTS V, number 2369 in Lecture Notes in Computer Science, pages 324-337. Springer, 2002.
[20] E. Gorla, C. Puttmann, and J. Shokrollahi. Explicit formulas for efficient multiplication in $\mathbb{F}_{36 m}$. In C. Adams, A. Miri, and M. Wiener, editors, Selected Areas in Cryptography - SAC 2007, number 4876 in Lecture Notes in Computer Science, pages 173-183. Springer, 2007.
[21] P. Grabher and D. Page. Hardware acceleration of the Tate pairing in characteristic three. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, number 3659 in Lecture Notes in Computer Science, pages 398-411. Springer, 2005.
[22] R. Granger, D. Page, and N.P. Smart. High security pairing-based cryptography revisited. In F. Hess, S. Pauli, and M. Pohst, editors,

Algorithmic Number Theory - ANTS VII, number 4076 in Lecture Notes in Computer Science, pages 480-494. Springer, 2006.
[23] R. Granger, D. Page, and M. Stam. On small characteristic algebraic tori in pairing-based cryptography. LMS Journal of Computation and Mathematics, 9:64-85, March 2006.
[24] J. Guajardo, T. Güneysu, S. Kumar, C. Paar, and J. Pelzl. Efficient hardware implementation of finite fields with applications to cryptography. Acta Applicandae Mathematicae, 93(1-3):75-118, September 2006.
[25] F. Hess, N. Smart, and F. Vercauteren. The Eta pairing revisited. IEEE Transactions on Information Theory, 52(10):4595-4602, October 2006.
[26] T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in $\mathrm{GF}\left(2^{m}\right)$ using normal bases. Information and Computation, 78:171-177, 1988.
[27] J. Jiang. Bilinear pairing (Eta_T Pairing) IP core. Technical report, City University of Hong Kong - Department of Computer Science, May 2007.
[28] A. Joux. A one round protocol for tripartite Diffie-Hellman. In W. Bosma, editor, Algorithmic Number Theory - ANTS IV, number 1838 in Lecture Notes in Computer Science, pages 385-394. Springer, 2000.
[29] M. Keller, T. Kerins, F. Crowe, and W.P. Marnane. FPGA implementation of a $\operatorname{GF}\left(2^{m}\right)$ Tate pairing architecture. In K. Bertels, J.M.P. Cardoso, and S. Vassiliadis, editors, International Workshop on Applied Reconfigurable Computing (ARC 2006), number 3985 in Lecture Notes in Computer Science, pages 358-369. Springer, 2006.
[30] M. Keller, R. Ronan, W.P. Marnane, and C. Murphy. Hardware architectures for the Tate pairing over $\operatorname{GF}\left(2^{m}\right)$. Computers and Electrical Engineering, 33(5-6):392-406, 2007.
[31] T. Kerins, W.P. Marnane, E.M. Popovici, and P.S.L.M. Barreto. Efficient hardware for the Tate pairing calculation in characteristic three. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, number 3659 in Lecture Notes in Computer Science, pages 412-426. Springer, 2005.
[32] N. Koblitz and A. Menezes. Pairing-based cryptography at high security


Fig. 7. Architecture of the evaluation board.
levels. In N. P. Smart, editor, Cryptography and Coding, number 3796 in Lecture Notes in Computer Science, pages 13-36. Springer, 2005.
[33] G. Kömürcü and E. Savas. An efficient hardware implementation of the Tate pairing in characteristic three. In E. Prasolova-Førland and M. Popescu, editors, Proceedings of the Third International Conference on Systems - ICONS 2008, pages 23-28. IEEE Computer Society, 2008.
[34] S. Kwon. Efficient Tate pairing computation for elliptic curves over binary fields. In C. Boyd and J. M. González Nieto, editors, Information Security and Privacy - ACISP 2005, volume 3574 of Lecture Notes in Computer Science, pages 134-145. Springer, 2005.
[35] H. Li, J. Huang, P. Sweany, and D. Huang. FPGA implementations of elliptic curve cryptography and Tate pairing over a binary field. Journal of Systems Architecture, 54:1077-1088, 2008.
[36] A. Menezes, T. Okamoto, and S.A. Vanstone. Reducing elliptic curves logarithms to logarithms in a finite field. IEEE Transactions on Information Theory, 39(5):1639-1646, September 1993.
[37] V.S. Miller. Short programs for functions on curves. Available at http: //crypto.stanford.edu/miller, 1986.
[38] V.S. Miller. The Weil pairing, and its efficient calculation. Journal of Cryptology, 17(4):235-261, 2004.
[39] S. Mitsunari, R. Sakai, and M. Kasahara. A new traitor tracing. IEICE Trans. Fundamentals, E85-A(2):481-484, Feb 2002.
[40] C. Ó hÉigeartaigh. Pairing Computation on Hyperelliptic Curves of Genus 2. PhD thesis, Dublin City University, 2006.
[41] R. Ronan, C. Murphy, T. Kerins, C. Ó hÉigeartaigh, and P.S.L.M. Barreto. A flexible processor for the characteristic $3 \eta_{T}$ pairing. Int. J. High Performance Systems Architecture, 1(2):79-88, 2007.
[42] R. Ronan, C. Ó hÉigeartaigh, C. Murphy, M. Scott, and T. Kerins. FPGA acceleration of the Tate pairing in characteristic 2. In Proceedings of the IEEE International Conference on Field Programmable Technology - FPT 2006, pages 213-220. IEEE, 2006.
[43] R. Ronan, C. Ó hÉigeartaigh, C. Murphy, M. Scott, and T. Kerins. Hardware acceleration of the Tate pairing on a genus 2 hyperelliptic curve. Journal of Systems Architecture, 53:85-98, 2007.
[44] R. Sakai, K. Ohgishi, and M. Kasahara. Cryptosystems based on
pairing. In 2000 Symposium on Cryptography and Information Security (SCIS2000), Okinawa, Japan, pages 26-28, January 2000.
[45] C. Shu, S. Kwon, and K. Gaj. FPGA accelerated Tate pairing based cryptosystem over binary fields. In Proceedings of the IEEE International Conference on Field Programmable Technology - FPT 2006, pages 173180. IEEE, 2006.
[46] J.H. Silverman. The Arithmetic of Elliptic Curves. Number 106 in Graduate Texts in Mathematics. Springer-Verlag, 1986.
[47] L. Song and K.K. Parhi. Low energy digit-serial/parallel finite field multipliers. Journal of VLSI Signal Processing, 19(2):149-166, July 1998.
[48] E.R. Verheul. Evidence that XTR is more secure than supersingular elliptic curve cryptosystems. Journal of Cryptology, 17(4):277-296, 2004.
[49] L.C. Washington. Elliptic Curves - Number Theory and Cryptography. CRC Press, 2nd edition, 2008.


[^0]:    J.-L. Beuchat, A. Kanaoka, P. Ith, M. Mambo, and E. Okamoto are with the Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan.
    H. Doi is with the Graduate School of Information Security, Institute of Information Security, 2-14-1 Tsuruya-cho Kanagawa-ku, Yokohama 2210835, Japan.
    K. Fujita, M. Katouno, R. Soga, and H. Yamamoto are with FDK Module System Technology Corporation, 1 Kamanomae, Kamiyunagaya-machi, Jyoban, Iwaki-shi, Japan.
    A. Inomata is with the Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara, 6300192, Japan.
    T. Okamoto is with the Department of Computer Science, Tsukuba University of Technology, 4-12-7 Kasuga, Tsukuba, Ibaraki, 305-8521, Japan.
    T. Shiga and A. Vithanage are with FDK Corporation, 1 Kamanomae, Kamiyunagaya-machi, Jyoban, Iwaki-shi, Japan.
    M. Shirase and T. Takagi are with the School of Systems Information Science, Future University-Hakodate, 116-2 Kamedanakano-cho, Hakodate, Hokkaido, 041-8655, Japan.

    This work was supported by the New Energy and Industrial Technology Development Organization (NEDO), Japan. This paper is an extended version of [10].

[^1]:    ${ }^{1}$ See for instance Theorem V.3.1 in [46] for a definition.

[^2]:    ${ }^{2}$ The exponent is $\left(3^{6 m}-1\right) /\left(3^{3 m}+1\right)=3^{3 m}-1$.

