# Efficient Hardware Design for Computing Pairings Using Few FPGA In-built DSPs 

Riadh Brinci*, Walid Khmiri, Mefteh Mbarek, Abdellatif Ben Rabâa, Ammar<br>Bouallègue<br>*(Labortoire Sys'Com, Ecole Nationale d'Ingénieurs de Tunis, 1000 Tunis<br>Email: br.riadh@gmail.com)


#### Abstract

This paper is devoted to the design of a 258-bit multiplier for computing pairings over Barreto-Naehrig (BN) curves at 128 -bit security level. The proposed design is optimized for Xilinx field programmable gate array (FPGA). Each 258 -bit integer is represented as a polynomial with five, 65 bit signed integer, coefficients. Exploiting this splitting we designed a pipelined 65-bit multiplier based on new Karatsuba- Ofman variant using non-standard splitting to fit to the Xilinx embedded digital signal processor (DSP) blocks. We prototype the coprocessor in two architectures pipelined and serial on a Xilinx Virtex-6 FPGA using around 17000 slices and 11 DSPs in the pipelined design and 7 DSPs in the serial. The pipelined 128 -bit pairing is computed in 1.8 ms running at 225 MHz and the serial is performed in 2.2 ms running at 185 MHz . To the best of our knowledge, this implementation outperforms all reported hardware designs in term of DSP use.


Keywords-Cryptography, Field Programmable Gate Array (FPGA), Modular Multiplication, Non-Standard Splitting, Pairing-Friendly Curves

## 1 INTRODUCTION

A bilinear pairing is a map $G_{1} \times G_{2} \rightarrow G_{T}$ where $G_{1}$ and $G_{2}$ are typically additive groups and $G_{T}$ is a multiplicative group and the map is linear in each component. Many pairings used in cryptography such as the Tate pairing [1], R-ate pairing [2], ate pairing [3] and optimal pairings [4], choose $G_{1}$ and $G_{2}$ to be specific cyclic subgroups of $E\left(F_{p^{k}}\right)$, and $G_{T}$ to be a subgroup of $F^{*}{ }_{p}{ }^{k}$.

### 1.1 Ate pairing

Let $F_{p}$ be a finite field and let $E$ be an elliptic curve defined over $F_{p}$. Let $r$ be a large prime dividing $\# E\left(F_{p}\right)$ and k the embedding degree of $E\left(F_{p}\right)$ with respect to $r$, namely, the smallest positive integer $k$ such that $r \mid\left(p^{k}-1\right)$. For any finite extension field $K$ of $F_{p}$, denote with $E\left(F_{p}\right)[r]$ the $K$-rational $r$-torsion group of the curve. For $P \in E(K)$ and an integer $s$, let $O$ be the infinity point of $E$ and $f_{S, P}$ be a $K$-rational function or Miller function with divisor

$$
\left(f_{s, P}\right)=\mathrm{s}(\mathrm{P})-([\mathrm{s}] \mathrm{P})-(\mathrm{s}-1)(0)
$$

let $G_{1}=E\left(F_{p}\right)[r], G_{2}=E\left(F_{p^{k}}\right) \cap \operatorname{Ker}\left(\pi_{p}-[p]\right)$, where $\pi_{p}$ is the $p^{t h}$ power of Frobenius endomorphism;

$$
\begin{aligned}
& \pi_{p}: E \rightarrow E \\
& (x, y) \rightarrow\left(x^{p}, y^{p}\right)
\end{aligned}
$$

$$
\text { and } G_{T}=\mu_{r} \subset F^{*}{ }_{p^{k}}
$$

Let $P \in G_{1}, P \in G_{2}$ and $t=p+1-\# E(F p)$ be the trace of Frobenius, then,

$$
\alpha(P, Q)=\left(f_{t-1, Q}(P)\right)^{\left(p^{k}-1\right) / r}
$$

is non-degenerate bilinear, and computable pairing, it is the ate pairing

### 1.2 Pairing-Friendly Curves

An elliptic curve $E$ over $F_{p}$ is called pairingfriendly whenever there exists a large prime $r \mid \# E\left(F_{p}\right)$ with $r>\sqrt{ } p$ and the embedding degree $k$ is small enough, i.e. $k<\log _{2}(r) / 8$. Many construction methods result in a parametrized family of elliptic curves, i.e. $r$ and $p$ are given by the evaluation of polynomials $r(u)$ and $p(u)$ at an integer value $u$. One of the most important examples of such families are the Barreto- Naehrig (BN) curves [5], ideally suitable for implementing pairings at the 128 -bit security level. These curves have $k=12$ are defined by
$p(u)=36 u^{4}+36 u^{3}+24 u^{2}+6 u+1$
$r(u)=36 u^{4}+36 u^{3}+18 u^{2}+6 u+1$
for some $u \in Z$ such that $p$ is prime. We show that when we choose $u=2^{\tau}+s$, where $s$ is a reasonably small number, the modular multiplication in $F_{p}$ can be substantially improved.

The R -ate pairing [2] is a generalization of ate pairing and can be seen as an instantiation of optimal pairings [4]. Since the definition of the optimal ate pairing really depends on the particular elliptic curve one is using, we only provide the definition in the case of BN curves: using the same $G_{1}$ and $G_{2}$ as for ate
pairing, the optimal ate pairing on BN curves is defined as [6],

$$
\begin{aligned}
& \varrho(P, Q) \\
& =\left(f \cdot\left(f \cdot l_{a Q, Q}(P)\right)^{p} \cdot l_{\pi(a Q+Q), a Q}(P)\right)^{\left(p^{k}-1\right) / r} \\
& \text { where } a=6 u+2, f=f_{a, Q}(P) \text { and } l_{A, B} \text { denotes }
\end{aligned}
$$ the line through points $A$ and $B$

```
Algorithm 1Optimal Ate Pairing over BN Curves
Input: \(a=|6 u+2|=\sum_{i=0}^{S-1} a_{i} 2^{i}, \quad P \in E\left(F_{p}\right)[r]\),
    \(Q \in E\left(F_{p^{12}}\right)[r] \cap \operatorname{ker}\left(\pi_{p}-[p]\right)\)
Output: \(\rho(P, Q) \in F_{p^{12}}\)
ו. \(T \leftarrow Q, f \leftarrow 1\)
2.fori \(=s-2\) downto 0
    \(T \leftarrow 2 T, f \leftarrow f^{2} . l_{T, T}(P)\)
        ifa \(_{i}=1\) then
        \(T \leftarrow T+Q, f \leftarrow f . l_{T, Q}(P)\)
        end if
7. end for
s. \(f \leftarrow\left(f .\left(f . l_{a Q, Q}(P)\right)^{p} . l_{\pi(a Q+Q), a Q}(P)\right)^{\left(p^{k}-1\right) / r}\)
9. return \(f\)
```

Algorithm 1 used arithmetic in $F_{p^{12}}$ based on irreducible binomials through a tower of extensions. In our paper we present the towering scheme as:

$$
\begin{aligned}
& F_{p^{2}}=F_{p}[i] /\left(i^{2}-\beta\right), \text { where } \beta=-1 \\
& F_{p^{4}}=F_{p^{2}}[s] /\left(s^{2}-\xi\right), \text { where } \xi=i+1 \\
& F_{p^{12}}=F_{p^{4}}[t] /\left(t^{3}-s\right)=F_{p^{2}}[\tau] /\left(\tau^{6}-\xi\right),
\end{aligned}
$$

The choice of this towering $F_{p^{2}} \rightarrow F_{p^{4}} \rightarrow F_{p^{12}}$ makes the final exponentiation much cheaper than other choices. Also we choose $p \equiv 3 \bmod 4$ to accelerate arithmetic in $F_{p^{2}}$ since multiplication by $\beta=-1$ is simple subtraction. For BN curve we choose $E: y^{2}=$ $x^{3}+2$ and $u=-\left(2^{63}+857\right)$. This choice of curve parameters will simplify and speed up the reduction phases of the proposed Modular Integer Polynomial Montgomery Multiplier.

### 1.3 FPGA resources

FPGA manufacturers integrate more and more of dedicated function blocks into modern devices. For example, Xilinx Virtex-6 FPGAs include separate columns of additional function hard cores for memory (BRAM) and arithmetic DSP operations. The DSP blocks are grouped in pairs that span the height of four or five CLBs, respectively. The dual-ported BRAM matches the height of the pair of DSP blocks and supports a fast data path between memory and the DSP elements. Of particular interest is the use of these memory elements and DSP blocks for efficient Boolean and integer arithmetic operations with low signal propagation time. Large devices of Xilinx Virtex-6 class are equipped with up to thousand individual function blocks of these dedicated memory and arithmetic units. Originally, the integrated DSP blocks as indicated by their name were designed to
accelerate DSP applications, e.g., Finite Impulse Response (FIR) filters, etc. However, these arithmetic units can be programmed to perform universal arithmetic functions not limited to the scope of DSP filter applications; they support generic multiplication, addition and subtraction of (un)signed integers [7].

### 1.4 Outline

The remainder of this paper is organized as follows: Section II studies the most important existing works related to efficient hardware implementations of multiplication over $F_{p}$ suitable for computing pairings over BN curves. Section III introduces our hardware design of $65 \times 65$ bit multiplier based on DSP macro for Virtex-6 and performance comparison. Section IV focuses on the hardware implementation and performance comparison of our 258 bit multiplier. Finally, section V provides conclusion and future works.

## 2 Related Works

Since 2009, many hardware implementations of multiplication over $F_{p}$ suitable for computing pairings on BN curves was described. The first work was described by Fan et al. [8]. Their proposed architecture was based on Hybrid Montgomery Multiplier (HMM) where multiplication and reduction was interleaved. In same year, a new Application Specific Integrated Circuit (ASIC) implementation of pairings over BN curves was proposed. In 2010 Fan et al. [6] proposed a new pipelined and parallelized version of their HMM [8]. In 2011, Corona et al. [9] proposed a new hardware implementation of 258 bit multiplier suitable for computing pairings over BN curves. They used an asymmetric divide and conquer approach to efficiently implement their $65 \times 65$ bit multiplier. Their design used only 12 DSP slices on a Xilinx Virtex 6. In same year, Yao et al. [10] proposed a new hardware implementation of optimal ate pairing on Virtex 6. The design computed multiplication over $F_{p}$ using 32 DSP slices. They combined Lazy reduction with RNS representation.

## 3 Finite field Modular Multiplier

Multiplication is one of the major elements of a pairing coprocessor. In this paper, we propose a Modular Integer Polynomial Montgomery Multiplier (MIPMM) based on 5-Term Karatsuba. It is hybrid multiplier that computation is achieved in four dependent phases. In open literature there is many works proposed to efficient implement serialization of Karatsuba in pairing computation. Few papers presented implementation in prime fields [6], [8][9]. This paper focuses on the Karatsuba serialization in prime field $F_{p}$

Algorithm 2 Modular Integer Polynomial Montgomery Multiplier (MIPMM) for BN curves

$$
\begin{aligned}
& \text { Input: } \quad a(u)=\sum_{i=0}^{4} a_{i} u^{i}, b(u)=\sum_{i=0}^{4} b_{i} u^{i} \\
& \quad p(u)=36 u^{4}+36 u^{3}+24 u^{2}+6 u+1
\end{aligned}
$$

Output: $v(u) \equiv a(u) \cdot b(u) \cdot u^{-5} \bmod p(u)$

## Phase 1: Polynomial Multiplication

2. $c(u)=\underset{\substack{\sum_{1=0}^{8} c_{i} u^{i}=a(u) . b(u)}}{\substack{i \\ \text { computed by algorithm 3*/ }}}$

Phase 2: Partial coefficient reduction
4. for $i=0$ to 4
5. $q_{i}=c_{i} \operatorname{div} 2^{\tau} ; r_{i}=c_{i} \bmod 2^{\tau}$
6. $c_{i+1}=c_{i+1}+q_{i} ; \quad c_{i}=r_{i}-s . q_{i}$
7. end for
8. Phase 3: Polynomial reduction

```
.t \(u)=\left(-c_{4}+6\left(c_{3}-2 c_{2}-6\left(c_{1}-9 c_{0}\right)\right)\right) u^{4}\)
    \(+\left(-c_{3}+6\left(c_{2}-2 c_{1}-6 c_{0}\right)\right) u^{3}\)
    \(+\left(-c_{2}+6\left(c_{1}-2 c_{0}\right)\right) u^{2}\)
    \(+\left(-c_{1}+6 c_{0}\right) u\)
\(h(u)=36 t_{4} u^{3}\)
        \(+36\left(t_{4}+t_{3}\right) u^{2}\)
        \(+12\left(2 t_{4}+3\left(t_{3}+t_{2}\right)\right) u\)
        \(+6\left(t_{4}+4 t_{3}+6\left(t_{2}+t_{1}\right)\right)\)
    \(v(u)=c(u) / u^{5}+h(u)\)
    Phase 4: Coefficient reduction
    for \(i=0\) to 3
        \(q_{i}=v_{i} \operatorname{div} 2^{\tau} ; r_{i}=v_{i} \bmod 2^{\tau}\)
        \(v_{i+1}=v_{i+1}+q_{i} ; v_{i}=r_{i}-s . q_{i}\)
    end for
    return \(v(u)\)
```

This multiplier consists of 258 bits five terms Karatsuba multiplication which is constructed by one $65 \times 65$ bits multiplier basic core. There is registers units to save intermediate results to be used later. The main contributions of this paper are the efficient use of in-built features offered by modern FPGA devices: DSP, adders, subtractors, shift registers... We also designed two variants of basic 65x65 bits multiplier using respectively 7 and 11 DSP slices. Our proposed serialization exploits the independence in each phase and between phases to reduce the cycle count. Figure 1 depicts the top level of the proposed design of Modular Integer Polynomial Montgomery Multiplier for BN curves.


Figure 1 Top level of MIPMM

### 3.1 Five term Karatsuba Multiplier

The first work to compute Karatsuba using more than three terms was proposed by Peter Montgomery [11]. But it was not suitable for hardware implementation because the large number of addition and subtraction. Corona et al. [9] proposed new scheduling for addition and subtraction to fit hardware design. In this work we propose a more efficient scheduling to achieve one 258 bit multiplication in only 22 cycles. Algorithm 3 describes our Five term Karatsuba Multiplier. This multiplier is based on basic 65 bit multiplier core described in the next subsection.

```
Algorithm 3 Proposed Five Term Karatsuba
Input: \(\quad a(u)=\sum_{i=0}^{4} a_{i} u^{i}, b(u)=\sum_{i=0}^{4} b_{i} u^{i}\)
Output: \(c(u) \equiv \sum_{i=0}^{8} c_{i} u^{i}\)
1. \(p_{0}=a_{0} b_{0}\)
2. \(p_{1}=a_{1} b_{1} ; \quad n_{0}=a_{0}+a_{1} ; \quad n_{1}=b_{0}+b_{1}\)
3. \(p_{2}=n_{0} n_{1}\)
4. \(p_{3}=a_{2} b_{2} ; \quad n_{2}=a_{0}+a_{2} ; \quad n_{3}=b_{0}+b_{2}\)
5. \(p_{4}=n_{2} n_{3}\)
6. \(p_{5}=a_{3} b_{3} ; \quad n_{4}=a_{2}+a_{3} ; \quad n_{5}=b_{2}+b_{3}\)
7. \(p_{6}=n_{4} n_{5} ; \quad n_{6}=a_{3}+a_{1} ; \quad n_{7}=b_{3}+b_{1}\)
8. \(p_{7}=n_{6} n_{7} ; \quad n_{8}=n_{0}+n_{4} ; \quad n_{9}=n_{1}+n_{5}\)
9. \(p_{8}=n_{8} n_{9}\)
10. \(p_{9}=a_{4} b_{4} ; \quad n_{10}=a_{0}+a_{4} ; \quad n_{11}=b_{0}+b_{4}\)
11. \(p_{10}=n_{10} n_{11} ; \quad n_{12}=n_{0}+a_{4} ; \quad n_{13}=n_{1}+b_{4}\)
    2. \(p_{11}=n_{12} n_{13} ; \quad n_{14}=a_{2}+a_{4} ; \quad n_{15}=b_{2}+b_{4}\)
    \(p_{12}=n_{14} n_{15} ; \quad n_{16}=n_{14}+a_{3} ; \quad n_{17}=n_{15}+b_{3}\)
    \(p_{12}=n_{16} n_{17}\)
    \(c_{0}=p_{0}\)
    \(m_{0}=p_{1}+p_{0} ; \quad m_{1}=p_{1}-p_{0}\);
    \(c_{1}=p_{2}-m_{0} ; \quad m_{9}=p_{9}+m_{0}\)
    \(m_{11}=p_{1}+c_{1} ; \quad m_{14}=p_{9}+p_{3}\)
    \(m_{2}=p_{4}-p_{3} ; \quad m_{5}=p_{4}+c_{1}\)
    \(c_{2}=m_{2}+m_{1} ; \quad m_{3}=p_{3}+p_{5}\)
    \(s_{1}=p_{6}-m_{3} ; \quad m_{8}=m_{3}-m_{9}\)
    \(m_{7}=m_{5}+s_{1} ; \quad m_{12}=s_{1}-m_{11}\)
    \(m_{6}=p_{8}-p_{7}\)
    \(c_{3}=m_{6}-m_{7} ; \quad m_{10}=p_{10}+p_{7}\)
    \(c_{4}=m_{10}+m_{8} ; \quad m_{13}=p_{11}-p_{10}\)
    \(c_{5}=m_{13}-m_{11} ; \quad m_{15}=p_{12}+p_{5}\)
    \(c_{6}=m_{15}-m_{14} ; \quad m_{16}=p_{13}-s_{1}\)
    \(c_{7}=m_{16}-m_{15}\)
    return \(c(u)\)
```


### 3.2 Basic 65 bit multiplier core

The five term Karatsuba multiplier turned around this module. We propose two different architectures to perform asymmetric multiplication using common non-standard splitting technique. For instance, the asymmetric operands can be computed by the following equation with 25 and 18 bits to fit DSP core of the FPGA. We reserve the most significant bit as operand's sign. Let

$$
Z=A B=\left(\sum_{i=0}^{64} a_{i} 2^{i}\right)\left(\sum_{i=0}^{64} b_{i} 2^{i}\right)
$$

We decide to not perform full DSP computation for the last core $B_{51: 64} A_{0: 23}$ to avoid non-useful operations. So, we can write $Z=Z_{0}+2^{24} Z_{1}+$ $2^{48} Z_{2}$ as described in Figure 2 For example we can split the operands of $Z_{0}$ as

$$
\begin{aligned}
Z_{0}=B_{0: 16} A_{0: 23} & +2^{17}\left(B_{17: 33} A_{0: 23}\right. \\
& +2^{17}\left(B_{34: 50} A_{0: 23}\right. \\
& \left.\left.+2^{17}\left(B_{51: 64} A_{0: 23}\right)\right)\right)
\end{aligned}
$$

In this work we propose two designs called respectively full pipelined and serial architectures depicted respectively. In the first design, we propose three sets of DSP cores arrangement $\left(\mathrm{Z}_{0}, \mathrm{Z}_{1}\right.$ and $\left.\mathrm{Z}_{2}\right)$ : the two first sets have the same design and each set is performed by four DSP slices.


Figure 3 Proposed tilling
The last set is computed by only three slices. We used in the basic core three DSP parameterization detailed as eight $25 \times 18$ bit, two $25 \times 15$ and one $18 \times 18$ DSP configuration. This idea reduces the frequency of the multiplier but let us reduce power consumption by saving extra registers and non-useful operations. This first design achieves one $65 \times 65$ bit multiplication in seven cycles using eleven DSP cores.


Figure 4 Proposed hardware design of the first DSP set
The diagram in describes the delay constrained of the full pipelined architecture that takes seven cycles to achieve $65 \times 65$ bit multiplication.


Figure 5 delay constrained of the full pipelined architecture

In the second design, we rearrange operations to make $Z_{0}$ and $Z_{1}$ share the same hardware. So, we compute $Z_{0}$ and $Z_{2}$ in parallel. To get results at same time we added a pipeline stage in the DSP set of $Z_{2}$. At the $5^{\text {th }}$ we have our outputs. In the second cycle, we entered the operands of $Z_{1}$ to get it at the $6^{\text {th }}$ cycle. At this time we have also the result of the addition $S_{0}=Z_{0}+$ $2^{48} Z_{2}$. The adder is an in-built feature of the FPGA configured to give result after one cycle. Finally, we performed the last addition, configured also with latency one, giving full result after seven cycles using only seven DSP slices. In this second architecture we have added extra hardware finite state machine, multiplexers and demultiplexers.

### 3.3 Coefficient and polynomial reductions

The architecture described in Pipelined architecture
(b)Serial architecture

Figure 6 depicts the top level of each architecture shows the polynomial reduction phase. It can reduce the coefficients one by one taking twelve cycles to achieve the entire reduction. We performed multiplication by $s$ using shifts and addition. In this phases, the complexity can be reduced by exploiting the characteristics of the different constants. Since $s=2^{5}\left(2^{4}+2^{3}\right)+2^{6}+\left(2^{4}+2^{3}\right)+1$,
multiplication by $s$ is performed by three additions in three cycles. There is also multiplication by the following constants $6,9,12$ and36 computed by shifts and additions, e.g. $6 a=2^{2} a+2 a ; 9 a=$ $2^{3} a+a ; 12 a=2^{3} a+2^{2} a ; 36 a=2^{5} a+2^{2} a$;

### 3.4 Delay constrained of MIPMM

As mentioned before, the $65 \times 65$ bit multiplication takes seven cycles to achieve one multiplication. We get all partial products (PP) $p_{i \in\{0,13\}}$ shown in algorithm 3 after 13 cycles. However the delay of datapath for the post partial products combined with PP is two clock cycles. As result, five term Karatsuba gives the first output after 22 clock cycles. As soon as each $c_{i \in\{0,7\}}$ gets out from the pipelined PP core, it is scheduled on the fly to be partially reduced. This phase ends at the 22 second clock cycle. Other reduction phases also combined with phase one take 13 cycles Therefore, to sum up, the cost of the entire multiplier is 35 clock cycles.

(a) Pipelined architecture
(b) Serial architecture

Figure 6 Top level design of proposed pipelined and serial architectures

## 4 Pairing Design

Most operations in optimal pairing algorithm steps of are performed in $F_{p^{12}}$. Many techniques give efficient computation in extended fields with low complexity. We choose methods with minimum squaring and multiplication. The underlying operations are computed in base field. Therefore we design our coprocessor as scheduling of $F_{p}$ operations.

As shown in Algorithm 1, Miller loop phase consists of the following major operations:

- Doubling step, is the elliptic curve point doubling combined with the computation of the line $l$.
- Addition step, is the elliptic curve point addition combined with the computation of the line $l$.
- $\quad$ Squaring of the Miller function $f$.
- Spare multiplication of $f$ by $l$ having only half of non-zero coefficients.

We adopt homogenous coordinates proposed in [12] to efficient compute the different curve operations in the Miller loop. The listed above steps need arithmetic in $F_{p^{2}}$ such as multiplication and squaring. We propose Karatsuba method described by the following equation to compute multiplication.
$v_{0}=a_{0} b_{0} ; v_{1}=a_{1} b_{1}$
$c_{0}=v_{0}-v_{1}$
$c_{1}=\left(a_{0}+a_{1}\right)\left(b_{0}+b_{1}\right)-v_{0}-v_{1}$
We also refer to complex method to perform squaring in $F_{p^{2}}$. First, we precompute $v_{0}=a_{0} a_{1}$. Then, the square $c=a^{2}$ is computed as
$c_{0}=\left(a_{0}+a_{1}\right)\left(a_{0}-a_{1}\right)$ $c_{1}=2 v_{0}$

Multiplication and squaring operations need respectively 36 and 39 cycles to get out their results. To efficient compute Miller loop we made rearrangements and scheduling in each step to fit our
design. In the doubling step we have to compute three squaring and two multiplications which are equivalent to $12 F_{p}$ multiplications in the first part. They takes 48 cycles. In the second part, we have 17 $F_{p}$ multiplications giving results out in the $94^{\text {th }}$ cycle. The $f^{2}$ and $f . l$ need $111 \quad F_{p}$ multiplications computed in 3.885 cycles. To sum up, each Miller loop iteration takes 3.979 clock cycles. Using the same strategy in curve rearrangement addition step is achieved in 3,385 clock cycles. As result, Miller loop takes 277.000 cycles.

The final exponentiation consists of final addition and final exponentiation. Table 1 gives the different operation in this step and the cycle count.

| step | $F_{p}$ multiplications |
| :--- | :--- |
| Final Addition | 204 |
| $f^{p^{6}-1}$ | 579 |
| $f^{p^{6}+1}$ | 768 |
| $f^{p^{6}-p^{2}+1 / n}$ | 1813 |
| Others | 356 |
| Cycle count | 130.200 |

Table 1 Cycle count of the final exponentiation

## 5 RESULTS AND COMPARISON

The hole design has been done in VHDL using Xilinx ISE design suite on a Virtex-6 xc6vlx240t3 ff784 FPGA. It used in total 17560 slices, 7 and 11 DSP cores in our serial and pipelined architectures respectively. It runs at 185 Mhz and finishes pairing computation on BN curve at 128 bit security level in 2.2 ms .

Table 2 lists the performance hardware implementations reported in recent literature. Compared with the other hardware implementation [6] our design saves DSP cores

| Designs | Curve | Architecture | Target | Area | Frequency <br> MHz | Cycles <br> x10 | Delay <br> ms |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| This work | BN $_{128}$ | Pipelined | xc6vlx240t | 17560 slices, 11 DSP | 225 | 407 | 1.8 |
|  | BN $_{128}$ | Serial | xc6vlx240t | 14890 slices, 7 DSP | 185 | 407 | 2.2 |
| $[8]$ | BN $_{126}$ | HMM digit-serial | ASIC 130mm | 183 k Gates | 204 | 861 | 4.2 |
| $[6]$ | BN $_{128}$ | HMM parallel | xc6vlx240t | 4014 slices, 42 DSP | 210 | 245 | 1.17 |
| $[13]$ | BN $_{128}$ | Blakley | Xc4vlx200 | 52000 slices | 50 | 821 | 16.4 |
| $[14]$ | BN $_{126}$ | Montgomery | xc6vlx240t | 3813 slices, 144 DSP | 166 | 70 | 0.43 |
| $[15]$ | BN $_{126}$ | RNS (Parallel) | xc6vlx240t | 5237 slices, 64 DSP | 210 | 78 | 0.338 |

Table 2 Performance comparison of hardware implementations of pairings at around 128-bit security
until $90 \%$. Our goal is to keep the design of the pairing coprocessor full used by efficient resource sharing at high frequency. It is a serial implementation with minimum area use. The current design not only gains in in-built DSP slices with comparable pairing computation time but also shows that modern FPGA can be able to perform pairing with high complexity at higher security level on different friendly curves with large algebraic closure.

## 6 CONCLUSION

In this paper we introduce a new hardware design to efficiently serialize polynomial integer multiplication on BN curves over large prime field. Due to deep arrangement and careful scheduling of different steps of the coprocessor our design saves $90 \%$ of DSP slices and achieves one pairing computation in 1.8 ms . Our future work will be the multi-pairing computation to respond faster to many client requests. We plan also to implement other curves and different types of pairings on this architecture. Furthermore, we will provide an optimal parameter set and pairing implementations for higher security level including 192-bit or 256-bit security.

## REFERENCES

[1] P. S. L. M. Barreto, H. Y. Kim, B. Lynn, and M. Scott. Efficient Algorithms for Pairing Based Cryptosystems. CRYPTO 2002, volume 2442 of Lecture Notes in Computer Science, pages 354-368. Springer, 2002.
[2] E. Lee, H. S. Lee, and C. M. Park. Efficient and Generalized Pairing Computation on Abelian Varieties. Cryptology ePrint Archive, Available from http://eprint.iacr.org/. Report 2009/040.
[3] F. Hess, N. P. Smart, and F. Vercauteren. The Eta Pairing Revisited. IEEE Transactions on Information Theory, 52(10), pages 459-462, Oct. 2006.
[4] F. Hess. Pairing Lattices. Pairing 2008, volume 5209 of Lecture Notes in Computer Science, pages 18-38. Springer, 2008.
[5] P. Barreto and M. Naehrig. Pairing-friendly elliptic curves of prime order. Selected Areas in

Cryptography, SAC 2005, LNCS 3897, pages 319331, 2006.
[6] J. Fan, F. Vercauteren, and I. Verbauwhede. Efficient hardware implementation of Fp -arithmetic for pairing-friendly curves. Computers, IEEE Transactions on, 61(5), 2012, 676-685.
[7] T.Güneysu. Utilizing hard cores of modern FPGA devices for high-performance cryptography, Journal of Cryptographic Engineering, 1(1), 2011, 37-55.
[8] J. Fan, F. Vercauteren, and I. Verbauwhede. Faster Fp Arithmetic for Cryptographic Pairings on BarretoNaehrig Curves. CHES 2009, volume 5747 of Lecture Notes in Computer Science, pages 240-253. Springer, 2009.
[9] C. Corona, C., Moreno, E. F., \& Henriquez, F. R. Hardware design of a 256-bit prime field multiplier suitable for computing bilinear pairings. International Conference on Reconfigurable Computing and FPGAs (ReConFig 2011), 2011, 229234.
[10] G. X. Yao, J. Fan, R. C. Cheung, and I. Verbauwhede. A high speed pairing coprocessor using RNS and lazy reduction. Cryptology ePrint Archive, Available from http://eprint.iacr.org/. Report 2011/258, 2011
[11] P. L. Montgomery. Five, six, and seven-term Karatsuba-like formulae, IEEE Transactions on Computers, vol. 54(3), 362-369
[12] Costello, C., Lange, T., \& Naehrig, M. (2010). Faster pairing computations on curves with highdegree twists. In Public Key Cryptography-PKC 2010 (pp. 224-242). Springer Berlin Heidelberg.
[13] Ghosh, S., Mukhopadhyay, D., \& Roychowdhury, D. (2010). High speed flexible pairing cryptoprocessor on FPGA platform. Pairing-Based Cryptography-Pairing 2010 (pp. 450-466). Springer Berlin Heidelberg.
[14] Ghosh, S., Verbauwhede, I., \& Roychowdhury, D. (2013). Core based architecture to speed up optimal ate pairing on FPGA platform. In Pairing-Based Cryptography-Pairing 2012 (pp. 141-159). Springer Berlin Heidelberg.
[15] Yao, G. X., Fan, J., Cheung, R. C., \& Verbauwhede, I. (2013). Faster pairing coprocessor architecture. In Pairing-Based Cryptography-Pairing 2012 (pp. 160-176). Springer Berlin Heidelberg.

