# Pushing the Limit of Vectorized Polynomial Multiplications for NTRU Prime 

Vincent Hwang<br>Max Planck Institute for Security and Privacy, Bochum, Germany<br>vincentvbh7@gmail.com


#### Abstract

We conduct a systematic examination of vector arithmetic for polynomial multiplications in software. Vector instruction sets and extensions typically specify a fixed number of registers, each holding a power-of-two number of bits, and support a wide variety of vector arithmetic on registers. Programmers then try to align mathematical computations with the vector arithmetic supported by the designated instruction set or extension. We delve into the intricacies of this process for polynomial multiplications. In particular, we introduce "vectorizationfriendliness" and "permutation-friendliness", and review "Toeplitz matrixvector product" to systematically identify suitable mappings from homomorphisms to vectorized implementations. To illustrate how the formalization works, we detail the vectorization of polynomial multiplication in $\mathbb{Z}_{4591}[x] /\left\langle x^{761}-x-1\right\rangle$ used in the parameter set sntrup761 of the NTRU Prime key encapsulation mechanism. For practical evaluation, we implement vectorized polynomial multipliers for the ring $\mathbb{Z}_{4591}[x] /\left\langle\Phi_{17}\left(x^{96}\right)\right\rangle$ with AVX2 and Neon. We benchmark our AVX2 implementation on Haswell and Skylake and our Neon implementation on Cortex-A72 and the "Firestorm" core of Apple M1 Pro. Our AVX2-optimized implementation is 1.99-2.16 times faster than the state-of-the-art AVX2-optimized implementation by [Bernstein, Brumley, Chen, and Tuveri, USENIX Security 2022] on Haswell and Skylake, and our Neon-optimized implementation is 1.29-1.36 times faster than the state-of-the-art Neon-optimized implementation by [Hwang, Liu, and Yang, ACNS 2024] on Cortex-A72 and Apple M1 Pro. For the overall scheme with AVX2, we reduce the batch key generation cycles (amortized with batch size 32 ) by $7.9 \%-12.0 \%$, encapsulation cycles by $7.1 \%-10.3 \%$, and decapsulation cycles by $10.7 \%-13.3 \%$ on Haswell and Skylake. For the overall performance with Neon, we reduce the encapsulation cycles by $3.0 \%-6.6 \%$ and decapsulation cycles by $12.8 \%$ - 15.1\% on Cortex-A72 and Apple M1 Pro.


Keywords: Vectorization • Polynomial Multiplication • Fast Fourier Transform - NTRU Prime

## 1 Introduction

At PQCrypto 2016, the National Institute of Standards and Technology called for post-quantum cryptography to replace existing public-key cryptography due
to the discovery of polynomial time quantum algorithms solving integer factorization and discrete logarithm [23]. Among the candidates, lattice-based cryptosystems are the most popular due to their balanced key sizes, ciphertext size, signature size, and performance cycles. In many lattice-based cryptosystems, polynomial multiplication is one of the dominating operations for the performance cycles. While constructing cryptosystems, cryptographers choose between various polynomial rings to balance between performance cycles and various security notions. This paper aims to improve public understanding of the interactions between the uses of vector arithmetic and the algebraic aspects of the polynomial rings.

Vector instruction sets and extensions typically specify a fixed number of vector registers, each holding power-of-two number of bits, and support a variety of vector arithmetic operating on these registers. Programmers then try to map the mathematical computations to strings of vector arithmetic supported by the target instruction set or extension. We thoroughly investigate this process for polynomial multiplications in lattice-based cryptosystems. There are two questions we wish to answer in this paper:

1. Why homomorphisms defined on polynomial rings with power-of-two-multiple number of coefficients are frequently assumed to admit efficient vectorization processes?
2. Which homomorphisms are suitable for vectorization?

We answer the first question as follows. In a vector instruction set or extension, there are usually component-wise addition, subtraction, multiplication and variants. We formalize the notion vectorization-friendliness and explain why homomorphisms resulting in small-dimensional power-of-two size polynomial multiplications can be suitably mapped to component-wise arithmetic. After decomposing a large problem into several small problems, we divide vector instruction sets and extensions into two groups by the presence of vector-by-scalar multiplication instructions. An instruction is called vector-by-scalar multiplication instruction if it multiplies all the components of a vector by a scalar and returns a vector of elements. If there are vector-by-scalar multiplication instructions, we explain that if the remaing polynomial multiplications are Toeplitz matrix-vector products, then vectorization-friendliness suffices to justify suitable vectorization of the overall transformation. On the other hand, if there are no vector-by-scalar multiplication instructions, we formalize the notion permutation-friendliness and relate it to the power-of-two nature of the number of subproblems.

For the second question, an evident example is the radix-2 Cooley-Tukey fast Fourier transformation (FFT). Recent work [5] showed that radix-2 Schönhage's and Nussbaumer's FFTs, built upon the power-of-two cyclotomic polynomial moduli, are convenient ones when radix-2 Cooley-Tukey FFT cannot be defined over the native coefficient ring. However, Schönhage's and Nussbaumer's FFTs double the number of coefficients for each application, eventually leading to more small-dimensional polynomial multiplications than the traditional

Cooley-Tukey FFT. More recently, [14] proposed Rader's FFT for large Fermat-prime-size transformation and radix-2 Bruun's FFT for the small-dimensional transformation and removed the growth of the number of coefficients under the vectorization context. The downside is that the Fermat-prime-size transformation from Rader's FFT does not nicely align with the power-of-two nature of vectorization. We identify that truncated Rader's FFT over Fermat-prime-size cyclotomic polynomial moduli, previously used for computing the norm of an abelian extension with prime conductor [3, Section 4.8], is a suitable one for vectorization due to the power-of-two nature of the transformation size.

Contributions. We summarize our contributions as follows.

- We formalize vectorization-friendliness capturing the nature of componentwise arithmetic supported by a vector instruction set or extension.
- If there are vector-by-scalar multiplication instructions, we explain that vectorization-friendly transformations resulting in small-dimensional Toeplitz matrix-vector products are suitable for vectorization.
- On the other hand, if there are no vector-by-scalar multiplication instructions, we formalize permutation-friendliness capturing the power-of-two nature of the number of subproblems.
- We implement our polynomial multipliers in AVX2 and Armv8.0-A Neon for the ring $\mathbb{Z}_{4591}[x] /\left\langle\Phi_{17}\left(x^{96}\right)\right\rangle$ implementing polynomial multiplications in the NTRU Prime parameter set sntrup761.
- For the polynomial multiplication, our AVX2 implementation outperforms the state-of-the-art AVX2-optimized implementation from [5] by $1.99 \times$ on Haswell and $2.16 \times$ on Skylake, and our Neon implementation outperforms the state-of-the-art Neon-optimized implementation from [14] by $1.29 \times$ on Cortex-A72 and $1.36 \times$ on Apple M1 Pro.
- For the overall scheme, we integrate our AVX2 implementation into the package libsntrup761 provided by [5] for the batch key generation and the package supercop for encapsulation and decapsulation. We reduce the amortized cycles of batch key generation (with batch size 32) by $7.9 \%-12.0 \%$, encapsulation cycles by $7.1 \%-10.3 \%$, and decapsulation cycles by $10.7 \%-$ $13.3 \%$ on Haswell and Skylake. As for our Neon implementation, we integrate our Neon code into the artifact provided by [14]. Our Neon implementation reduces encapsulation cycles by $3.0 \%-6.6 \%$ and decapsulation cycles by $12.8 \%-15.1 \%$ on Cortex-A72 and Apple M1 Pro.

Related works. There is a long list of works related to vectorization and its challenges. The most relevant one is SPIRAL by [13]. They had attempted to formalize the vectorization of FFTs for code generation. However, SPIRAL falls short to cover transformations used in this paper and we believe this paper will give more insights on extending SPIRAL. Regarding polynomial multiplications for NTRU Prime, $[1,2,4]$ discussed polynomial multiplications when one of the operands has coefficients drawn from $\{0, \pm 1\}$, [1] discussed the generic polynomial multiplication with fairly limited support of vectorization, [5,14] discussed
the generic polynomial multiplication with high-dimensional vectorization support.

Code. Our source code is publicaly avaiable at
https://github.com/vector-polymul-ntru-ntrup/NTRU_Prime_truncation under CC0 license.

Structure of this paper. This paper is structured as follows. Section 2 goes through the preliminaries. Section 3 formalizes the vectorization process, Section 4 gives a walkthough on vectorizing polynomial multiplications for sntrup. Finally, Section 5 shows the performance with AVX2 on Haswell and Skylake, and with Neon on Cortex-A72 and Apple M1 Pro.

## 2 Preliminaries

We go through some preliminaries for this paper. Section 2.1 describes the target polynomial arithmetic of NTRU Prime, Section 2.2 reviews some basics from algebra with emphasis on tensor of module homomorphisms, and Section 2.3 describes vector instruction set architectures and extensions.

### 2.1 Streamlined NTRU Prime

NTRU Prime [4] is an alternate candidate of key encapsulation mechanism (KEM) in the 3rd round of NIST Post-Quantum Cryptography (PQC) Standardization and is currently used in OpenSSH 9.0 hybrid sntrup761x25519sha512 key exchange by default ${ }^{1}$. NTRU prime KEM [4] operates over the polynomial rings $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$ and $\mathbb{Z}_{3}[x] /\left\langle x^{p}-x-1\right\rangle$ for primes $p$ and $q$ such that $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle \cong \mathbb{F}_{q^{p}}$. There are two cryptosystems built upon $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$ and $\mathbb{Z}_{3}[x] /\left\langle x^{p}-x-1\right\rangle$ - Streamlined NTRU Prime (sntrup) and NTRU LPRime (ntrulpr). This paper focuses on polynomial multiplications in sntrup761 with $(p, q)=(761,4591)$ and the implementations can be straightforwardly ported into ntrulpr761. See [4] for more details of the scheme. In the following, we list all the polynomial multiplications and inversions required for sntrup.

Key generation: We need one inversion in $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$ and one inversion with invertibility check in $\mathbb{Z}_{3}[x] /\left\langle x^{p}-x-1\right\rangle$ for the secret key, and one polynomial multiplication in $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$ for the public key.
Encapsulation: We need one polynomial multiplication in $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$ for encryption.
Decapsulation: We need one polynomial multiplication in $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$ for encryption, and one polynomial multiplication in $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$ and one polynomial multiplication in $\mathbb{Z}_{3}[x] /\left\langle x^{p}-x-1\right\rangle$ for decryption.

[^0]We focuses on polynomial multiplications and inversions in $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$. We call a polynomial multiplication "big-by-small" if one of the operands is drawn from $\{0, \pm 1\}$ and "big-by-big" otherwise. Our polynomial multipliers target big-by-big ones and also covers the big-by-small ones by definition. For encapsulation and decapsulation, we only need big-by-small polynomial multiplications. For the key generation, we only need big-by-small polynomial multiplication outside the inversion. As for the inversion, the requirement of polynomial multiplications heavily depends on the choice of approach. We simply focus on the divstep approach avoiding any polynomial multiplications and leave the incorporation of jumpdivstep [7] as future work.

To see why big-by-big polynomial multiplications are important, we review Montgomery's trick for batch inversion used in batch key generation [5]. Let's say we want to invert polynomials $\boldsymbol{a}_{0}, \ldots, \boldsymbol{a}_{n-1}$ in $\mathbb{Z}_{q}[x] /\left\langle x^{p}-x-1\right\rangle$. Instead of inverting each of them one at a time, we first compute $\boldsymbol{a}_{0}, \boldsymbol{a}_{0} \boldsymbol{a}_{1}, \ldots, \prod_{i=0, \ldots, n-1} \boldsymbol{a}_{i}$ with $n-1$ polynomial multiplications, and invert $\prod_{i=0, \ldots, n-1} \boldsymbol{a}_{i}$. We now compute $\left(\prod_{i=0, \ldots, n-1} \boldsymbol{a}_{i}\right)^{-1},\left(\prod_{i=0, \ldots, n-2} \boldsymbol{a}_{i}\right)^{-1}, \ldots, \boldsymbol{a}_{0}^{-1}$ with $n-1$ polynomial multiplications. Finally, we iterate over $j=1, \ldots, n-1$ and compute the inverses as $\boldsymbol{a}_{j}^{-1}=\left(\prod_{i=0, \ldots, j} \boldsymbol{a}_{i}\right)^{-1}\left(\prod_{i=0, \ldots, j-1} \boldsymbol{a}_{i}\right)$. In sntrup, since all polynomials to be inverted have coefficients in $\{0, \pm 1\}$, we need $2 n-2$ big-by-small polynomial multiplications, $n-1$ big-by-big polynomial multiplications, and one inversion.

### 2.2 Basics of Algebra

We first go through some basic notations and definitions of algebraic structures for this paper. Readers familiar with modules and associative algebras are free to skip this section and treat this section as a reference. We assume that readers are all familiar with monoids, groups, rings, and modules and refer to standard algebra books $[9,16,17]$ for reference. In this paper, all rings are commutative and unital. Below we go through a short introduction of free modules and associative algebras over a commutative ring $R$.

Modules. The central idea of this paper revolves around free-module homomorphisms and their tensor products. A module is a generalization of a vector space where the underlying ground field is relaxed to a ring. We only consider a special kind of modulues - free modulues of finite ranks. In other words, all modulues in this paper are of the form $R^{n}$ for a ring $R$ and a positive integer $n$, and all elements can be written as a finite sum of the form $\sum_{i} r_{i} e_{i}$ where $e_{i}$ is the element with $i$ th element one and zero elsewhere for all $i$. Given two free modules $R^{n}$ and $R^{m}$, we define the tensor product of $R^{n}$ and $R^{m}$ as the free module consisting of all the elements of the form $\sum_{i} \boldsymbol{a}_{i} \otimes \boldsymbol{b}_{i}$. where $\boldsymbol{a}_{i} \in R^{n}$ and $\boldsymbol{b}_{i} \in R^{m}$ up to certain equivalences.

Suppose we have module homomorphisms $f: R^{n} \rightarrow R^{n}$ and $g: R^{m} \rightarrow R^{m}$. We define the tensor product $f \otimes g: R^{n} \otimes R^{m} \rightarrow R^{n} \otimes R^{m}$ of $f$ and $g$ as

$$
x \otimes y \mapsto f(x) \otimes g(y)
$$

Recall that module homomorphism between modules of finite ranks can be written as matrix multiplications if we specify the bases. Suppose we have bases $\left\{e_{i}\right\} \subset R^{n}$ and $\left\{\tilde{e}_{j}\right\} \subset R^{m}$. Then, $\left\{e_{i} \otimes \tilde{e}_{j}\right\}$ is a basis of $R^{n} \otimes R^{m}$. One can show that the matrix form of $f \otimes g$ with the basis $\left\{e_{i} \otimes \tilde{e}_{j}\right\}$ is the same as the tensor product of the matrix forms of $f$ with $\left\{e_{i}\right\}$ and $g$ with $\left\{\tilde{e}_{j}\right\}$.

By unfolding the definition of a tensor product, we have

$$
\forall f_{0}, f_{1}: R^{n} \rightarrow R^{n}, \forall g_{0}, g_{1}: R^{m} \rightarrow R^{m},\left(f_{0} \circ f_{1}\right) \otimes\left(g_{0} \circ g_{1}\right)=\left(f_{0} \otimes g_{0}\right) \circ\left(f_{1} \otimes g_{1}\right)
$$

where $\circ$ is the function composition. An example that we will frequently encounter in this paper is the case $g_{0}=g_{1}=\operatorname{id}_{m}$, the identity map of $R^{m}$. Suppose we have a factorization for $f: R^{n} \rightarrow R^{n}$ with $f=f_{0} \circ f_{1}$, then we also have

$$
f \otimes \operatorname{id}_{m}=\left(f_{0} \circ f_{1}\right) \otimes\left(\operatorname{id}_{m} \circ \operatorname{id}_{m}\right)=\left(f_{0} \otimes \operatorname{id}_{m}\right) \circ\left(f_{1} \otimes \operatorname{id}_{m}\right)
$$

In general, if $f$ factors into $f_{0} \circ \cdots \circ f_{d-1}$, then $f \otimes \mathrm{id}_{m}=\left(f_{0} \otimes \mathrm{id}_{m}\right) \circ \cdots \circ$ $\left(f_{d-1} \otimes \mathrm{id}_{m}\right)$.

Associative algebras. For an $R$-module $M$, if we adjoin a ring structure to $M$ by introducing a binary associative operator with an identity compatible with $1_{R}$ to the underlying additive group $M$, we call $M$ an associative $R$-algebra. For simplicity, we call an associative $R$-algebra an $R$-algebra or an algebra when the context is clear. For a degree- $n$ polynomial $\boldsymbol{g} \in R[x]$, the quotient ring $R[x] /\langle\boldsymbol{g}\rangle$ is an $R$-algebra since (i) $R[x] /\langle\boldsymbol{g}\rangle$ is a ring and (ii) $R[x] /\langle\boldsymbol{g}\rangle=R^{n}$ as $R$-modules by specifying $x^{i}=(\underbrace{0, \ldots, 0}_{i}, 1, \underbrace{0, \ldots, 0}_{n-1-i})$. Suppose $\boldsymbol{g}=\boldsymbol{g}\left(x^{v}\right)$ for a positive integer $v$, we have $R[x] /\left\langle\boldsymbol{g}\left(x^{v}\right)\right\rangle \cong \mathcal{R}[y] /\langle\boldsymbol{g}(y)\rangle$ where $\mathcal{R}:=R[x] /\left\langle x^{v}-y\right\rangle$. The crucial point is to interpret an $\mathcal{R}$-algebra homomorphism $f_{\mathcal{R}}$ for $\mathcal{R}[y] /\langle\boldsymbol{g}(y)\rangle$ as an $R$-algebra homomorphism for $R[x] /\left\langle\boldsymbol{g}\left(x^{v}\right)\right\rangle$. We claim that $f_{\mathcal{R}} \otimes \operatorname{id}_{v}$ is the desired $R$-algebra homomorphism. Similarly, if we have a factorization of an $\mathcal{R}$-algebra homomorphism $f=f_{0} \circ f_{1}$ for $\mathcal{R}[y] /\langle\boldsymbol{g}(y)\rangle$, we have a composition of $R$-algebra homomorphisms $f_{0} \otimes \mathrm{id}_{v}$ and $f_{1} \otimes \mathrm{id}_{v}$ for $R[x] /\left\langle\boldsymbol{g}\left(x^{v}\right)\right\rangle$.

### 2.3 Vector arithmetic

We go through the vector instruction set/extension covered in this paper.

AVX2. Advanced vector extension 2 (AVX2) is a vector extension to the x86 instruction architecture. In AVX2, there are 16 ymm registers each holding 256 bits of data. In this paper, we only consider 16-bit arithmetic and regard each vectors as packed 16-bit elements. Furthermore, we also have several permutation instructions with two data operands. Frequently, a series of permutation instructions are used for implementing a certain kind of permutation matrices.

Armv8.0-A Neon. The instruction set architecture Armv8.0-A comes with the vector extension Neon. In Neon, there are 32 vector registers (v0 to v31) each holding 128 bits of data and instruction encodings are determined by the instructions along with specifiers following register names. In addition to vector-by-vector multiplication instructions, we have vector-by-scalar multiplications multiplying a vector of elements by a scalar. Similar to AVX2, there is a wide variaty of permutation instructions. One of the convenient ones is ext: we concatenate two 128 -bit vector registers and extract a certain contiguous 16 -byte data from the 32-byte data. This allows us to implement cyclic shifts of tuples in a convenient way.

### 2.4 Cooley-Tukey FFT

Let $n=\prod_{j} n_{j}$, and $i_{j}$ runs over $0, \ldots, n_{j}-1$ for each $j$. The Cooley-Tukey FFT [12] computes $R[x] /\left\langle x^{n}-\zeta^{n}\right\rangle \cong \prod_{i_{0}, \ldots, i_{h-1}} R[x] /\left\langle x-\zeta \omega_{n}^{\sum_{l} i_{l} \Pi_{j<l} n_{j}}\right\rangle$ in a layer-by-layer fashion where $\omega_{n}$ is a principal $n$-th root of unity ${ }^{2}$. The simplest case is the isomorphism $R[x] /\left\langle x^{2^{h}}-1\right\rangle \cong \prod_{i_{0}, \ldots, i_{h-1}} R[x] /\left\langle x-\omega_{2^{h}}^{\sum_{l} i_{l} 2^{l}}\right\rangle$. However, we will encounter various transformations built upon non-power-oftwo Cooley-Tukey FFTs.

### 2.5 Good-Thomas FFT

Good-Thomas FFT is an alternative FFT built upon the coprime factors of the transformation size $n$. We explain the idea briefly with the smallest case $n=6$. Consider the cyclic transformation $\mathbf{F}_{6}: R[x] /\left\langle x^{6}-1\right\rangle \rightarrow \prod_{i} R[x] /\left\langle x-\omega_{6}^{i}\right\rangle$, If we perform pre- and post-permutation for the 1st and the 4th element (we start with 0 ), and define $\omega_{3}:=\omega_{6}^{4}, \omega_{2}:=\omega_{6}^{3}$, we have $P_{(14)} \mathbf{F}_{6} P_{(14)}=\left(\mathbf{F}_{2} \otimes I_{3}\right)\left(I_{2} \otimes \mathbf{F}_{3}\right)$ where $\mathbf{F}_{2}$ and $\mathbf{F}_{3}$ are cyclic transformation of sizes 2 and 3 , respectively. Comparing to Cooley-Tukey FFT, we save two multiplications by $\omega_{6}$ and $\omega_{6}^{2}$.

### 2.6 Truncated Rader's FFT and its Inverse

Let $p$ be an odd prime, and $\mathcal{I}=\{0, \ldots, p-1\}, \mathcal{I}^{*}=\{1, \ldots, p-1\}$ be index sets. Rader's FFT [22] computes the map $R[x] /\left\langle x^{p}-1\right\rangle \cong \prod_{i \in \mathcal{I}} R[x] /\left\langle x-\omega_{p}^{i}\right\rangle$ with a size- $\lambda(p)$ cyclic convolution where $\lambda$ is the Carmichael's lambda function. Due to the page limit, we refer to [22] for the original version and jump into the truncated version introduced by [3].

Let $\Phi_{p}$ be the $p$-th cyclotomic polynomial. Since $p$ is a prime, we have $\Phi_{p}(x)=\sum_{i \in \mathcal{I}} x^{i}$ and $\Phi_{p}(x) \mid\left(x^{p}-1\right)$. A natural question is to build an efficient transformation for $R[x] /\left\langle\Phi_{p}(x)\right\rangle$ from the Rader's FFT for $R[x] /\left\langle x^{p}-1\right\rangle$. We start with isomorphism $\sum_{i \in \mathcal{I}^{*}} a_{i-1} x^{i-1} \mapsto\left(\hat{a}_{j}=\sum_{i \in \mathcal{I}^{*}} a_{i-1} \omega_{p}^{(i-1) j}\right)_{j \in \mathcal{I}^{*}}$ : $R[x] /\left\langle\Phi_{p}(x)\right\rangle \rightarrow \prod_{j \in \mathcal{I}^{*}} R[x] /\left\langle x-\omega_{p}^{j}\right\rangle$, and reindex with $i \mapsto-\log _{g} i$ and

$$
{ }^{2} \forall j=1, \ldots, n-1, \sum_{i=0}^{n-1} \omega_{n}^{i j}=0 .
$$

$j \mapsto \log _{g} j$. We have

$$
\hat{a}_{g^{\log _{g} j}}=\sum_{i \in \mathcal{I}^{*}} a_{i-1} \omega_{p}^{(i-1) j}=\omega_{p}^{-j} \sum_{-\log _{g} i \in \mathbb{Z}_{\lambda(p)}} a_{g^{\log _{g} i}-1} \omega_{p}^{g^{\log _{g} i+\log _{g} j}}
$$

and find that $\left(\omega_{p}^{k} \hat{a}_{g^{k}}\right)_{k \in \mathbb{Z}_{\lambda(p)}}$ is the convolution of $\left(a_{g^{-k}-1}\right)_{k \in \mathbb{Z}_{\lambda(p)}}$ and $\left(\omega_{p}^{g^{k}}\right)_{k \in \mathbb{Z}_{\lambda(p)}}$. This is called the truncated Rader's FFT. Below is an illustration for $p=5$ and $g=2: P_{(23)}\left(\begin{array}{cccc}\omega_{5} & \omega_{5}^{2} & \omega_{5}^{3} & \omega_{5}^{4} \\ \omega_{5}^{2} & \omega_{5}^{4} & \omega_{5} & \omega_{5}^{3} \\ \omega_{5}^{3} & \omega_{5} & \omega_{5}^{4} & \omega_{5}^{2} \\ \omega_{5}^{4} & \omega_{5}^{3} & \omega_{5}^{2} & \omega_{5}\end{array}\right) P_{(312)}=\left(\begin{array}{cccc}\omega_{5}^{2^{0}} & \omega_{5}^{2^{3}} & \omega_{5}^{2^{2}} & \omega_{5}^{2^{1}} \\ \omega_{5}^{2^{1}} & \omega_{5}^{2^{0}} & \omega_{5}^{2^{3}} & \omega_{5}^{2^{2}} \\ \omega_{5}^{2^{2}} & \omega_{5}^{2^{1}} & \omega_{5}^{2^{0}} & \omega_{5}^{2^{3}} \\ \omega_{5}^{2^{3}} & \omega_{5}^{2^{2}} & \omega_{5}^{2^{1}} & \omega_{5}^{2^{0}}\end{array}\right)$. For the inverse, [3, Section 4.8.2] showed how to implement it with a size- $\lambda(p)$ cyclic convolution. They found that convoluting with $\frac{1}{p}\left(\omega_{p}^{-g^{-k}}-1\right)_{k \in \mathbb{Z}_{\lambda(p)}}$ results in the desired inversion. We illustrate below for $p=5$ and $g=2$ :

$$
\left(\begin{array}{cccc}
\omega_{5}^{2^{0}} & \omega_{5}^{2^{3}} & \omega_{5}^{2^{2}} & \omega_{5}^{2^{1}} \\
\omega_{5}^{2^{1}} & \omega_{5}^{2^{0}} & \omega_{5}^{2^{3}} & \omega_{5}^{2^{2}} \\
\omega_{5}^{2^{2}} & \omega_{5}^{2^{1}} & \omega_{5}^{2^{0}} & \omega_{5}^{2^{3}} \\
\omega_{5}^{2^{3}} & \omega_{5}^{2^{2}} & \omega_{5}^{2^{1}} & \omega_{5}^{2^{0}}
\end{array}\right)\left(\begin{array}{l}
\omega_{5}^{-2^{-0}}-1 \\
\omega_{5}^{-2^{-1}}-1 \\
\omega_{5}^{-2^{-2}}-1 \\
\omega_{5}^{-2^{-3}}-1
\end{array}\right)=\left(\begin{array}{l}
5 \\
0 \\
0 \\
0
\end{array}\right)
$$

In summary, we can implement $\eta^{-1}$ by mapping $\left(\hat{a}_{g^{k}}\right)_{k \in \mathbb{Z}_{\lambda(p)}}$ to $\left(\omega_{p}^{k} \hat{a}_{g^{k}}\right)_{k \in \mathbb{Z}_{\lambda(p)}}$ and convoluting with $\left(\omega_{p}^{-g^{-k}}-1\right)_{k \in \mathbb{Z}_{\lambda(p)}}$. Scaling by $\frac{1}{p}$ is postponed to the end. See [3, Sections 4.12 .3 and 4.12.4] for a generalization to arbitrary $p$.

### 2.7 Bruun's FFT

Let $q$ be a prime with $q \equiv 3 \bmod 4$ and $q+1=r 2^{w}$ for an odd $r$. Bruun's FFT allows us to split $\mathbb{Z}_{q}[x] /\left\langle x^{2^{w}}+1\right\rangle$ into $\prod_{i} \frac{\mathbb{Z}_{q}[x]}{\left\langle x^{2} \pm \alpha_{i} x-1\right\rangle}$. See [8] for a proof. For $q=4591$, we can split $\mathbb{Z}_{q}[x] /\left\langle x^{16}+1\right\rangle$ into size- 2 polynomial rings with moduli of the form $x^{2} \pm \alpha_{i} x-1$ since $4591+1=287 \cdot 2^{4}$. In this paper, we are interested in the case $\mathbb{Z}_{q}[x] /\left\langle x^{16}+1\right\rangle \cong \prod \mathbb{Z}_{q}[x] /\left\langle x^{8} \pm \sqrt{2} x^{4}+1\right\rangle$. See [14, Section 3.3] for its implementation.

Bruun's FFT was originally proposed with $\mathbb{C}$ as the coefficient ring. See [10] for the power-of-two case and [20] for the even case.

### 2.8 Twisting

Let $R$ be a ring, $\zeta \in R$ be an invertible element, $n$ be an integer, and $\xi \in R$ be an element. We have the isomorphism $R[x] /\left\langle x^{n}-\xi \zeta^{n}\right\rangle \cong R[y] /\left\langle y^{n}-\xi\right\rangle$ by sending $x$ to $\zeta y$. This is called twisting. Obviously, twisting amounts to multiplying all the coefficients by certain constants and its transformation matrix is a diagonal matrix. In the literature, twising is commonly specialized to $\xi=1$, but we need the cases $\xi= \pm 1$ in this paper.

### 2.9 Karatsuba

Karatsuba [18] computes the product $\left(a_{0}+a_{1} x\right)\left(b_{0}+b_{1} x\right)$ by evaluating at the point set $\{0,1, \infty\}$. We compute $\left(a_{0}+a_{1} x\right)\left(b_{0}+b_{1} x\right)=a_{0} b_{0}+\left(a_{0} b_{1}+a_{1} b_{0}\right) x+$ $a_{1} b_{1} x^{2}$ with three multiplications $a_{0} b_{0}, a_{1} b_{1}$, and $\left(a_{0}+a_{1}\right)\left(b_{0}+b_{1}\right)$ by observing $a_{0} b_{1}+a_{1} b_{0}=\left(a_{0}+a_{1}\right)\left(b_{0}+b_{1}\right)-a_{0} b_{0}-a_{1} b_{1}$.

## 3 Formalization of Vectorization

We formalize "vectorization-friendliness" and "permutation-friendliness", and review the role of Toeplitz matrix-vector products in vectorization. While computing with vector instructions, we choose algebra homomorphisms $f$ and $g$ such that $f$ is vectorization-friendly and $g$ is permutation-friendly or amounts to computing Toeplitz matrices when there are vector-by-scalar multiplication instructions. Their composition $g \circ f$ then admits suitable mapping to vector arithmetic.

Section 3.1 formalizes vectorization-friendliness capturing the power-of-two nature of vector registers, Section 3.2 formalizes permutation-friendliness capturing the permutation nature, and Section 3.3 reviews small-dimensional Toeplitz matrix-vector products.

### 3.1 Vectorization-Friendliness

Conceptually, we call an algebra homomorphism vectorization-friendly if we can factor it into module homomorphisms with matrix forms of certain kinds of block diagonal matrices or tensor products with $I_{v}$ as the right operand. We first identify a set of matrices that can be implemented efficiently with vector instructions straightforwardly. Let $v^{\prime}$ be a multiple of $v$. We define BlockDiag as the set of all block diagonal matrices with each block a $v^{\prime} \times v^{\prime}$ matrix of the following form:

1. Diagonal matrix: a matrix with non-diagonal entries all zeros.
2. Cyclic/negacyclic shift matrix: a matrix implementing $\left(a_{i}\right) \mapsto\left(a_{(i+c) \bmod v^{\prime}}\right)$ (cyclic) or $\left(a_{i}\right) \mapsto\left((-1) \llbracket i+c \geq v^{\prime} \rrbracket a_{(i+c) \bmod v^{\prime}}\right)$ (negacyclic) for a non-negative integer $c$.

Diagonal matrices are suitable for vectorization since we can load $v$ coefficients, multiply them by $v$ constants, and store them back to memory with vector instructions. For cyclic/negacyclic shift matrices, we discuss how to implement them for the following vector instruction sets:

- Armv7/8-A Neon: For cyclic shifts, we use the instruction ext extracting consecutive elements from a pair of vector registers. We negate one of the registers before applying ext for negacyclic shifts [14].
- AVX2: For cyclic shifts, we perform unaligned loads, shuffle the last vector register, and store the vectors to memory. Again, the last vector register is negated for negacyclic shifts [5].

Let $f$ be an algebra monomorphism, and $M_{f}$ be the matrix form of $f$. We call $f$ vectorization-friendly if

$$
M_{f}=\prod_{i}\left(M_{f_{i}} \otimes I_{v}\right) S_{f_{i}}
$$

for some $M_{f_{i}}$ and $S_{f_{i}} \in$ BlockDiag. The tensor product $M_{f_{i}} \otimes I_{v}$ ensures that each $v$-chunk is regarded as a whole while applying $M_{f_{i}} \otimes I_{v}$. Additionally, $f$ is vectorization-friendly if and only if $f^{-1}$ is vectorization-friendly, so we only need to discuss the vectorization-friendliness of a monomorphism and its inverse follows immediately.

### 3.2 Permutation-Friendliness

We introduce the notion "permutation-friendliness". Conceptually, permutationfriendliness stands for vectorization-friendliness after applying a special type of permutation - interleaving. Again, let $v^{\prime}$ be a multiple of $v$. We define the transposition matrix $T_{v^{\prime 2}}$ as the $v^{2} \times v^{2}$ matrix permuting the elements as if transposing a $v^{\prime} \times v^{\prime}$ matrix. Now we are ready to specify the set Interleave of interleaving matrices. We call a matrix $M$ interleaving matrix with step $v^{\prime}$ if it takes the form

$$
M=\left(\pi^{\prime} \otimes I_{v^{\prime}}\right)\left(I_{m} \otimes T_{v^{\prime 2}}\right)\left(\pi \otimes I_{v^{\prime}}\right)
$$

for a positive integer $m$ and permutation matrices $\pi, \pi^{\prime}$ permuting $m v^{\prime}$ elements. The set Interleave consists of interleaving matrices of all possible steps and is closed under inversion.

We call an algebra monomorphism $g$ permutation-friendly if we can factor its matrix form $M_{g}^{\prime}$ as

$$
M_{g}^{\prime}=\prod_{i} S_{g_{i}} M_{g_{i}}
$$

for $S_{g_{i}} \in$ Interleave and vectorization-friendly $M_{g_{i}}$ 's. Immediately, we know that $g$ is permutation-friendly if and only if $g^{-1}$ is permutation-friendly.

### 3.3 Toeplitz Matrix-Vector Product (Small Dimensional)

We go through an alternative for permutation friendliness when there are vector-by-scalar multiplication instructions. Suppose we have a vectorization-friendly monomorphism resulting several small-dimensional power-of-two-size cyclic/negacyclic convolutions. By the definition of vectorization-friendliness, a cyclic/negacyclic convolution can be phrased as applying a $v^{\prime} \times v^{\prime}$ Toeplitz matrix to a vector for a $v$-multiple $v^{\prime}$. We call a matrix $M$ Toeplitz if $M_{i, j}=M_{i+1, j+1}$ for all possible $i, j$. Generally, one can write a polynomial multiplication modulo $x^{v^{\prime}}-\zeta$ as an application of a Toeplitz matrix constructed from one of the operands $[6,15,19]$. Recently, [11] decomposed the application of a $v^{\prime} \times v^{\prime}$ Toeplitz matrix as a sum of column-to-scalar multiplications and implemented each with a vector-by-scalar multiplication instruction.

## 4 Vectorized Polynomial Multipliers

This section describes our polynomial multiplications for $\mathbb{Z}_{4591}[x] /\left\langle x^{761}-x-1\right\rangle$. A standard approach is to multiply in $\mathbb{Z}_{4591}[x] /\langle\boldsymbol{g}\rangle$ with $\operatorname{deg}(\boldsymbol{g})>2 \cdot 760$ followed by polynomial reduction modulo $x^{761}-x-1$. We propose to multiply in $\mathbb{Z}_{4591}[x] /\left\langle\Phi_{17}\left(x^{96}\right)\right\rangle$ where $\operatorname{deg}\left(\Phi_{17}\left(x^{96}\right)\right)=1536>2 \cdot 760$. For simplicity, we assume $R=\mathbb{F}_{4591}$ in this section.

There are two steps for deciding isomorphisms admitting suitable mapping to vector arithmetic. The first step is to find an isomorphism honoring our intuition of the memory layout - we choose an isomorphism dividing a large problem into several subproblems of sizes multiples of $v$ (the number of elements contained in a vector register). Section 4.1 describes our isomorphisms resulting several size- 16 subproblems. The second step is to decide isomorphisms computing the remaining task. Section 4.2.1 discusses a permutation-friendly approach and Section 4.2.2 discusses our Toeplitz matrix-vector product approach. Finally, we go through a detailed comparisons to existing works with emphases on vectorization-friendliness and permutation-friendliness in Section 4.3.

### 4.1 The Vectorization-Friendly Phase

We first go through the implementation of

$$
\frac{R[x]}{\left\langle\Phi_{17}\left(x^{96}\right)\right\rangle} \cong\left(\prod \frac{R[x]}{\left\langle x^{16} \pm 1\right\rangle}\right)^{48}
$$

Since the resulting polynomial rings have size 16 , our transformation is evidently vectorization-friendly. We detail below the contructions and their vectorizationfriendliness.

### 4.1.1 Truncated Rader's FFT

Let $\eta_{0}: R^{16} \rightarrow R^{16}$ be the module map implementing the permutation and cyclic convolution parts of the truncated size-17 Rader's FFT. $R[x] /\left\langle\Phi_{17}(x)\right\rangle \cong$ $\prod_{i=0}^{15} R[x] /\left\langle x-\omega_{17}^{i+1}\right\rangle$ is implemented as $\mathrm{mul}_{0} \circ \eta_{0}$ where $\mathrm{mul}_{0}:=\left(a_{i}\right)_{i=0, \ldots, 15} \mapsto$ $\left(\omega_{17}^{-(i+1)} a_{i}\right)_{i=0, \ldots, 15}$. We tensor the composition $\operatorname{mul}_{0} \circ \eta_{0}$ by $I_{96}$ to implement $R[x] /\left\langle\Phi_{17}\left(x^{96}\right)\right\rangle \cong \prod_{i=0}^{15} R[x] /\left\langle x^{96}-\omega_{17}^{i+1}\right\rangle$. We then twist all the rings to the cyclic ones via the product map twist ${ }_{0}:=\prod_{i=0}^{15}\left(x \mapsto \omega_{17}^{14(i+1)} x\right)^{3}$. To sum up, we implement $R[x] /\left\langle\Phi_{17}\left(x^{96}\right)\right\rangle \cong\left(R[x] /\left\langle x^{96}-1\right\rangle\right)^{16}$ as

$$
\text { twist }_{0} \circ\left(\left(\mathrm{mul}_{0} \circ \eta_{0}\right) \otimes I_{96}\right)
$$

which is obviously vectorization friendly.

[^1]

Fig. 1: Overview of the correspondence between algebraic maps and 128 -bit vector register view in Neon. Each rectangles holds $\frac{128}{16}=8$ coefficients and is loaded to a vector register. Similar justification of vectorization-friendliness holds if we move to 256 -bit vector registers in AVX2.

### 4.1.2 Good-Thomas FFT

Next, we turn the ring $R[x] /\left\langle x^{96}-1\right\rangle$ into $\left(\prod R[x] /\left\langle x^{16} \pm 1\right\rangle\right)^{3}$ by applying Good-Thomas FFT and twisting. Let $\eta_{1}$ be the map implementing the GoodThomas FFT of dimension $3 \times 2$, and twist $_{1}$ twisting the product ring into $\left(\prod R[x] /\left\langle x^{16} \pm 1\right\rangle\right)^{3}$. Then, twist ${ }_{1} \circ\left(\eta_{1} \otimes I_{16}\right)$ implements $R[x] /\left\langle x^{96}-1\right\rangle \cong$ $\left(\prod R[x] /\left\langle x^{16} \pm 1\right\rangle\right)^{3}$. Since there are 16 copies of $R[x] /\left\langle x^{96}-1\right\rangle$, we have

$$
I_{16} \otimes\left(\text { twist }_{1} \circ\left(\eta_{1} \otimes I_{16}\right)\right)=\left(I_{16} \otimes \text { twist }_{1}\right) \circ\left(I_{16} \otimes \eta_{1} \otimes I_{16}\right)
$$

as the overall transformation. Obviously, this is vectorization friendly.
For a more illustrative explanation of how polynomials are mapped to 128-bit registers, we outline the workflow in Figure 1 where each rectangles represents a 128-bit register. Note that similar justification holds for 256 -bit registers since we are right-tensoring by $I_{16}$.

### 4.2 Small-Dimensional Cyclic/Negacyclic Convolutions

This section goes through our approaches multiplying in

$$
\left(\prod \frac{R[x]}{\left\langle x^{16} \pm 1\right\rangle}\right)^{48}
$$

We propose two approaches: a permutation-friendly approach for AVX2 and a Toeplitz matrix-vector product approach for Neon.

### 4.2.1 A Permutation-Friendly Approach



Fig. 2: Overview of permutations implementing permutation-friendliness for our AVX2 implementation defined on $\left(R[x] /\left\langle x^{16} \pm 1\right\rangle\right)^{16}$. Same idea applies to $\left(R[x] /\left\langle x^{16} \pm 1\right\rangle\right)^{48}$ since $48=3 \cdot 16$. Each rectangles represents a 16 -tuple mapped to a 256 -bit vector register in AVX2.

We first go through the permutation-friendly approach used in our AVX2 implementation. Since the goal is to interleave 16 polynomial rings with the same shape of computation, we demonstrate how to map the multiplication in $\left(\prod R[x] /\left\langle x^{16} \pm 1\right\rangle\right)^{16}$ to vector arithmetic. We perform an even-odd permutation over 16-tuples resulting $\left(R[x] /\left\langle x^{16}-1\right\rangle\right)^{16} \times\left(R[x] /\left\langle x^{16}+1\right\rangle\right)^{16}$ followed by two copies of $T_{256}$. This gives us the map

$$
\left(I_{2} \otimes T_{256}\right)\left(\text { EvenOdd }_{32} \otimes I_{16}\right)
$$

where EvenOdd ${ }_{32}$ moves the even indices to the first half and the odd indices to the second half. See Figure 2 for an illustration. The overall interleaving matrix for $\left(\prod \frac{R[x]}{\left\langle x^{16} \pm 1\right\rangle}\right)^{48}$ can be written as:

$$
\left(I_{6} \otimes T_{256}\right)\left(I_{3} \otimes \text { EvenOdd }_{32} \otimes I_{16}\right)
$$

which is permutation-friendly. Finally, we apply Cooley-Tukey to $R[x] /\left\langle x^{16}-1\right\rangle \cong$ $\prod R[x] /\left\langle x^{8} \pm 1\right\rangle$ and Bruun to $R[x] /\left\langle x^{16}+1\right\rangle \cong R[x] /\left\langle x^{8} \pm \sqrt{2} x^{4}+1\right\rangle$ followed by Karatsuba defined over vector registers.

### 4.2.2 Toeplitz Matrix-Vector Products

Recall that one can phrases polynomial multiplications in $R[x] /\left\langle x^{v^{\prime}} \pm 1\right\rangle$ as Toeplitz matrix-vector products for $v^{\prime}$ a multiple of $v$ (cf. Section 3.3). We describe an alternative approach for multiplying in $\left(\prod R[x] /\left\langle x^{16} \pm 1\right\rangle\right)^{48}$ with Neon. Since each vector registers in Neon holds eight coefficients, we first split $R[x] /\left\langle x^{16}-1\right\rangle$ into $\prod R[x] /\left\langle x^{8} \pm 1\right\rangle$, and apply Toeplitz matrix-vector multiplications in $R[x] /\left\langle x^{8} \pm 1\right\rangle$ and $R[x] /\left\langle x^{16}-1\right\rangle$. The implementations follow analogously from [11].

### 4.3 Comparisons to Prior Implementations

We compare our vectorized implementations to prior FFT works operating over $R=\mathbb{Z}_{4591}$. Table 1 summarizes the vectorization- and permutation-friendliness of existing polynomial multipliers over $R$. Table 2 summarizes existing vectorizationfriendly approaches with AVX2 and Neon, Table 3 summarizes existing permutationfriendly approaches with AVX2, and Table 4 summarizes existing permutationfriendly and Toeplitz matrix-vector product approaches with Neon.

|  | $[1]$ | $[5]$ | $[14]$ | This work |
| :--- | :---: | :---: | :---: | :---: |
| ISA/extension | Armv7E-M | AVX2 | Neon | Neon/AVX2 |
| \# halfword |  |  |  |  |
| in a vector register | 2 | 16 | 8 | $8 / 16$ |
| Domain | $\frac{R[x]}{\left\langle x^{1530}-1\right\rangle}$ | $\frac{R[x]}{\left\langle\frac{x^{2048}-1}{\left.x^{512}\right\rangle}\right\rangle}$ | $\frac{R[x]}{\left\langle x^{1632}-1\right\rangle}$ | $\frac{R[x]}{\left\langle\Phi_{17}\left(x^{\left.\left.x^{96}\right)\right\rangle}\right.\right.}$ |
| FFT | Rader, <br> CT | Schönhage, <br> Nussbaumer | Rader, <br> GT | truncated Rader, <br> GT |
| Vectorization-friendly | $v=2$ (Yes) | $v=64$ (Yes) | $v=32$ (Yes) | $v=32$ (Yes) |
| Permutation-friendly | $v=1$ (No) | $v=32$ (Yes) | $v=4$ (No) | $v=16$ (Yes) |

Table 1: Summary of maximum possible $v$ justifying vectorization- and permutation-friendliness of existing polynomial multipliers over $\mathbb{Z}_{4591}$ for $\mathbb{Z}_{4591}[x] /\left\langle x^{761}-x-1\right\rangle$. CT stands for Cooley-Tukey FFT and GT stands for Good-Thomas FFT.

Comparison(s) to $R[x] /\left\langle x^{1530}-1\right\rangle$ from [1]. The earliest FFT work over $R$ was implemented by [1]. Since 4591 is a prime, one can only define Cooley-Tukey FFTs of sizes factors of $4591-1=2 \cdot 3^{2} \cdot 5 \cdot 17$. They computed the isomorphsims

|  | $[5]$ | $[14]$ | This work |
| :--- | :---: | :---: | :---: |
| ISA/extension | AVX2 | Neon | Neon/AVX2 |
| Domain | $\frac{R[x]}{\left\langle\frac{x^{2}+48-1}{x^{512}+1}\right\rangle}$ | $\frac{R[x]}{\left\langle x^{1632}-1\right\rangle}$ | $\frac{R[x]}{\left\langle\Phi_{17}\left(x^{96}\right)\right\rangle}$ |
| FFT | Schönhage | Rader-17+GT | truncated Rader-17+GT |
| Image | $\left(\frac{R[x]}{\left\langle x^{64}+1\right\rangle}\right)^{48}$ | $\prod_{i} \frac{R[x]}{\left\langle x^{16}-\omega_{102}^{i}\right\rangle}$ | $\left(\prod \frac{R[x]}{\left\langle x^{16} \pm 1\right\rangle}\right)^{48}$ |

Table 2: Summary of vectorization-friendly approaches.

|  | $[5]$ | This work |
| :--- | :---: | :---: |
| Domain | $\left(\frac{R[x]}{\left\langle x^{64}+1\right\rangle}\right)^{48}$ | $\left(\prod \frac{R[x]}{\left\langle x^{16} \pm 1\right\rangle}\right)^{48}$ |
| FFT | Nussbaumer | CT + Bruun |
| Image | $\left(\frac{R[z]}{\left\langle z^{8}+1\right\rangle}\right)^{768}$ | $\left(\prod \frac{R[x]}{\left\langle x^{8} \pm 1\right\rangle} \times \prod \frac{R[x]}{\left\langle x^{8} \pm \sqrt{2} x^{4}+1\right\rangle}\right)^{48}$ |
| Follow up polymul. | Recursive K | K |
| Multiplication instruction | Vector-by-vector | Vector-by-vector |

Table 3: Summary of permutation-friendly approaches with AVX2. K stands for Karatsuba.
$R[x] /\left\langle x^{1530}-1\right\rangle \cong \prod_{i} R[x] /\left\langle x^{90}-\omega_{17}^{i}\right\rangle \cong \prod_{i} R[x] /\left\langle x^{10}-\omega_{102}^{i}\right\rangle$ with size-17 Rader's and Cooley-Tukey FFTs. Since 2 is the only power-of-two factor of 1530, their isomorphisms are not vectorization-friendly if there are more than two elements in a vector register.

Comparison(s) to $R[x] /\left\langle\frac{x^{2048}-1}{x^{512}+1}\right\rangle$ from [5]. We compare our AVX2 implementation to the state-of-the-art AVX2 work by [5]. In [5], they made a first attempt to deliver a large-dimensional power-of-two-sized FFT polynomial multiplier in AVX2 based on Schönhage's and Nussbaumer's FFTs. Since $\left(x^{2048}-1\right) /\left(x^{512}+1\right)$ is a factor of $x^{2048}-1$, they applied the Schönhage's FFT in a similar way for $R[x] /\left\langle x^{2048}-1\right\rangle$, leading to polynomial multiplications in $R[x] /\left\langle x^{64}+1\right\rangle$. They then applied Nussbaumer's FFT to all the 48 copies of $R[x] /\left\langle x^{64}+1\right\rangle$. One can show that power-of-two Schönhage's FFT is vectorization-friendly and Nussbaumer's FFT is permutation-friendly, and the overall computation is suitable for vectorization. As for polynomial multiplications in $R[z] /\left\langle z^{8}+1\right\rangle$, they applied recursive Karatsuba. The downsize of their approach is the number of subproblems. Since each applications of Schönhage's and Nussbaumer's FFTs doubles the number of coefficients, there are eventually $\frac{1536 \cdot 4}{8}=768$ polynomial multiplications in the ring $R[z] /\left\langle z^{8}+1\right\rangle$. In our transformation for AVX2, we only need $48 \cdot 4=192$ size- 8 polynomial multipli-
cations. This is the main reason why our AVX2 implementation outperform [5]'s implementation.

Comparison(s) to $R[x] /\left\langle x^{1632}-1\right\rangle$ from [14]. Finally, we compare our Neon implementation to the state-of-the-art Neon work by [14]. They applied a 3-dimensional Good-Thomas FFT to $R[x] /\left\langle x^{1632}-1\right\rangle$ buit upon the coprime factorization $\frac{1632}{16}=2 \cdot 3 \cdot 17$ and Rader's FFT for the size- 17 transformation, resulting in $R[x] /\left\langle x^{16}-\omega_{102}^{i}\right\rangle$ up to a suitable permutation. Since $102 \cdot 16=$ 1632 is not a multiple of 64 (there are 8 elements in each vector register and $64=$ $8^{2}$ ), the follow up computation can't be permutation-friendly. They then applied radix-2 Cooley-Tukey and Bruun's FFT to $\prod_{i<96} R[x] /\left\langle x^{16}-\omega_{102}^{i}\right\rangle$. For the remaining part $\prod_{i \geq 96} R[x] /\left\langle x^{16}-\omega_{102}^{i}\right\rangle$, they interleaved the polynomials with don't-cares and applied naïve computation. Our transformation removes this part.

|  | [14] | This work |
| :---: | :---: | :---: |
| Domain | $\prod_{i} \frac{R[x]}{\left\langle x^{16}-\omega_{102}^{i}\right\rangle}$ | $\left(\prod \frac{R[x]}{\left\langle x^{16} \pm 1\right\rangle}\right)^{48}$ |
| FFT | CT + Bruun | CT |
| Image |  | $\left(\prod \frac{R[x]}{\left\langle x^{8} \pm 1\right\rangle} \times \frac{R[x]}{\left\langle x^{16}+1\right\rangle}\right)^{48}$ |
| Follow up polymul. | Naïve (size-8) + K (size-16) | Toeplitz |
| Multiplication instruction | Vector-by-vector | Vector-by-scalar |

Table 4: Summary of permutation-friendly and Toeplitz matrix-vector product approaches multiplying small-dimensional polynomials in Neon.

## 5 Results

### 5.1 Benchmarking Environment

Intel processors with AVX2. We benchmark our AVX2 implementation on a single core of an $\operatorname{Intel}(\mathrm{R})$ Core(TM) i7-4770K (Haswell) processor with frequency 3.5 GHz , and $\operatorname{Intel}(\mathrm{R}) \mathrm{Xeon(R)}$ CPU E3-1275 v5 (Skylake) with frequency 3.6 GHz . For benchmarking polynomial multiplications, we compile with GCC 10.4.0 on Haswell and GCC 11.3.0 on Skylake using the optimization flag 03. For the batch key generation, we reuse the libsntrup761-20210608 package from [5]. For the encapsulation and decapsulation, we benchmark with the bench-
marking framework SUPERCOP, version supercop-20230530. TurboBoost and hyperthreading are disabled throughout the entire benchmarking.

Armv8.0 ${ }^{+}$A Neon. We benchmark our Neon implementation on a Raspberry Pi 4 Model B and Apple M1 Pro. Raspberry Pi 4 comes with the quad-core (Cortex-A72) Broadcom BCM2711 chipset and runs at 1.5 GHz . Apple M1 Pro is a system-on-chip featuring eight high-performance cores "Firestorms" running at 3.2 GHz and two energy-efficient cores "Icestorm" running at 2.0 GHz . We compile our code with GCC version 12.3 .0 with -03 on Cortex-A72, and GCC version 13.2.0 with -03 on Firestorm.

### 5.2 Performance of Polynomial Multiplication

Table 5: Performance cycles of polynomial multiplications over $\mathbb{Z}_{4591}$ for sntrup761.

| AVX2 |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | [5]* This work |  | [5]* This work |  |
|  | Haswell |  | Skylake |  |
| mulcore ( $\left.\mathbb{Z}_{4591}[x]\right)$ | 23460 | 12336 | 20070 | 9778 |
| polymul $\left(\frac{\mathbb{Z}_{4591}[x]}{\left\langle x^{761}-x-1\right\rangle}\right)$ | 25356 | 12760 | 21364 | 9876 |
| Neon |  |  |  |  |
|  | [14] This work\|| |  | [14]* ${ }^{\text {T }}$ This work |  |
|  | Cortex-A72 |  | Apple M1 Pro |  |
| mulcore ( $\left.\mathbb{Z}_{4591}[x]\right)$ | 37475 | 29909 | 8120 | 6508 |
| polymul $\left(\frac{\mathbb{Z}_{4591}[x]}{\left\langle x^{761}-x-1\right\rangle}\right)$ | 39788 | 30912 | 9091 | 6697 |

* Our own benchmarks.

We provide the performance cycles of functions mulcore and polymul in Table 5 . mulcore computes the product in $\mathbb{Z}_{4591}[x]$ with potential scaling by a predefined constant, and polymul additionally reduces the product modulo $x^{761}-x-1$ and mitigates the potential scaling. Our AVX2-optimized mulcore outperforms the state-of-the-art AVX2 implementation from [5] by factors of $1.90 \times$ and $2.05 \times$ on Haswell and Skylake, and polymul outperforms the state-of-the-art AVX2 implementation by factors of $1.99 \times$ and $2.16 \times$ on Haswell and Skylake. As for our Neon-optimized mulcore and polymul, they outperform the state-of-the-art Neon implementation from [14] by factors of $1.25 \times$ and $1.29 \times$ on Cortex-A72, and $1.25 \times$ and $1.36 \times$ on Appl1 M1 Pro.

### 5.3 Performance of Scheme

Finally, we compare the overall performance of sntrup761, and summarize them in Table 6.

AVX2 code package(s). For the AVX2-optimized implementation, we integrate our code into the package libsntrup761 with version 20210608 provided by [5], and report the the amortized cost of batch key generation with batch size 32. Additionally, we also integrate our code into the package supercop with version 20230530, and report the performance of encapsulation and decapsulation after contacting the authors of [5] for reproducing the numbers in their work.

Neon code package(s). For the Neon-optimized implementation, We integrate our code into the artifact provided by [14].

Overall performance with AVX2. For the batch key generation with batch size 32 , we reduce the amortized cost by $12.0 \%$ on Haswell and $7.9 \%$ on Skylake. For encapsulation, we reduce the cost by $7.1 \%$ on Haswell and $10.3 \%$ on Skylake. For decapsulation, we reduce the cost by $10.7 \%$ on Haswell and $13.3 \%$ on Skylake.

Overall performance with Neon. For the encapsulation, we reduce the cycles by $6.6 \%$ on Cortex-A72 and $3.0 \%$ on Apple M1 Pro, and for the decapsulation, we reduce the cycles by $15.1 \%$ on Cortex-A72 and $12.8 \%$ on Apple M1 Pro.

Table 6: Overall performance of our AVX2 implementation on Haswell and Skylake and our Neon implementation on Cortex-A72 and Apple M1.

| AVX2 |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | Haswell |  | Skylake |  |
|  | [5]** | This work\| | [5]** | This work |
| Batch key generation | 154552 | 136003 | 129159 | 118939 |
|  | SUPERCOP | This work | SUPERCOP | This work |
| Encapsulation | 47464 | 44108 | 40653 | 36486 |
| Decapsulation | 56064 | 50080 | 47387 | 41070 |
| Neon |  |  |  |  |
|  | Cortex-A72 |  | Apple M1 Pro |  |
|  | [14]** | This work\| | [14]** | This work |
| Key generation | 6574055 | 6539849 \|| | 1813947 | 1806741 |
| Encapsulation | 150054 | 140107 | 64924 | 62959 |
| Decapsulation | 159286 | 135184 | 43778 | 38196 |

[^2]
## References

[1] Alkim, E., Cheng, D.Y.L., Chung, C.M.M., Evkan, H., Huang, L.W.L., Hwang, V., Li, C.L.T., Niederhagen, R., Shih, C.J., Wälde, J., Yang, B.Y.: Polynomial Multiplication in NTRU Prime Comparison of Optimization Strategies on Cortex-M4. IACR Transactions on Cryptographic Hardware and Embedded Systems 2021(1), 217-238 (2021), https://tches.iacr. org/index.php/TCHES/article/view/8733 3, 14
[2] Alkim, E., Hwang, V., Yang, B.Y.: Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4. IACR Transactions on Cryptographic Hardware and Embedded Systems 2022(4), 349-371 (2022) 3
[3] Bernstein, D.J.: Fast norm computation in smooth-degree abelian number fields. Cryptology ePrint Archive, Paper 2022/980 (2022), https: //eprint.iacr.org/2022/980 3, 7, 8
[4] Bernstein, D.J., Brumley, B.B., Chen, M.S., Chuengsatiansup, C., Lange, T., Marotzke, A., Peng, B.Y., Tuveri, N., van Vredendaal, C., Yang, B.Y.: NTRU Prime. Submission to the NIST Post-Quantum Cryptography Standardization Project [21] (2020), https://ntruprime.cr.yp.to/ 3, 4
[5] Bernstein, D.J., Brumley, B.B., Chen, M.S., Tuveri, N.: OpenSSLNTRU: Faster post-quantum TLS key exchange. In: 31st USENIX Security Symposium (USENIX Security 22). pp. 845-862 (2022) 2, 3, 5, 9, 14, 15, 16, 17, 18
[6] Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.Y.: High-speed high-security signatures. Journal of cryptographic engineering 2(2), 77-89 (2012) 10
[7] Bernstein, D.J., Yang, B.Y.: Fast constant-time gcd computation and modular inversion. IACR Transactions on Cryptographic Hardware and Embedded Systems 2019(3), 340-398 (2019), https://tches.iacr.org/index. php/TCHES/article/view/8298 5
[8] Blake, I.F., Gao, S., Mullin, R.C.: Explicit Factorization of $x^{2^{k}}+1$ over $\mathbb{F}_{p}$ with Prime $p \equiv 3 \bmod 4$. Applicable Algebra in Engineering, Communication and Computing 4(2), 89-94 (1993) 8
[9] Bourbaki, N.: Algebra I. Springer (1989) 5
[10] Bruun, G.: z-transform DFT Filters and FFT's. IEEE Transactions on Acoustics, Speech, and Signal Processing 26(1), 56-63 (1978) 8
[11] Chen, H.T., Chung, Y.H., Hwang, V., Yang, B.Y.: Algorithmic Views of Vectorized Polynomial Multipliers - NTRU. In: Chattopadhyay, A., Bhasin, S., Picek, S., Rebeiro, C. (eds.) Progress in Cryptology - INDOCRYPT 2023. pp. 177-196. Springer Nature Switzerland (2024), https : //link.springer.com/chapter/10.1007/978-3-031-56235-8_9 10, 14
[12] Cooley, J.W., Tukey, J.W.: An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation 19(90), 297-301 (1965) 7
[13] Franchetti, F., Low, T.M., Popovici, D.T., Veras, R.M., Spampinato, D.G., Johnson, J.R., Püschel, M., Hoe, J.C., Moura, J.M.: Spiral: Extreme performance portability. Proceedings of the IEEE 106(11), 1935-1968 (2018), https://ieeexplore.ieee.org/document/8510983 3
[14] Hwang, V., Liu, C.T., Yang, B.Y.: Algorithmic Views of Vectorized Polynomial Multipliers - NTRU Prime (2024), to appear at ACNS 2024, currently available at https://eprint.iacr.org/2023/1580 3, 8, 9, 14, 15, 16, 17, 18
[15] Hwang, V.B.: Case Studies on Implementing Number-Theoretic Transforms with Armv7-M, Armv7E-M, and Armv8-A. Master's thesis (2022), https: //github.com/vincentvbh/NTTs_with_Armv7-M_Armv7E-M_Armv8-A 10
[16] Jacobson, N.: Basic Algebra I. Courier Corporation (2012) 5
[17] Jacobson, N.: Basic Algebra II. Courier Corporation (2012) 5
[18] Karatsuba, A.A., Ofman, Y.P.: Multiplication of many-digital numbers by automatic computers. In: Doklady Akademii Nauk. vol. 145(2), pp. 293-294 (1962) 9
[19] Írem Keskinkurt Paksoy, Cenk, M.: Faster NTRU on ARM Cortex-M4 with TMVP-based multiplication (2022), https://ia.cr/2022/300 10
[20] Murakami, H.: Real-valued fast discrete Fourier transform and cyclic convolution algorithms of highly composite even length. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. vol. 3, pp. 1311-1314 (1996) 8
[21] NIST, the US National Institute of Standards and Technology: Postquantum cryptography standardization project. https://csrc.nist.gov/ Projects/post-quantum-cryptography 19
[22] Rader, C.M.: Discrete fourier transforms when the number of data samples is prime. Proceedings of the IEEE 56(6), 1107-1108 (1968) 7
[23] Shor, P.W.: Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th annual symposium on foundations of computer science. pp. 124-134. IEEE (1994) 2


[^0]:    ${ }^{1}$ https://marc.info/?l=openssh-unix-dev\&m=164939371201404\&w=2.

[^1]:    ${ }^{3}$ Notice that $\omega_{17}=\omega_{17}^{1344}=\left(\omega_{17}^{14}\right)^{96}$.

[^2]:    ** Our own benchmarks.

