# Barrett Multiplication for Dilithium on Embedded Devices 

# Studies on the impact of available multiplication instructions 

Vincent Hwang ${ }^{1}$, YoungBeom Kim ${ }^{2}$ and Seog Chung Seo $^{2}$<br>${ }^{1}$ Max Planck Institute for Security and Privacy, Bochum, Germany<br>vincentvbh7@gmail.com<br>${ }^{2}$ Kookmin University, Seoul, Korea


#### Abstract

We optimize the number-theoretic transforms (NTTs) in Dilithium - a digital signature scheme recently standardized by the National Institute of Standards and Technology (NIST) - on Cortex-M3 and 8-bit AVR. The core novelty is the exploration of micro-architectural insights for modular multiplications. Recent work [Becker, Hwang, Kannwischer, Yang and Yang, Volume 2022 (1), Transactions on Cryptographic Hardware and Embedded Systems, 2022] found a correspondence between Montgomery and Barrett multiplications by relating modular reductions to integer approximations and demonstrated that Barrett multiplication is more favorable than Montgomery multiplication by absorbing the subtraction to the low multiplication. We first point out the benefit of Barrett multiplication when long and high multiplication instructions are unavailable, unusable, or slow. We then generalize the notion of integer approximations and improve the emulation of high multiplications used in Barrett multiplication. Compared to the state-of-the-art assembly-optimized implementations on CortexM3, our constant-time NTT/iNTT are 1.38-1.51 times faster and our variable-time NTT/iNTT are 1.10-1.21 times faster. On our 8-bit AVR, we outperform Montgomerybased C implementations of NTT/iNTT by 6.37-7.27 times by simply switching to the proposed Barrett-based implementation. We additionally implement Barrett-based NTT/iNTT in assembly and obtain 14.10-14.42 times faster code. For the overall scheme, we provide speed-optimized implementations for Dilithium parameter sets dilithium2 and dilithium3 on Cortex-M3, and stack-optimized implementations for all parameter sets on Cortex-M3 and 8-bit AVR. We briefly compare the performance of speed-optimized dilithium3. Compared to the state-of-the-art assembly implementation on Cortex-M3, our assembly implementation reduces the key generation, signature generation, and signature verification cycles by $2.30 \%, 23.29 \%$, and $0.69 \%$. In the 8 -bit AVR environment, our Barrett-based C implementation reduces the key generation, signature generation, and signature verification cycles by $45.09 \%, 56.80 \%$, and $50.40 \%$, respectively, and our assemblyoptimized implementation reduces the cycles of each operation by $48.85 \%, 61.70 \%$, and $55.08 \%$, respectively.


Keywords: Modular multiplication • Barrett multiplication • Lattice-based cryptography • Dilithium • Microcontroller • Cortex-M3 • 8-bit AVR

## 1 Introduction

We optimize the number-theoretic transforms (NTTs) in the digital signature scheme Dilithium, recently standardized by the Nation Institute of Standards and Technology on embedded systems. Our core novelty boils down to efficient instantiations of modular
multiplications. Let $q$ be an odd positive integer and $R$ be a power of two, with exponent a power of two. In cryptography, modular multiplication modulo $q$ is one of the core computations. For efficiency reasons, people commonly instantiate arithmetic modulo $q$ with arithmetic modulo $R$. In this paper, we focus on the signed arithmetic and define $\mathbb{Z}_{n}:=$ $\left[-\frac{n}{2}, \frac{n}{2}\right) \cap \mathbb{Z}$ for a positive integer $n$. Furthermore, we denote $\bmod ^{ \pm} n$ the function sending an integer $a$ to the unique integer $a \bmod { }^{ \pm} n$ in $\mathbb{Z}_{n}$ satisfying $a \equiv a \bmod { }^{ \pm} n(\bmod n)^{1}$. We call a multiplication instruction a multiply-low instruction if it multiplies two numbers $a, b$ drawn from $\mathbb{Z}_{\mathbb{R}}$ and computes a value sufficiently close to $a b \bmod { }^{ \pm}$R. Similarly, we call a multiplication instruction a multiply-high instruction if it computes a value sufficiently close to the upper $\log _{2} \mathrm{R}$ bits of $a b$, and a multiply-long instruction if the result is sufficiently close to $a b$. For simplicity, we also call the corresponding accumulative/subtractive variants multiply-low, multiply-high, or multiply-long instructions.

For two integers $a, b \in \mathbb{Z}_{\mathrm{R}}$, Montgomery multiplication (accumulative variant) computes $a b$ with a multiply-long and finds a $\left(2 \log _{2} \mathrm{R}\right)$-bit integer $c$ with $c \equiv 0(\bmod \mathrm{R})$ and $c \equiv a b$ $(\bmod q)$ via the Chinese remainder theorem, which can be implemented by one multiply-low and one multiply-long by applying the divided difference form. Since $c$ is a multiple of $R$, we compute $\frac{c}{\mathrm{R}}$ by extracting the upper $\log _{2} \mathrm{R}$ bits and find $\frac{c}{\mathrm{R}} \equiv a b \mathrm{R}^{-1}(\bmod q)$. If $b$ is a constant known beforehand, we replace $b$ with $b \mathrm{R} \mathrm{mod}{ }^{ \pm} q$ and compute a representative of $a b \bmod { }^{ \pm} q$ [Mon85]. Recently, [Sei18] proposed a subtractive variant of Montgomery multiplication and replaced the multiply-longs by multiply-highs. Barrett multiplication is an alternative approach. Essentially, we compute a "quotient" $\tilde{q}$ with a multiply-high such that $a b-\tilde{q} q$ falls into $\mathbb{Z}_{\mathrm{B}}$ for $q \leq \mathrm{B} \leq \mathrm{R}$. We then compute the products $a b$ and $\tilde{q} q$ by two multiply-lows and subtract them. In summary, we need two multiply-high/longs and one multiply-low for Montgomery multiplication, and one multiply-high and two multiply-lows for Barrett multiplication. See Table 1 for an overview.

Table 1: Overview of the number of multiplication instructions for each type used in Montgomery and Barrett multiplications. Montgomery multiplication (acc.) stands for the accumulative variant and similarly for the subtractive variant.

|  | Multiply-low | Multiply-high | Multiply-long |
| :--- | ---: | ---: | ---: |
| Montgomery multiplication (acc.) | 1 | 0 | 2 |
| Montgomery multiplication (sub.) | 1 | 2 | 0 |
| Barrett multiplication | 2 | 1 | 0 |

From the micro-architectural point of view, multiply-low instructions are fairly common, but multiply-high and multiply-long instructions might be unavailable, incomplete, or unusable on some platforms. For example, on an ARM Cortex-M3 processor implementing Armv7-M, multiply-long instructions take $3-7$ cycles, depending on the magnitude of the result [ARM10, Table 18-1] and cannot be used in computing secret data. In this case, one usually emulates multiply-high and multiply-long instructions with multiply-low, incurring a significant performance penalty. See Table 2 for the available instructions in the instruction set architecture (ISA) Armv7E-M where Armv7E-M stands for Armv7-M with the DSP extension, and See Table 3 for an overview of the timing of multiplication instructions on Cortex-M3 and Cortex-M4. Additionally, in the AVR environment with an 8 -bit register size, our signed implementation of 16 -bit multiply-long takes 18 cycles, while our emulation of signed 32-bit multiply-long (using 8 accumulator registers) takes 102 cycles Therefore, when implementing modular multiplications, avoiding 32-bit multiply-long can lead to significant performance improvements. In other words, Barrett multiplication is obviously more favorable than Montgomery multiplication in both the Cortex-M3 and 8 -bit AVR.

[^0]Table 2: Summary of multiplication instructions for $R=2^{32}$ in Armv7E-M. Instructions followed by (E) are part of the DSP extension and do not exist in Armv7-M.

| Multiply-low | Multiply-high | Multiply-long |
| :---: | :---: | :---: |
| mul, mla, mla | smmul (E), smmulr (E) | smull, smlal, umull, umlal |

Table 3: Summary of multiplication instruction timings on Cortex-M3 (Armv7-M) and Cortex-M4 (Armv7E-M).

|  | Cortex-M3 | Cortex-M4 |
| :--- | ---: | ---: |
| mul | 1 | 1 |
| $\mathrm{mla} / \mathrm{mls}$ | 2 | 1 |
| $\{\mathrm{~s}, \mathrm{u}\}\{\mathrm{mul}, \mathrm{mla}, \mathrm{mls}\} \mathrm{l}$ | $3-7$ | 1 |

Our second observation relies on the approximation nature of Barrett multiplication. Recall that Barrett multiplication computes $a b-\tilde{q} q \in \mathbb{Z}_{\mathrm{B}}$. In the literature, people chose $\tilde{q}=\left\lfloor\frac{a\left\lfloor\frac{b \mathrm{R}}{q}\right\rceil}{\mathrm{R}}\right\rceil$ implementing $\mathrm{B}=2 q$ and $\tilde{q}=\left\lfloor\frac{a\left\lfloor\frac{2^{k} \mathrm{k}_{b \mathrm{R}}}{q}\right\rceil}{2^{k} \mathrm{R}}\right\rceil$ for sufficiently large $k$ implementing $\mathrm{B}=q$. We instead relax the bound B and relate the degree relaxation to the quality of approximation implemented by $\tilde{q}$. Conceptually, we throw away the lower-limb computation while computing $\tilde{q}$ by emulating multiply-high and show that the resulting bound B is still good enough for our use case.

Contributions. We summarize our contributions as follows.

- We show that Barrett multiplication is more favorable than Montgomery multiplication when multiply-high and multiply-long instructions are unavailable or unusable.
- We generalize the notion of integer approximations and show that the approximation nature of Barrett multiplication enables efficient emulation of multiply-highs. Along with this generalization, we find that Barrett multiplication performs the same as a long multiplication on Cortex-M3. This shows that any modular multiplication calling at least one long multiplication with non-zero preprocessing/postprocessing cost performs worse than our Barrett multiplication on Cortex-M3.
- We apply our ideas to the post-quantum digital signature Dilithium $\left[\mathrm{ABD}^{+} 20\right]$ recently standardized by NIST [NIS] on Cortex-M3 and 8-bit AVR. Compared to the state-of-the-art optimized Cortex-M3 implementation by [GKS21], our constant-time Barrett-based NTT/iNTT is $1.38-1.51$ times faster than their Montgomery-based implementation, and our variable-time Barrett-based NTT/iNTT is $1.10-1.21$ times faster than their variable-time Montgomery-based implementations. In an 8-bit AVR environment, our C implementation of Barrett-based NTT/iNTT are 6.37-7.27 times faster than the reference implementation $\left[\mathrm{ABD}^{+} 20\right]$ using Montgomery arithmetic. Additionally, our assembly implementation maximizes performance improvements (14.10-14.42 times faster).
- For the overall scheme, we provide speed-optimized implementations for Dilithium parameter sets dilithium2 and dilithium3 on Cortex-M3, and stack-optimized implementations for all parameter sets on Cortex-M3 and 8-bit AVR. We briefly compare the performance of dilithium3. Compared to the state-of-the-art assembly implementation on Cortex-M3 by [GKS21], our speed-optimized assembly implementation reduces the key generation, signature generation, and signature verification cycles by $2.30 \%, 23.29 \%$, and $0.69 \%$. As for the stack-optimized implementation, we compare our C and assembly implementations since there are no publicly available
implementations. Our assembly-optimized implementation reduces the cycles of key generation, signature generation, and signature verification of our C implementation by $12.96 \%, 27.00 \%$, and $22.72 \%$, respectively. For the 8 -bit AVR environment, we compare our implementations with the reference C Montgomery-based implementation, since there are no prior works. Our Barrett-based dilithium3 C implementation reduces the cycles of key generation, signature generation, and signature verification by $45.09 \%, 56.80 \%$, and $50.40 \%$, respectively, and our assembly-optimized Barrettbased implementation reduces the cycles of key generation, signature generation, and signature verification by $48.85 \%, 61.70 \%$, and $55.08 \%$, respectively.

Source code. Our source code will be publicly available at https://github.com/ vincentvbh/Barrett_Dilithium_Embedded.

Structure of this paper. This paper is structured as follows. Section 2 goes through the preliminaries, Section 3 describes our insights on modular multiplications, and Section 4 details our implementations of NTT/iNTT on Cortex-M3 and 8-bit AVR. Finally, Section 5 shows the performance numbers of NTT/iNTT and the overall impact on Dilithium.

## 2 Preliminiaries

Section 2.1 describes Dilithium $\left[\mathrm{ABD}^{+} 20\right]$, Section 2.2 reviews number-theoretic transform, Section 2.3 reviews integer approximations from $\left[\mathrm{BHK}^{+} 22\right]$, Section 2.4 reviews modular multiplications. Section 2.5 describes Cortex-M3, and Section 2.6 describes 8-bit AVR.

### 2.1 Dilithium

Dilithium $\left[\mathrm{ABD}^{+} 20\right]$ is a digital signature recently standardized by NIST [NIS]. It is based on the Module Small Integer Solutions (M-SIS) and the Module Learning With Errors (MLWE) problems. The module is a $k \times \ell$ matrix over the polynomial ring $\mathbb{Z}_{q}[x] /\left\langle x^{256}+1\right\rangle$ where $q=2^{23}-2^{13}+1$ is a prime and $(k, \ell)=(4,5),(6,5),(8,7)$, depending on the security level. Please see Table 4 for an overview of parameter sets and $\left[\mathrm{ABD}^{+} 20\right]$ for algorithm description.

Table 4: Dilithium parameter sets.

| Parameter set | NIST security level | $(k, \ell)$ |
| :--- | ---: | :---: |
| dilithium2 | II | $(4,4)$ |
| dilithium3 | III | $(6,5)$ |
| dilithium5 | V | $(8,7)$ |

The core operation of key generation, signature generation, and signature verification is the $(k \times \ell) \times(\ell \times 1)$ matrix-to-vector multiplications. Dilithium builds NTT into the specification - the matrix is sampled as if all the entries are ready in the domain of NTT.

### 2.2 Number Theoretic Transform

Let $R$ be a commutative ring with identity. For an $n$-th root of unity $\omega_{n} \in R$, we call it principal if $\forall j=1, \ldots, n-1, \sum_{i=0}^{n-1} \omega_{n}^{i j}=0$. For an invertible element $\zeta \in R$, we have the following isomorphism by the Chinese remainder theorem for polynomial rings:

$$
\frac{R[x]}{\left\langle x^{n}-\zeta^{n}\right\rangle} \cong \prod_{i} \frac{R[x]}{\left\langle x-\zeta \omega_{n}^{i}\right\rangle} .
$$

If $n=2^{h}$ is a power of two, we have

$$
\frac{R[x]}{\left\langle x^{n}-\zeta^{n}\right\rangle} \cong \prod \frac{R[x]}{\left\langle x^{\frac{n}{2}} \pm \zeta^{\frac{n}{2}}\right\rangle} \cong \ldots \cong \prod_{i_{0}, \ldots, i_{h-1}=0,1} \frac{R[x]}{\left\langle x-\zeta \omega_{n}^{\sum_{j} i_{j} 2^{j}}\right\rangle}
$$

This is the radix-2 Cooley-Tukey FFT for a discrete weighted transform [CT65, CF94]. In this paper, we specialize it to $\zeta=\omega_{2 n}$ so $\zeta^{n}=-1$.

We give an example for $n=2$. Let's define $\mathrm{CT}(-,-, \zeta)$ as the function $\left(a_{0}, a_{1}\right) \mapsto$ $\left(a_{0}+\zeta a_{1}, a_{0}-\zeta a_{1}\right)$. Clearly, we can implement the isomorphism $R[x] /\left\langle x^{2}-\zeta^{2}\right\rangle \cong$ $\prod R[x] /\langle x \pm \zeta\rangle$ as $a_{0}+a_{1} x \mapsto \mathrm{CT}\left(a_{0}, a_{1}, \zeta\right)$. If we further define GS $(-,-, \zeta):=\left(\hat{a}_{0}, \hat{a}_{1}\right) \mapsto$ $\left(\hat{a}_{0}+\hat{a}_{1},\left(\hat{a}_{0}-\hat{a}_{1}\right) \zeta\right)$, then $\operatorname{GS}\left(-,-, \zeta^{-1}\right) \circ \mathrm{CT}(-,-, \zeta)=\mathrm{CT}(-,-, \zeta) \circ \operatorname{GS}\left(-,-, \zeta^{-1}\right)=$ $\left(a_{0}, a_{1}\right) \mapsto 2\left(a_{0}, a_{1}\right)$ where o stands for function composition. Generally, we call $\mathrm{CT}(-,-,-)$ a Cooley-Tukey butterfly (CT butterfly) and GS $(-,-,-)$ a Gentleman-Sande butterfly (GS butterfly). See Figure 1 for illustrations.
$a_{0}$

$a_{0}+\zeta a_{1}$
(a) Cooley-Tukey butterfly $\operatorname{CT}\left(a_{0}, a_{1}, \zeta\right)$.
(b) Gentleman-Sande butterfly $\operatorname{GS}\left(\hat{a}_{0}, \hat{a}_{1}, \zeta^{-1}\right)$.

Figure 1: Radix-2 butterflies, adapted from [AHY22].

Since a size- $n$ Cooley-Tukey FFT can be implemented entirely with CT butterflies, we can invert the computation by replacing all CT butterflies with GS butterflies and canceling out the scaling at the end. Gentleman-Sande FFT [GS66] proceeds in a different way - we first convert $R[x] /\left\langle x^{m}-\psi^{m}\right\rangle$ into $R[y] /\left\langle y^{m}-1\right\rangle$ whenever $\psi^{m} \neq 1$ and split with $R[y] /\left\langle y^{m}-1\right\rangle \cong \prod R[y] /\left\langle y^{\frac{m}{2}} \pm 1\right\rangle$. Since the result of Cooley-Tukey FFT for $R[x] /\left\langle x^{n}-\zeta^{n}\right\rangle$ can be implemented entirely with GS butterflies, we can also invert it with CT butterflies. The benefit for inverting with CT butterflies instead of GS butterflies when $\zeta^{n} \neq 1$ is that while multiplying by $n^{-1}$ for canceling out the scaling, we can merge $n-1$ of them with multiplications by $\zeta^{-i}$ if we implement with CT butterflies whereas implementing with GS butterflies only allow us to merge $\frac{n}{2}$ of them with $\zeta^{-i}$. We refer to $\left[\mathrm{ACC}^{+} 22\right.$, Figure 1] for illustrations.

### 2.3 Integer Approximation

For a function $\llbracket \rrbracket: \mathbb{R} \rightarrow \mathbb{Z},\left[\mathrm{BHK}^{+} 22\right]$ call it an integer approximation if

$$
\forall r \in \mathbb{R},|r-\llbracket r \rrbracket| \leq 1
$$

Common examples are the floor function $\rfloor$, ceiling function $\rceil$, and rounding-half-up function LT. $\left[\mathrm{BHK}^{+} 22\right]$ chose $\left\rceil_{2}:=r \mapsto 2\left\lfloor\frac{r}{2}\right\rceil\right.$ and demonstrated its benefit for the vector instruction set Neon in Armv8-A. It is easily seen that $\rfloor,\lceil \rceil, L\rceil, L\rceil_{2}$ are all integer approximations.

### 2.4 Modular Multiplications

Throughout this paper, we consider $\mathrm{R}=2^{32}$ and $q<\frac{\mathrm{R}}{2}$ an odd number, and focus on signed arithmetic. For an integer approximation $\mathbb{\rrbracket}$, we define the corresponding modular reduction $\bmod { }^{\llbracket \rrbracket} q: \mathbb{Z} \rightarrow \mathbb{Z}$ as

$$
\bmod ^{\mathbb{1}} q:=z \mapsto z-\llbracket \frac{z}{q} \rrbracket q .
$$

Furthermore，we define $\left|\bmod { }^{\mathbb{\rrbracket}} q\right|:=\max _{z \in \mathbb{Z}}\left|z \bmod \mathbb{\square}^{\mathbb{1}} q\right|$ ．If $\mathbb{\llbracket}=\lfloor \rceil$ ，we denote $\bmod { }^{[1}$ as $\bmod ^{ \pm}$。

Montgomery multiplication．Let $a, b \in\left[-\frac{\mathrm{R}}{2}, \frac{\mathrm{R}}{2}\right)$ be two integers．Montgomery multi－ plication［Mon85，Sei18］computes a representative of $a b \mathrm{R}^{-1} \bmod q$ as

$$
\frac{a b+\left(a b\left(-q^{-1}\right) \bmod { }^{ \pm} \mathrm{R}\right) q}{\mathrm{R}} \equiv a b \mathrm{R}^{-1} \quad(\bmod q)
$$

If $b$ is known in prior，we replace $b$ with $b \mathrm{R} \bmod { }^{ \pm} q$ and compute

$$
\frac{a\left(b \mathrm{R} \bmod { }^{ \pm} q\right)+\left(a\left(b \mathrm{R} \bmod { }^{ \pm} q\right)\left(-q^{-1}\right) \bmod { }^{ \pm} \mathrm{R}\right) q}{\mathrm{R}} \equiv a b \quad(\bmod q)
$$

Since $\left|\frac{a\left(b \mathrm{R}_{\bmod }{ }^{ \pm} q\right)+\left(a\left(b \mathrm{R}_{\bmod }{ }^{ \pm} q\right)\left(-q^{-1}\right) \bmod { }^{ \pm} \mathrm{R}\right) q}{\mathrm{R}}\right| \leq \frac{|a|\left|\bmod ^{ \pm} q\right|+\left|\bmod ^{ \pm} \mathrm{R}\right| q}{\mathrm{R}}=\frac{q}{2}\left(1+\frac{|a|}{\mathrm{R}}\right)$ ， the result is a 32 －bit value．

In［Sei18］，they proposed the following subtractive variant for vector arithmetic：

$$
\left\lfloor\frac{a b}{\mathrm{R}}\right\rfloor-\left\lfloor\frac{\left(a b q^{-1} \bmod ^{ \pm} \mathrm{R}\right) q}{\mathrm{R}}\right\rfloor
$$

This sometimes improves the overall performance since we don＇t need to keep track of the lower $\log _{2} R$ bits of the products．

Barrett multiplication．Barrett multiplication was first introduced only for the reduc－ tion form［Bar86，Sei18］－For reducing a value $a$ ，we compute with

$$
a-\left\lfloor\frac{\left.a \left\lvert\, \frac{\mathrm{R}}{q}\right.\right\rceil}{\mathrm{R}}\right\rceil q .
$$

［Sho］proposed the multiplicative form for unsigned arithmetic，and［ $\left.\mathrm{BHK}^{+} 22\right]$ proposed the signed multiplication with integer approximations．［ $\mathrm{BHK}^{+} 22$ ］computed a representative of $a b \bmod q$ as

$$
a b-\left\lfloor\frac{a\left|\frac{b \mathrm{R}}{q}\right|}{\mathrm{R}}\right\rceil q .
$$

$\left[\mathrm{BHK}^{+} 22\right]$ showed that the result is a 32 －bit value by establishing a correspondence between Barrett and Montgomery multiplications．

A correspondence between Barrett and Montgomery multiplications．［ $\mathrm{BHK}^{+} 22$ ］ showed that for an integer approximation 【】，we have

$$
a b-\left\lfloor\frac{a \llbracket \frac{b \mathrm{R}}{q} \rrbracket}{\mathrm{R}}\right\rceil q=\frac{a\left(b \mathrm{R} \bmod \mathbb{\rrbracket}^{\mathbb{1}} q\right)+\left(a\left(b \mathrm{R} \bmod { }^{\llbracket} \mathbb{\rrbracket} q\right)\left(-q^{-1}\right) \bmod { }^{ \pm} \mathrm{R}\right) q}{\mathrm{R}} .
$$

Their proof clearly transfers to the following generalization：For integer approximations $\llbracket \rrbracket_{0}, \llbracket \rrbracket_{1}$ ，we have

$$
a b-\llbracket \frac{a \llbracket \frac{b \mathrm{R}}{q} \rrbracket_{0}}{\mathrm{R}} \rrbracket_{1} q=\frac{a\left(b \mathrm{R} \bmod \mathbb{\rrbracket}_{0} q\right)+\left(a\left(b \mathrm{R} \bmod \mathbb{\rrbracket}_{0} q\right)\left(-q^{-1}\right) \bmod \mathbb{I}_{1} \mathrm{R}\right) q}{\mathrm{R}} .
$$

We leave the justification to readers since it is completely routine by unfolding the definitions. The correspondence allows us to argue the output range as follows:

$$
\begin{aligned}
& \qquad\left|a b-\llbracket \frac{a \llbracket \frac{b \mathrm{R}}{q} \rrbracket_{0}}{\mathrm{R}} \rrbracket_{1} q\right| \leq \frac{|a|\left|\bmod \mathbb{1}_{0} q\right|+\left|\bmod \mathbb{\rrbracket}_{1} \mathrm{R}\right| q}{\mathrm{R}} . \\
& \text { If } \llbracket \rrbracket_{0}=\llbracket \rrbracket_{1}=\lfloor \rceil \text {, we have }\left|a b-\llbracket \frac{a \llbracket \frac{b \mathrm{R}}{q} \rrbracket_{0}}{\mathrm{R}} \rrbracket_{1} q\right| \leq \frac{q}{2}\left(1+\frac{|a|}{\mathrm{R}}\right) .
\end{aligned}
$$

### 2.5 Cortex-M3

Cortex-M3 implements the instruction set architecture Armv7-M. We briefly describe relevant instructions in Armv7-M [ARM21] and their timing on Cortex-M3 [ARM10]. add adds up two 32 -bit values and sub subtracts them. adc and sbc add and subtract the values with carry. lsl and lsr logically shift a 32 -bit value left and right by the specified constant/register value. asr performs an arithmetic right-shift. ubfx extracts certain consecutive bits and unsigned-extends the result to a 32 -bit value. sbfx signed-extends the result to a 32 -bit value. Each of the above instructions takes one cycle (we exclude the instruction timing involving PC operands). mul multiplies two 32 -bit values, mla accumulates the product to the accumulator, and mls subtracts the product from the accumulator. mul takes one cycle and $\mathrm{mla} / \mathrm{mls}$ takes two cycles. $\{\mathrm{u}, \mathrm{s}\}$ mull computes the 64 -bit unsigned/signed product of two 32 -bit values, and $\{u, s\} m l a l$ accumulates the product to an accumulator. $\{\mathrm{u}, \mathrm{s}\}\{\mathrm{mul}, \mathrm{mla}\} \mathrm{l}$ takes 3 to 7 cycles and is inputdependent [ARM10, Table 18-1].

### 2.6 8-Bit AVR

The 8-bit AVR microcontroller architecture employs a straightforward two-stage pipeline. Most of its instructions execute in a single cycle. Internally, it is equipped with 32 general-purpose 8 -bit registers, designated as [r0:r31]. Due to this, basic arithmetic operations, including bit operations, are performed on 8 -bit units. We briefly describe relevant instructions in 8-bit AVR environment [Atm16]. Analogous to the fundamental Cortex-M3, it supports 8-bit unit operations like add, sub, adc, and sbc. lsl and lsr logically shift an 8 -bit value one bit to the left and right, respectively. asr performs an arithmetic 1-bit right-shift. Each of the above instructions takes one cycle. Excluding the early AVR architectures like the ATtiny series, which possess byte-sized Static RAM (SRAM), the AVR microcontrollers primarily accommodate multiplication instructions via a dedicated hardware multiplication unit. The product of the multiplication is always returned in [r0:r1]. mul multiplies two unsigned 8-bit values, while muls multiplies two signed 8 -bit values. mulsu multiplies 8 -bit signed and unsigned value. These multiplication instructions take two cycles. Unlike mul, which allows all registers as operands, both muls and mulsu mandate the use of registers within the [r16:r31] range as operands.

## 3 Barrett Multiplication: Revisited

### 3.1 Integer Approximation: Revisited

We generalize the notion of integer approximations. For a function $\llbracket \rrbracket: \mathbb{R} \rightarrow \mathbb{Z}$, we call it an integer approximation if

$$
\exists \delta \in \mathbb{R}_{>0}, \forall r \in \mathbb{R},|r-\llbracket r \rrbracket| \leq \delta
$$

When $\delta$ is known, we call $\llbracket \rrbracket$ a $\delta$-integer-approximation. The generalizations of mod ${ }^{\mathbb{\square}}$ and $\left|\bmod ^{[\sqrt{~}} q\right|$ are defined in the same way.

### 3.2 Barrett Multiplication with Approximated High Products

Let $a \in\left[-\frac{\mathrm{R}}{2}, \frac{\mathrm{R}}{2}\right), b \in\left[-\frac{q}{2}, \frac{q}{2}\right)$ be integers. Recall that Barrett multiplication computes a representative of $a b \bmod q$ as

$$
a b-\llbracket \frac{a \llbracket \frac{b \mathrm{R}}{q} \rrbracket_{0}}{\mathrm{R}} \rrbracket_{1} q
$$

for integer approximations $\llbracket \rrbracket_{0}, \llbracket \rrbracket_{1}$. We choose $\llbracket \rrbracket_{0}=\lfloor \rceil$ and $\llbracket \rrbracket_{1}$ the following integer approximation:

$$
\forall r \in \mathbb{R}, \llbracket r \rrbracket_{1}:=\left\lfloor\frac{a_{l} b_{h}}{\sqrt{\mathrm{R}}}\right\rfloor+\left\lfloor\frac{a_{h} b_{l}}{\sqrt{\mathrm{R}}}\right\rfloor+a_{h} b_{h}
$$

where $a_{l}+a_{h} \sqrt{\mathrm{R}}=\frac{r \mathrm{R}}{\llbracket \frac{b \mathrm{R}}{q} \rrbracket_{0}}, b_{l}+b_{h} \sqrt{\mathrm{R}}=\llbracket \frac{b \mathrm{R}}{q} \rrbracket_{0}$ and $a_{l}, b_{l} \in[0, \sqrt{\mathrm{R}})$. Observe that

$$
\forall r \in \mathbb{R},\left|\llbracket r \rrbracket_{1}-\lfloor r\rceil\right| \leq 3,
$$

we have

$$
\left|a b-\llbracket \frac{a\left\lfloor\frac{b \mathrm{R}}{q} \rrbracket_{0}\right.}{\mathrm{R}} \rrbracket_{1} q\right| \leq\left|a b-\left\lfloor\frac{a\left\lfloor\frac{b \mathrm{R}}{q} \rrbracket_{0}\right.}{\mathrm{R}}\right\rceil q\right|+3 q \leq \frac{q}{2}\left(7+\frac{|a|}{\mathrm{R}}\right) .
$$

Therefore, computing with $a b-\llbracket \frac{a \llbracket \frac{b \mathrm{R}}{q} \rrbracket_{0}}{\mathrm{R}} \rrbracket_{1} q$ is tolerable as long as $\frac{q}{2}\left(7+\frac{|a|}{\mathrm{R}}\right)<\frac{\mathrm{R}}{2}$. In Dilithium, this is the case since $q=2^{23}-2^{13}+1$ and $R=2^{32}$. The benefit is that $\llbracket \rrbracket_{1}$ is faster than $\left\rceil\right.$ if we have to emulate them with $\frac{\log _{2} R}{2}=\frac{\log _{2} R}{2} \times \frac{\log _{2} R}{2}$ multiply-low instructions. The same argument holds if we only have $\frac{\log _{2} R}{4}=\frac{\log _{2} R}{4} \times \frac{\log _{2} R}{4}$ multiply-low instructions.

## 4 Implementations

Section 4.1 details our implementations of modular multiplications and Section 4.2 describes our layer-merging strategies.

### 4.1 Modular Multiplications

We first describe our implementations of modular multiplications. Section 4.1.1 describes our Cortex-M3 implementations and Section 4.1.2 describes our AVR implementations.

### 4.1.1 Cortex-M3

For our Armv7-M implementation on Cortex-M3, we detail our variable-time and constanttime Barrett multiplications and compare them with the state-of-the-art Montgomery multiplications by [GKS21]. Table 5 summarizes the performance cycles of various multiplication operations. Since our Barrett multiplication, along with our observation of the approximation nature, performs the same as the emulated long multiplication, we find that: Our Barrett multiplication outperforms any modular multiplication algorithm calling an emulated long multiplication followed by a reduction subroutine with non-zero cost on Cortex-M3.

Table 5: Overview of multiplication operations on Cortex-M3. All implementations are constant-time unless stated otherwise. We assume that inputs are 32-bit register values. Further optimizations may be applied by merging the decomposition with the memory operations. However, this complicates the comparisons when the layer-merging technique is applied on Cortex-M3.

| Multiplication operation | Work | Cycle |
| :--- | :--- | ---: |
| Long (variable-time) | $[$ ARM10 $]$ | $3-5$ |
| Long (constant-time) | $[$ GKS21 $]$ | 12 |
| Montgomery multiplication (variable-time) | $[$ GKS21 | $7-11$ |
| Montgomery multiplication (constant-time) | $[$ GKS21 $]$ | 23 |
| Barrett multiplication (variable-time) | This work | $6-8$ |
| Barrett multiplication (constant-time) | This work | 12 |

Variable-time Barrett multiplication. For 32-bit values $a, b$ with $b \in\left\{-\frac{q}{2}, \ldots, \frac{q}{2}\right\}$ known, we precompute $\left[\frac{2^{32} b}{q}\right\rceil$ and compute

$$
a b-\left\lfloor\frac{a\left\lfloor\frac{2^{32} b}{q}\right\rceil}{2^{32}}\right\rfloor q
$$

as a representative of $a b \bmod q$. This is our variable-time Barrett multiplication and Algorithm 1 is an illustration. Since the images of $\rfloor$ and $\rceil$ differ by at most 1 , we have

$$
\left|a b-\left\lfloor\frac{a\left\lfloor\frac{2^{32} b}{q}\right\rceil}{2^{32}}\right\rfloor q\right| \leq\left|a b-\left\lfloor\frac{a\left\lfloor\frac{2^{32} b}{q}\right\rceil}{2^{32}}\right\rceil q\right|+q \leq \frac{q}{2}\left(3+\frac{|a|}{2^{32}}\right) .
$$

Comparing to the state-of-the-art variable-time Montgomery multiplication [GKS21, $\mathrm{ACC}^{+} 21$ ] (cf. Algorithm 2), Barrett multiplication turns the smlal into an mls. Since smlal takes 3 to 7 cycles and mls takes only two cycles, Barrett multiplication is obviously faster.

```
Algorithm 1 Variable-time Barrett multiplication on Cortex-M3.
Inputs: \(\mathrm{a}=a, \mathrm{~b}=b\), bhi \(=\left\lfloor\frac{2^{32} b}{q}\right\rceil\).
Outputs: \(\mathrm{c}=a b-\left\lfloor\frac{a\left\lfloor\frac{2^{32}}{q}\right\rceil}{2^{32}}\right\rfloor q\).
smull lo, hi, a, bhi
mul \(c, a, b \quad \triangleright c=a b \bmod { }^{ \pm} 2^{32}\).
mls \(\mathrm{c}, \mathrm{hi}, q, \mathrm{c}\)
\(\triangleright 1 \mathrm{o}+\mathrm{hi} \cdot 2^{32}=a\left\lfloor\frac{2^{32} b}{q}\right\rceil\).
\(\triangleright \mathrm{c}=a b-\left\lfloor\frac{a\left\lfloor\frac{2^{32}}{q}\right\rceil}{2^{32}}\right\rfloor q\).
```

```
Algorithm 2 Variable-time Montgomery multiplication on Cortex-M3 [GKS21, ACC \({ }^{+}\)21].
Inputs: \(\mathrm{a}=a, \mathrm{~b}=b\).
Outputs: \(\mathrm{hi}=\frac{a b+\left(-a b q^{-1} \bmod { }^{ \pm} 2^{32}\right) q}{2^{32}}\).
    smull lo, hi, a, b \(\quad \triangleright\) lo \(+\mathrm{hi} \cdot 2^{32}=a b\).
    mul lo, lo, \(-q^{-1} \bmod { }^{ \pm} 2^{32} \quad \triangleright l o=-a b q^{-1} \bmod { }^{ \pm} 2^{32}\).
    smlal lo, hi, lo, \(q\)
\(\triangleright \mathrm{hi}=\frac{a b+\left(-a b q^{-1} \bmod { }^{ \pm} 2^{32}\right) q}{2^{32}}\).
```

Constant-time Barrett multiplication. For the constant-time Barrett multiplication, we again precompute $\left\lfloor\frac{2^{32} b}{q}\right\rceil$ if $b$ is known. Then, we compute

$$
a b-\llbracket \frac{a\left\lfloor\left.\frac{2^{32} b}{q} \right\rvert\,\right.}{2^{32}} \rrbracket_{1} q
$$

as a representative of $a b \bmod q$. Algorithm 3 is an illustration. For computing $\llbracket \frac{a\left\lfloor\frac{2^{32_{b}}}{q}\right\rceil}{2^{32}} \rrbracket_{1}$, we implement the macro mulhi_split as shown in Algorithm 7. In summary, our constanttime Barrett multiplication takes 10 cycles whereas constant-time Montgomery multiplication takes 19 cycles (cf. Algorithm 4).

```
Algorithm 3 Constant-time Barrett multiplication on Cortex-M3.
Inputs: \(\mathrm{a}=a, \mathrm{~b}=b\), \(\mathrm{blo}+\mathrm{bhi} \cdot 2^{16}=\left\lfloor\frac{2^{32} b}{q}\right\rceil\).
Outputs: \(\mathrm{t} 3=a b-\llbracket \frac{a\left\lfloor\frac{2^{32_{b}}}{q}\right\rceil}{2^{32}} \rrbracket_{1} q\).
    mul \(\quad \mathrm{t} 3, \mathrm{a}, \mathrm{b} \quad \triangleright \mathrm{t} 3=a b \bmod { }^{ \pm} 2^{32}\).
    ubfx t0, a, \#0, \#16
    asr \(\mathrm{a}, \mathrm{a}, \# 16 \quad \triangleright \mathrm{t} 0+\mathrm{a} \cdot 2^{16}=a\).
    mulhi_split t1, a, bhi, t0, blo, t2
    \(\mathrm{t} 3, \mathrm{t} 1, \quad q, \quad \mathrm{t} 3\)
    \(\triangleright \mathrm{t} 3=a b-\llbracket \frac{a\left\lfloor\frac{2^{32} b}{q}\right\rceil}{2^{32}} \rrbracket_{1} q\).
```

```
Algorithm 4 Constant-time Montgomery multiplication on Cortex-M3 [GKS21].
Inputs: \(\mathrm{al}+\mathrm{ah} \cdot 2^{16}=a, \mathrm{bl}+\mathrm{bh} \cdot 2^{16}=b\).
Outputs: res \(=\frac{a b+\left(-a b q^{-1} \bmod { }^{ \pm} 2^{32}\right) q}{2^{32}}\).
    sbsmull al, res, al, ah, bl, bh, tmp0
                                    \(\triangleright \mathrm{al}+\mathrm{res} \cdot 2^{32}=a b, 7\) cycles [GKS21, Listing 5].
    mul bh, al, \(-q^{-1} \bmod { }^{ \pm} 2^{32} \quad \triangleright \mathrm{bh}=-a b q^{-1} \bmod { }^{ \pm} 2^{32}\).
    \(\begin{array}{lll}\text { ubfx bl, bh, \#0, \#16 } \\ \text { asr } & \text { bh, bh, \#16 }\end{array} \quad \triangleright \mathrm{bl}+\mathrm{bh} \cdot 2^{16}=-a b q^{-1} \bmod { }^{ \pm} 2^{32}\).
    sbsmlal al, res, bl, bh, ql, qh, tmp0
                        \(\triangleright \mathrm{res}=\frac{a b+\left(-a b q^{-1} \bmod { }^{ \pm} 2^{32}\right) q}{2^{32}}, 9\) cycles [GKS21, Listing 6].
```


### 4.1.2 8-Bit AVR

Since there is no research for Dilithium in an 8-bit AVR environment, we directly implement and compare Montgomery and Barrett multiplications. Specifically, our work encompasses the full range of $16 / 32$-bit multiply-low/high/long macros operating at the granular level of 8 -bit words. In summary, our constant-time Barrett multiplication takes 129 cycles, whereas constant-time Montgomery multiplication takes 184 cycles on an 8-bit AVR environment. Table 6 gives an overview of the multiplication operations.

Table 6: Overview of multiplication operations on 8-bit AVR.

| Multiplication operation | Work | Cycle |
| :--- | :--- | ---: |
| mulsu_16x16_32 | $[$ Ret21] | 17 |
| muls_16x16_32 | [Ret21] | 18 |
| muls32xQ_lo32 | This work | 23 |
| muls32x32_1o32 | [Ret21] | 36 |
| muls32xQinv_lo32 | This work | 27 |
| muls32xQ_hi32 | This work | 51 |
| muls32x32_64 | [Ret21] | 102 |
| Montgomery multiplication | This work | 184 |
| Barrett multiplication | This work | 129 |

For the Barrett multiplication (cf. Algorithm 5), we implement two 16-bit multiplylong macros (muls16x16_32 and mulsu16x16_32) and two 32-bit multiply-low macros (muls32xQ_lo32 and muls32x32_lo32). We implement the multiply instructions by leveraging the "Move-and-Add" (MA) technique, as proposed in $\left[\mathrm{LSSR}^{+} 15\right]$ and referenced in [Ret21]. The macro muls16x16_32 multiplies two signed 16 -bit values, and macro mulsu16x16_32 multiplies 16 -bit signed and unsigned value. If one of the operands of 16 -bit multiply-long is unsigned, we can save one sbc instruction. Thus, mulsu_16x16_32 ( 17 cycles) is 1 cycle faster than muls $16 \times 16 \_32$ ( 18 cycles). We further optimize the 32 -bit multiply-low when $q$ is one of the operands by observing that the least significant 8 -bit of $q$ are 1's. Since the least significant word is 1 , there's no need to operate carry propagation due to the signed representation. As a result, we implement solely using the mul commands (no muls and mulsu). Please see Algorithm 8 for an illustration. Our muls32xQ_lo32 (23 cycles) is 13 cycles faster than the generic muls32x32_1o32 (36 cycles).

```
Algorithm 5 Constant-time Barrett multiplication on 8-bit AVR.
Inputs: \((\mathrm{a} 3\|\cdots\| \mathrm{a} 0)=a,(\mathrm{~b} 3\|\cdots\| \mathrm{b} 0)=b,(\mathrm{bp} 3\|\cdots\| \mathrm{bp} 0)=\left\lfloor\frac{2^{32} b}{q}\right\rceil\).
    \(\left((\mathrm{a} 3 \| \mathrm{a} 2)=a_{h}, \quad(\mathrm{a} 1 \| \mathrm{a} 0)=a_{l}, \quad(\mathrm{bp} 3 \| \mathrm{bp} 2)=b_{h}, \quad(\mathrm{bp} 1 \| \mathrm{bp} 0)=b_{l}\right)\)
```

Outputs: $(\mathrm{c} 3\|\cdots\| \mathrm{c} 0)=c=a b-\llbracket \frac{a\left\lfloor\frac{2^{32} b}{q}\right.}{2^{32}} \rrbracket_{1} q$
muls16x16_32 a2, a3, bp2, bp3, c0, $\cdots, \mathrm{c} 3 \quad \triangleright c=a_{h} b_{h}$.
mulsu16x16_32 bp2, bp3, a0, a1, t0, $\cdots$, t3 $\triangleright t=a_{l} b_{h}$.
mov r0, t3 lsl r0 sbc r0, r0 $\triangleright$ r0 $=\operatorname{SignExtend}(\mathrm{t} 3[7: 7])$.
add c0, t2 adc c1, t3 adc c2, r0 adc c3, r0 $\triangleright c=a_{h} b_{h}+\left\lfloor\frac{a_{l} b_{h}}{2^{16}}\right\rfloor$.
mulsu16x16_32 a2, a3, bp0, bp1, t0, $\cdots$, t3 $\triangleright t=a_{h} b_{l}$.
mov r0, t3 lsl r0 sbc r0, r0 $\quad$ r0 $=\operatorname{SignExtend}(\mathrm{t} 3[7: 7])$.
add c0, t2 adc c1, t3 adc c2, r0 adc c3, r0
$\triangleright c=a_{h} b_{h}+\left\lfloor\frac{a_{l} b_{h}}{2^{16}}\right\rfloor+\left\lfloor\frac{a_{h} b_{l}}{2^{16}}\right\rfloor=\llbracket \frac{a\left\lfloor\frac{2^{32 b}}{q}\right\rceil}{2^{32}} \rrbracket_{1}$.
muls32xQ_lo32 c0, $\cdots, \mathrm{c} 3$, qimm, t0, $\cdots$, t3 $\triangleright t=\llbracket \frac{a\left\lfloor\frac{2^{32} b}{q}\right\rceil}{2^{32}} \rrbracket_{1} q, 23$ cycles
muls32x32_lo32 a0, $\cdots, \mathrm{a} 3, \mathrm{~b} 0, \cdots, \mathrm{~b} 3, \mathrm{c} 0, \cdots, \mathrm{c} 3 \quad \triangleright r=a b \bmod { }^{ \pm} 2^{32}$.
sub c0, t0 sbc c1, t1 sbc c2, t2 sbc c3, t3
$\triangleright(\mathrm{c} 3\|\cdots\| \mathrm{c} 0)=c=a b-\llbracket \frac{a\left\lfloor\frac{2^{32} b}{q}\right\rceil}{2^{32}} \|_{1} q$.

Conversely, Montgomery multiplication based on [Sei18] (cf. Algorithm 6) requires one 32 -bit multiply-low, one 32-bit multiply-high, and one 32 -bit multiply-long macros. Since the least significant word of $q$ and $q^{-1} \bmod { }^{ \pm} 2^{32}$ in Dilithium is 1 , we implement the

32-bit multiply-low/high instructions (muls32xQinv_lo32 and muls32xQ_hi32) similarly to Algorithm 8. Macros muls32xQinv_lo32 and muls32xQ_hi32 take 27 and 51 cycles, respectively. As the 64 -bit $a b$ needs to hold be in the register, we use row-wise multiplication $\left[\mathrm{GPW}^{+} 04\right]$ technique for 32 -bit multiply-long instructions. The macro muls32x32_64 takes 102 cycles.

```
\(\overline{\text { Algorithm } 6}\) Our constant-time Montgomery multiplication on 8-bit AVR adapted
from [Sei18].
Inputs: \((\mathrm{a} 3\|\cdots\| \mathrm{a} 0)=a\), \((\mathrm{b} 3\|\cdots\| \mathrm{b} 0)=b\)
Outputs: \((r 7\|\cdots\| r 4)=\frac{a b-\left(a b q^{-1} \bmod { }^{ \pm} 2^{32}\right) q}{2^{32}}\).
    : muls32x32_64 a0, \(\cdots, \mathrm{a} 3, \mathrm{~b} 0, \cdots, \mathrm{~b} 3, \mathrm{r} 0, \cdots, \mathrm{r} 8 \quad \triangleright r=a b\).
    : muls32xQinv_lo32 r0, \(\cdot\), r3, qiimm, t0, \(\cdots\), t3
    \(\triangleright t=a b q^{-1} \bmod { }^{ \pm} 2^{32}, 27\) cycles.
    : muls32xQ_hi32 t0, \(\cdots\), t3, qimm, t4, \(\cdots\), t7
        \(\triangleright t=\frac{\left(a b q^{-1} \bmod ^{ \pm} 2^{32}\right) q}{2^{32}}, 51\) cycles.
    : sub r4, t4 sbc r5, t5 sbc r6, t6 sbc r7, t7
        \(\triangleright(\mathrm{r} 7\|\cdots\| \mathrm{r} 4)=\frac{a b-\left(a b q^{-1} \bmod { }^{ \pm} 2^{32}\right) q}{2^{32}}\).
```


### 4.2 Number Theoretic Transforms

Section 4.2.1 describes our Cortex-M3 NTT/iNTT implementations and Section 4.2.2 describes our AVR NTT/iNTT implementations.

### 4.2.1 Cortex-M3

We employ the same 2-2-2-2 layer merging strategy for variable-time NTT/iNTT as [GKS21]. As for the constant-time NTT, we compute one layer at a time due to the high register pressure. For the constant-time iNTT, the first three layers are merged as follows: we load four coefficients, apply all butterflies, load the other four coefficients, and apply all butterflies with twiddle factors 1 or $\omega_{4}$. The butterflies with twiddle factors $\omega_{8}$ and $\omega_{8}^{3}$ are computed separately. For the remaining layers, we compute one layer at a time.

### 4.2.2 8-Bit AVR

Unlike the ARM architecture, in the 8-bit AVR environment, the displacement of the load indirect (ldd) instruction is limited to [ 0,63 ]. Thus, one can access data up to 64 bytes from a base address. Accessing addresses beyond this range requires an additional 2 cycles (adiw). Furthermore, given the 32 registers, excluding the address register, there isn't sufficient space to store 4 coefficients and temporary values for merging 2 layers. As a result, in the AVR implementation, we do not employ a merging strategy and compute one layer at a time. For both NTT and iNTT, we utilize the CT butterfly. The Twiddle factors required for NTT, iNTT, and twisting are all stored in the flash memory.

## 5 Results

We present the performance numbers in this section. Section 5.1 describes our benchmarking environment, Section 5.2 shows the performance of NTTs/iNTTs, and Section 5.3 summarizes the overall performance of Dilithium.

### 5.1 Benchmarking Environment

Cortex-M3. We benchmark our Armv7-M implementations on a nucleo-f207zg board containing a stm32f207zg core with 128 KiB of SRAM and 1 MB of flash memory. According to [STM20, Sections 3.2 and 3.6], stm32f207zg provides accesses to SRAM and flash memory with 0 wait state up to the frequency 120 MHz . Nevetheless, we follow the literature $\left[\mathrm{ACC}^{+} 22\right]$ and benchmark at frequency 30 MHz for consistency. We compile our code with the cross-compiler arm-none-eabi-gcc version 10.3.1.

8-bit AVR. We benchmark our 8-bit AVR implementation using the IAR Embedded Workbench. We simulate them on the Generic Devices -v6 option with Max 16 MB of SRAM and 8 MB of flash memory. We compile our AVR codes with the compiler of IAR Embedded Workbench version 8.10.1 using High (speed) level optimization option. Since 8 -bit AVR comprise the single-pipeline structure, our simulations provide cycle counts equivalent to the benchmarks. To measure the stack size, we use the linker option (Enable stack usage analysis) of IAR Embedded Workbench, and the code size is measured through the information in the .map file.

### 5.2 Performance of Number Theoretic Transform

We first describe the performance of NTT/iNTT.

### 5.2.1 Cortex-M3

For Cortex-M3 implementations, we compare to the state-of-the-art assembly-optimized implementations by [GKS21]. Table 7 summarizes the performance of NTT/iNTT on Cortex-M3. We compare our negacyclic NTT to for the NTT by [GKS21], and the sum of cyclic iNTT and twisting to the iNTT by [GKS21]. Table 7 summarizes the numbers.

Our variable-time NTT and iNTT are $\frac{19347}{15985} \approx 1.21 \times$ and $\frac{21006}{14117+4950}=\frac{21006}{19067} \approx 1.10 \times$ faster than [GKS21], respectively. For constant-time implementations, our NTT and iNTT are $\frac{33025}{21876} \approx 1.51 \times$ and $\frac{36609}{19782+6742}=\frac{36609}{26524} \approx 1.38 \times$ faster than [GKS21], respectively.

Table 7: Performance numbers of NTT/iNTT on Cortex-M3.

| Function | [GKS21] |  | This work |  |
| :---: | :---: | :---: | :---: | :---: |
| NTT |  |  |  |  |
|  | Variable-time | Constant-time | Variable-time | Constant-time |
| NTT (negacyclic) | 19347 | 33025 | 15985 | 21876 |
| iNTT |  |  |  |  |
|  | Variable-time | Constant-time | Variable-time | Constant-time |
| iNTT (negacyclic) | 21006 | 36609 | - | - |
| iNTT (cyclic) | - | - | 14117 | 19782 |
| Point mul. | - | - | 4950 | 6742 |

### 5.2.2 8-Bit AVR

Since our implementation is the first work of Dilithium in the AVR environment, we compare it with the NIST reference implementation $\left[\mathrm{ABD}^{+} 20\right]$. Table 8 shows the comparison of the reference implementation using Montgomery arithmetic and our implementation using Barrett arithmetic. The results for our iNTT include ring twisting cycles. Our AVR Assembly implementation is a Constant-time implementation.

Our C-based NTT and iNTT are $\frac{2860881}{449457} \approx 6.37 \times$ and $\frac{3402491}{468207} \approx 7.27 \times$ faster than $\left[\mathrm{ABD}^{+} 20\right]$, respectively. For our hand-written assembly implementation, our NTT and iNTT are $\frac{2860881}{202917} \approx 14.10 \times$ and $\frac{3402491}{236028} \approx 14.42 \times$ faster than $\left[\mathrm{ABD}^{+} 20\right]$.

Table 8: Performance numbers of NTT/iNTT on 8-bit AVR.

| Function | $\left[\mathrm{ABD}^{+} 20\right]$ |  | This work |  |
| :---: | :---: | :---: | :---: | :---: |
| NTT |  |  |  |  |
|  | C | ASM | C | ASM |
| NTT (negacyclic) | 2860881 | - | 449457 | 202917 |
| iNTT |  |  |  |  |
|  | C | ASM | C | ASM |
| iNTT (negacyclic) | 3402491 | - | - | - |
| iNTT (cyclic+Point mul.) | - | - | 468207 | 236028 |

### 5.3 Performance of Scheme

This section describes the overall performance of Dilithium on Cortex-M3 and 8-bit AVR.

### 5.3.1 Cortex-M3

We compare the overall Cortex-M3 performance of Dilithium to existing works [GKS21, BRS22]. For dilithium2 and dilithium3, we provide three implementations: (i) a speedoptimized implementation using assembly Barrett-based NTTs/iNTT, (ii) a stack-optimized C implementation partially based on the stack optimizations proposed by [BRS22], and (iii) a stack-optimized implementation using assembly Barrett-based NTTs/iNTTs. For the speed-optimized implementations, we simply replace the assembly-optimized Montgomerybased NTTs/iNTTs by [GKS21] with our assembly Barrett-based NTTs/iNTTs. As for the stack-optimized implementations, we gradually apply memory optimization techniques from [BRS22] to the reference implementation until we can run the implementations on our platform. We then deploy our assembly Barrett-based NTTs/iNTTs. For dilithium5, we only provide stack-optimized implementations due to the large stack usage.

We first compare our speed-optimized assembly implementations to [GKS21]. For dilithium2, we reduce the key generation, signature generation, and signature verification cycles by $3.61 \%, 7.42 \%$, and $1.5 \%$, respectively. For dilithium3, we reduce the key generation, signature generation, and signature verification cycles by $2.3 \%, 23.29 \%$, and $0.69 \%$, respectively.

Next, we compare the performance of stack-optimized implementations. The most stack-optimized implementation is by [BRS22]. However, we cannot find publicly available source code. So, we start by applying memory optimization from [BRS22] to the C reference implementation until we can run the parameter set dilithium5. Therefore, comparisons to [BRS22] will be unfair, and we compare our own stack-optimized C and assembly implementations. For dilithium2, our assembly-optimized reduces the key generation, signature generation, and signature verification cycles by $14.74 \%, 27.45 \%$, and $24.68 \%$, respectively. For dilithium3, our assembly-optimized reduces the key generation, signature generation, and signature verification cycles by $12.96 \%, 27.00 \%$, and $22.72 \%$, respectively. For dilithium5, our assembly-optimized reduces the key generation, signature generation, and signature verification cycles by $11.71 \%, 23.75 \%$, and $19.69 \%$, respectively.

Table 9: Performance of Dilithium on Cortex-M3. K stands for key generation, $\mathbf{S}$ stands for signature generation, and $\mathbf{V}$ stands for signature verification.

| Security Level | Work | Operation |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | K |  | S |  | V |  |
|  |  | Cycles | Stack | Cycles | Stack | Cycles | Stack |
| II | Ref | 2164k | 38452 | 9193k | 52076 | 2377 k | 36364 |
|  | [GKS21] | 1913k | 38484 | 6603 k | 52116 | 1805k | 36404 |
|  | [BRS22]* | 2927k | 4.9 KiB | 18470k | 5.0 KiB | 4036k | 2.7 KiB |
|  | Stack (C) | 2354k | 7700 | 17538 k | 9068 | 2593 k | 9756 |
|  | Stack (ASM) | 2007k | 7692 | 12723k | 9052 | 1953 k | 9748 |
|  | Speed (ASM) | 1844k | 38444 | 6113 k | 52068 | 1778 k | 36356 |
| III | Ref | 3712k | 60972 | 13114 k | 79716 | 3895k | 57860 |
|  | [GKS21] | 3 309k | 60876 | 11681k | 79620 | 3026k | 57772 |
|  | [BRS22]* | 5112k | 6.4 KiB | 36303 k | 6.5 KiB | 7249k | 2.7 KiB |
|  | Stack (C) | 3757 k | 9748 | 25415 k | 11108 | 3957k | 10772 |
|  | Stack (ASM) | 3270k | 9740 | 18554k | 11092 | 3058k | 10764 |
|  | Speed (ASM) | 3 233k | 60964 | 8961k | 79708 | 3005k | 57852 |
| V | [BRS22]* | 8609k | 7.9 KiB | 44332 k | 8.1 KiB | 12616 k | 2.7 KiB |
|  | Stack (C) | 6199 k | 11796 | 36133 k | 13148 | 6501 k | 13076 |
|  | Stack (ASM) | 5473 k | 11788 | 27553 k | 13140 | 5221 k | 13068 |

* We cannot find a publicly available implementation for the work [BRS22] and the stack consumption is reported with unit KiB.


### 5.3.2 8-Bit AVR

Table 10: Performance of Dilithium on 8-bit AVR.

| Security Level | Work | Operation |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | K |  | S |  | V |  |
|  |  | Cycles | Stack | Cycles | Stack | Cycles | Stack |
| II | Ref** (C) | 73556 k | 9282 | 166961 k | 12059 | 86860 k | 12751 |
|  | Stack (C) | 47732 k | 9282 | 72731 k | 12059 | 49 138k | 12751 |
|  | Stack (ASM) | 44181 k | 9282 | 62414 k | 12059 | 44081k | 12751 |
| III | Ref** (C) | 154028k | 11330 | 491601 k | 14107 | 169 770k | 13775 |
|  | Stack (C) | 84579 k | 11330 | 212 376k | 14107 | 84213 k | 13775 |
|  | Stack (ASM) | 78786 k | 11330 | 188 290k | 14107 | 76267 k | 13775 |
| V | Ref ** (C) | 255058k | 13378 | 1091977 k | 16155 | 276570 k | 16079 |
|  | Stack (C) | 144925 k | 13378 | 521106 k | 16155 | 146 478k | 16079 |
|  | Stack (ASM) | 135 525k | 13378 | 471359 k | 16155 | 134076 k | 16079 |

** Our stack-optimized implementation based on reference code $\left[\mathrm{ABD}^{+} 20\right]$.

The pure reference code, as seen in Table 9, consumes a significant amount of SRAM in the 8 -bit AVR environment and cannot be simulated. Thus, we implement a stack-optimized version based on the reference code $\left[\mathrm{ABD}^{+} 20\right]$ and designate it as comparison code (denoted as $\operatorname{Ref}^{* *}$ ). Table 5.3.2 compares our implementation utilizing Barrett arithmetic with Ref ${ }^{* *}$ of all security levels of Dilithium. Given the limited stack space in the AVR environment, unlike Cortex-M3, we only consider the stack-optimized implementation. First, we compare the Ref** that uses Montgomery multiplication with our implementation using Barrett
multiplication. Both implementations are coded in the C language. For dilithium2, we reduce the key generation, signature generation, and signature verification cycles by $35.11 \%, 56.44 \%$, and $43.43 \%$, respectively. For dilithium3, we reduce the key generation, signature generation, and signature verification cycles by $45.09 \%, 56.80 \%$, and $50.40 \%$, respectively. For dilithium5, we reduce the key generation, signature generation, and signature verification cycles by $43.18 \%, 52.28 \%$, and $47.04 \%$, respectively.

Subsequently, we compare our hand-written assembly implementation with C-based Ref**. Compared with the Ref** for all security levels of Dilithium, our assembly implementation reduces the clock cycles by nearly half. For dilithium2, we reduce the key generation, signature generation, and signature verification cycles by $39.94 \%, 62.62 \%$, and $49.25 \%$, respectively. For dilithium3, we reduce the key generation, signature generation, and signature verification cycles by $48.85 \%, 61.70 \%$, and $55.08 \%$, respectively. For dilithium5, we reduce the key generation, signature generation, and signature verification cycles by $46.86 \%, 56.83 \%$, and $51.52 \%$, respectively.

All implementations ( $\operatorname{Ref}^{* *}(\mathrm{C})$, Stack (C), and Stack (ASM)) of each Dilithium operation consume the same stack size, and the code size is about 50 KiB in all implementations. As can be seen from the comparison between C implementations, the 32-bit multiply-long instruction is one of the most significant bottlenecks in the 8-bit AVR environment. Especially, the significant performance improvement in the Dilithium signature generation, dominated by the rejection loop, clearly highlights the benefits of Barrett multiplication.

## A Detailed Implementations

```
Algorithm 7 Implementation of mulhi_split on Cortex-M3.
Inputs: \(\mathrm{alo}=a_{l}\), \(\mathrm{ahi}=a_{h}\), blo \(=b_{l}\), bhi \(=b_{h}\).
Outputs: acchi \(=a_{h} b_{h}+\left\lfloor\frac{a_{1} b_{h}}{2^{16}}\right\rfloor+\left\lfloor\frac{a_{h} b_{l}}{2^{16}}\right\rfloor\).
    mul acchi, ahi, bhi \(\triangleright \operatorname{acchi}=a_{h} b_{h}\).
    mul accmid, alo, bhi \(\triangleright\) accmid \(=a_{l} b_{h}\).
    add acchi, acchi, accmid, asr \#16 \(\quad \triangleright\) acchi \(=a_{h} b_{h}+\left\lfloor\frac{a_{l} b_{h}}{2^{16}}\right\rfloor\).
    mul accmid, ahi, blo
\(\triangleright\) acchi \(=a_{h} b_{h}+\left\lfloor\frac{a_{l} b_{h}}{2^{16}}\right\rfloor+\left\lfloor\frac{a_{h} b_{l}}{2^{16}}\right\rfloor\).
```

```
Algorithm 8 Implementation of muls32xQ_lo32 on 8-bit AVR.
Inputs: \((\mathrm{a} 3\|\cdots\| \mathrm{a} 0)=a\)
Outputs: \((\mathrm{c} 3\|\cdots\| \mathrm{c} \|)=a q \bmod { }^{ \pm} 2^{32}\)
    movw c0, a0 movw c2, a2 ldi q, 0xE0 mul a0, q
    add c1, r0 adc c2, r1 adc c3, zero mul a2, q
    add c3, r0 mul a1, q add c2, r0 adc c3, r1
    ldi \(q, 0 x 7 F\) mul a0, q add c2, r0 adc c3, r1
    mul a1, q add c3, r0
```


## References

$\left[\mathrm{ABD}^{+} 20\right]$ Roberto Avanzi, Joppe Bos, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky, John M. Schanck, Peter Schwabe, Gregor Seiler, and Damien Stehlé. CRYSTALS-Dilithium. Submission to the NIST Post-Quantum Cryptography Standardization Project [NIS], 2020. https://pq-crystals.org/ dilithium/. 3, 4, 13, 14, 15
$\left[\mathrm{ACC}^{+} 21\right]$ Erdem Alkim, Dean Yun-Li Cheng, Chi-Ming Marvin Chung, Hülya Evkan, Leo Wei-Lun Huang, Vincent Hwang, Ching-Lin Trista Li, Ruben Niederhagen, Cheng-Jhih Shih, Julian Wälde, and Bo-Yin Yang. Polynomial Multiplication in NTRU Prime Comparison of Optimization Strategies on Cortex-M4. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021(1):217238, 2021. https://tches.iacr.org/index.php/TCHES/article/view/8733. 9
$\left[\mathrm{ACC}^{+} 22\right]$ Amin Abdulrahman, Jiun-Peng Chen, Yu-Jia Chen, Vincent Hwang, Matthias J. Kannwischer, and Bo-Yin Yang. Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022(1):127-151, 2022. https://tches.iacr.org/index.php/TCHES/ article/view/9292. 5, 13
[AHY22] Erdem Alkim, Vincent Hwang, and Bo-Yin Yang. Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022(4):349-371, 2022. https://tches.iacr.org/index.php/TCHES/article/view/9823. 5
[ARM10] ARM. Cortex-M3 Technical Reference Manual, 2010. https://developer.arm. com/documentation/ddi0337/h. 2, 7, 9
[ARM21] ARM. Armv7-M Architecture Refernce Manual, 2021. https://developer.arm. com/documentation/ddi0403/ed. 7
[Atm16] Atmel. AVR Instruction Set Manual, 2016. https://ww1.microchip.com/ downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf. 7
[Bar86] Paul Barrett. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In CRYPTO 1986, LNCS, pages 311-323. SV, 1986. https://link.springer.com/chapter/10. 1007/3-540-47721-7_24. 6
$\left[\mathrm{BHK}^{+} 22\right]$ Hanno Becker, Vincent Hwang, Matthias J. Kannwischer, Bo-Yin Yang, and Shang-Yi Yang. Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022(1):221-244, 2022. https://tches.iacr.org/index.php/TCHES/ article/view/9295. 4, 5, 6
[BRS22] Joppe W Bos, Joost Renes, and Amber Sprenkels. Dilithium for Memory Constrained Devices. In International Conference on Cryptology in Africa, pages 217-235. Springer, 2022. 14, 15
[CF94] Richard Crandall and Barry Fagin. Discrete Weighted Transforms and Large-integer Arithmetic. Mathematics of computation, 62(205):305324, 1994. https://www.ams.org/journals/mcom/1994-62-205/ S0025-5718-1994-1185244-1/?active=current. 5
[CT65] James W. Cooley and John W. Tukey. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation, 19(90):297-301, 1965. https://www.ams.org/journals/mcom/1965-19-090/ S0025-5718-1965-0178586-1/. 5
[GKS21] Denisa O. C. Greconici, Matthias J. Kannwischer, and Daan Sprenkels. Compact Dilithium Implementations on Cortex-M3 and Cortex-M4. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2021(1):1-24, 2021. https:
//tches.iacr.org/index.php/TCHES/article/view/8725. 3, 8, 9, 10, 12, 13, 14, 15
[GPW $\left.{ }^{+} 04\right]$ Nils Gura, Arun Patel, Arvinderpal Wander, Hans Eberle, and Sheueling Chang Shantz. Comparing elliptic curve cryptography and rsa on 8 -bit cpus. In Cryptographic Hardware and Embedded Systems-CHES 2004: 6th International Workshop Cambridge, MA, USA, August 11-13, 2004. Proceedings 6, pages 119-132. Springer, 2004. 12
[GS66] W. M. Gentleman and G. Sande. Fast Fourier Transforms: For Fun and Profit. In Proceedings of the November 7-10, 1966, Fall Joint Computer Conference, AFIPS '66 (Fall), pages 563-578. Association for Computing Machinery, 1966. https://doi.org/10.1145/1464291.1464352. 5
[LSSR $\left.{ }^{+} 15\right]$ Zhe Liu, Hwajeong Seo, Sujoy Sinha Roy, Johann Großschädl, Howon Kim, and Ingrid Verbauwhede. Efficient ring-lwe encryption on 8-bit avr processors. In Cryptographic Hardware and Embedded Systems-CHES 2015: 17th International Workshop, Saint-Malo, France, September 13-16, 2015, Proceedings 17, pages 663-682. Springer, 2015. 11
[Mon85] Peter L. Montgomery. Modular Multiplication Without Trial Division. Mathematics of computation, 44(170):519-521, 1985. https://www.ams.org/journals/ mcom/1985-44-170/S0025-5718-1985-0777282-X/?active=current. 2, 6
[NIS] NIST, the US National Institute of Standards and Technology. Post-quantum cryptography standardization project. https://csrc.nist.gov/Projects/ post-quantum-cryptography. 3, 4, 16
[Ret21] RetroDan. AVR Assembler Site, 2021. https://avr-asm.tripod.com/avr201. html. 11
[Sei18] Gregor Seiler. Faster AVX2 optimized NTT multiplication for Ring-LWE lattice cryptography. 2018. https://eprint.iacr.org/2018/039. 2, 6, 11, 12
[Sho] Victor Shoup. NTL: a Library for Doing Number Theory. http://www.shoup. net/ntl/. 6
[STM20] STMicroelectronics. STM32F207ZG, 2020. https://www.st.com/en/ microcontrollers-microprocessors/stm32f207zg.html. 13


[^0]:    ${ }^{1}$ The notation $\bmod { }^{ \pm}$specifies that signed representation is used for defining $\mathbb{Z}_{n}$.

