# Generalized Universal Circuits for Secure Evaluation of Private Functions with Application to Data Classification 

Ahmad-Reza Sadeghi and Thomas Schneider<br>Horst Görtz Institute for IT-Security, Ruhr-University Bochum, Germany<br>\{ahmad.sadeghi,thomas.schneider\}@trust.rub.de


#### Abstract

Secure Evaluation of Private Functions (PF-SFE) allows two parties to compute a private function which is known by one party only on private data of both. It is known that PF-SFE can be reduced to Secure Function Evaluation (SFE) of a Universal Circuit (UC). Previous UC constructions only simulated circuits with gates of $d=2$ inputs while gates with $d>2$ inputs were decomposed into many gates with 2 inputs which is inefficient for large $d$ as the size of UC heavily depends on the number of gates. We present generalized UC constructions to efficiently simulate any circuit with gates of $d \geq 2$ inputs having efficient circuit representation. Our constructions are non-trivial generalizations of previously known UC constructions. As application we show how to securely evaluate private functions such as neural networks (NN) which are increasingly used in commercial applications. Our provably secure PF-SFE protocol needs only one round in the semi-honest model (or even no online communication at all using non-interactive oblivious transfer) and evaluates a generalized UC that entirely hides the structure of the private NN. This enables applications like privacy-preserving data classification based on private NNs without trusted third party while simultaneously protecting user's data and NN owner's intellectual property.


Key words: universal circuits, secure evaluation of private functions, neural networks, private data classification, privacy

## 1 Introduction

Today, a variety of new business models can be provided as electronic services where customers post their requests to a remote provider who performs specific knowledge based operations on their data and provides customers with the results. Examples are expert systems for heath diagnostics, remote data bases, multimedia data processing, or data classification tools (e.g., for spam). From security targets point of view customers may send sensitive and security critical data and hence require the protection (confidentiality and integrity) of their data, while the service providers may require the protection of their Intellectual Property (IP), i.e., their expertise embedded in their system.

Hence, the problem can be formulated as follows: two parties, a service requester $R$ (client) and a service provider $P$ (server), are involved in the computation of a function $f$ (belonging to $P$ ) on data $x$ (input by $R$ ) where $P$ should not obtain any information about $x$ and $R$ should not get any useful information about $f$ besides the result $f(x)^{1}$,

For arbitrary functions $f$, this is tackled by Secure Function Evaluation (SFE) of Private Functions (PF-SFE). SFE protocols [28|2|14|13|9|5] allow two parties to securely evaluate a common function on private data. Based on this, PF-SFE [20|16|10] evaluates a Universal Circuit (UC) [25|10] as common function which is programmed with the private function $f$. As UC can be programmed to simulate any function, it entirely hides $f$ while SFE ensures privacy of data $x$.

Previous UC constructions can simulate circuits with gates of $d=2$ inputs only. For example, circuits performing arithmetic operations like addition, number comparison, or multiplication are most efficiently implemented from chains of $d=3$ input gates (full adders, and full comparers). When these types of circuits need to be simulated with known UC constructions (to hide which arithmetic operations are performed), the large gates must be decomposed into many $d=2$ input gates, e.g., five gates per 3-input gate using Shannon's expansion theorem: $\left.f(a, b, c)=\left(\left.c \wedge f(a, b)\right|_{c=1}\right) \vee\left(\left.\bar{c} \wedge f(a, b)\right|_{c=0}\right)\right)$.

To overcome this overhead, we generalize previous UC constructions to $d \geq 2$ as non-trivial extensions of previous work together with new constructions that are especially suited to simulate small circuits. Our constructions are much more efficient than using the straight-forward solution of evaluating $\lceil d / 2\rceil$ known UCs (where $d=2$ ) in parallel. The overhead of our best Generalized UC (GUC) construction is by a factor of two smaller than the best known UC construction for the practical example given in $\$ 5.4$.

As application of our GUC constructions we show how to securely evaluate private Neural Networks (NN) where each neuron has $d$ inputs. Amongst others, NNs are very useful tools for data classification including pattern/sequence recognition, and sequential decision making. NNs are increasingly becoming important for deployment in commercial applications like spam filtering [3], speech recognition [24, and many more [2327] where the "expertise" of the provider is embedded in NN. Neural networks allow to model any function [7], and are robust against noise. Previous work [15]17] is based on straight forward use of homomorphic encryption in multiple rounds that cannot hide NN's structure completely and is not provably secure. In contrast we show that our solution (i) is efficient w.r.t. the size of the circuit needed for a reasonable NN, (ii) requires no or only one round to evaluate a GUC that (iii) hides the underlying NN and its topology entirely and (iv) is provably secure in the semi-honest model.

### 1.1 Related Work

Two-Party Secure Function Evaluation (SFE) protocols [28|29] allow two parties to evaluate any function represented as boolean circuit which is known to

[^0]both of them on their respective private inputs. All of these protocols are provably secure against semi-honest adversaries and can be extended to be secure against malicious adversaries via cut-and-choose 1413|5. SFE in semi-honest model is based on Oblivious Transfer (OT) requiring one round of communication 1918 or a non-interactive implementation based on an extension of trusted computing modules 6.

NNs can be represented as boolean circuits - threshold NNs with backpropagation [21] or evolutionary learning algorithms [18] were implemented in hardware, but were not considered to be evaluated within SFE protocols yet.

SFE can be extended to private functions (PF-SFE) where the private function is known to one party only and hidden entirely from the other party. PF-SFE is reduced to SFE of a Universal Circuit (UC) that is programmed with the private function and entirely hides its structure [20|16|10]. The SFE protocol of [9] which allows very efficient evaluation of UCs can also be used to improve evaluation of GUCs of this paper (by a factor between two- and fourfold).

Currently known UC constructions 2510 can simulate gates with $d=2$ inputs only. Gates with more than two inputs can be simulated by decomposing them into multiple gates with two inputs.

Oblivious training and evaluation of NNs was studied in the context of Oblivious Polynomial Evaluation (OPE) [2]. The OPE protocol reduces oblivious polynomial evaluation to OT. Based on this protocol, they show how to train and evaluate NNs in multiple rounds without hiding the structure of the NN. Nonlinear activation functions are either evaluated using a circuit based SFE protocol or piece-wise approximated as polynomials evaluated via OPE protocol.

Oblivious NN evaluation using homomorphic encryption [15|17] requires multiple rounds as well - one per layer of the NN. The protocol allows evaluation of NNs with several activation functions like threshold or sigmoid function. To hide the structure of the NN, dummy neurons are introduced but this does not entirely hide the structure (e.g., maximal number of neurons per layer and maximal number of layers are revealed as outlined in their paper). Also, these protocols are not provably secure as blinding an additively homomorphic encrypted value with a randomly chosen factor reveals information on the magnitude of the value.

### 1.2 Our Contributions

In $\$ 4$ we present practical Generalized Universal Circuit (GUC) constructions to efficiently simulate circuits of gates with $d \geq 2$ inputs having efficient circuit representations. Former UCs are restricted to $d=2$ and decompose larger gates into multiple gates with two inputs resulting in much more overhead than our GUC constructions (cf. Table 1 in $\$ 5.4$ ). Our constructions are non-trivial generalizations of known UC constructions that are special cases of ours for $d=2$.

Based on these GUC constructions we present protocols to securely evaluate private Neural Networks (NN) with activation functions implemented as small circuits (such as threshold function) in $\$ 5$. Unlike previous work, our protocols are provably secure, need a constant number of rounds (one round in the semi-honest model or even no online communication at all using non-interactive

OT), hide NN's structure entirely (besides number of in- and outputs, maximum degree $d$ and number of neurons $k$ ) and are still practical.

### 1.3 Basic Idea and Outline

A Generalized Universal Circuit (GUC) is a circuit which can be programmed to simulate an arbitrary circuit that consists of gates with $d$ inputs each. These gates are required to have an efficient circuit representation which is the case in our example for neurons in $\$ 5$ or if the size of their function table $2^{d}$ is small.

A GUC can be thought of as a kind of processor (here: programmable circuit) that takes as input some data $x$ and a program $p_{f}$ corresponding to a function (here: input circuit) and evaluates the program on the data (here: input circuit on data): $U C\left(p_{f}, x\right)=f(x)$. As GUC can be programmed with any function it does not reveal anything about the function. This allows to evaluate arbitrary functions privately. In contrast to previous UC constructions we relax the restriction that input circuits need to consist of gates of $d=2$ inputs only.

In $\$ 4$ we give different methods to construct GUCs in an iterative ( $\$ 4.1$, modular ( $\S 4.2$ ), or graph based ( $\S 4.3$ ) way and compare them in $\$ 4.4$. Definitions are given in $\$ 2$, new building blocks in $\$ 3$,

As practical application where simulation of $d$ input gates is advantageous, we show how neural networks (NN), described in $\$ 5.1$, can be expressed as circuits consisting of $d$ input gates in $\$ 5.2$. Their structure can be entirely hidden inside a GUC which allows secure evaluation of private NNs as shown in 45.3 . Finally, we compare our GUC construction and its application to securely evaluate private NNs to previously known constructions and protocols in $\$ 5.4$.

## 2 Definitions and Preliminaries

The following definitions generalize those of [10] to gates with multiple inputs.
A gate $G$ is the implementation of a boolean function $\{0,1\}^{d} \rightarrow\{0,1\}$ with $d$ inputs and one output. The size of a gate $G$, denoted by $|G|$, is the multiple of function table entries needed to implement the gate w.r.t. a 2 input gate, namely $|G|=2^{d-2}$ (e.g., $|B|_{d=1}=0.5,|B|_{d=2}=1$, etc.).

We consider acyclic circuits consisting of connected gates with arbitrary fanout, i.e., the output of each gate can be used as input to arbitrary many gates. Further, each output of circuit $C$ is the output of a gate and not a redirected input of $C$. The size of a circuit is the sum of the sizes of its gates. Communication and computation complexity of SFE protocols is linear in the size of the circuit.

A programmable gate is a gate with an unspecified function table. To program it, a specific function table with $2^{d}$ entries for each input combination is given.

To simplify presentation we group gates into functional blocks as follows:
A block $B_{v}^{u}$ is a sub-circuit with $u$ inputs $i n_{1}, . ., i n_{u}$ and $v$ outputs out ${ }_{1}, . .$, out $_{v}$. $B_{v}^{u}$ computes a function $f_{B}:\{0,1\}^{u} \rightarrow\{0,1\}^{v}$ that maps the input values to the output values. For simplicity, we identify $B_{v}^{u}$ with $f_{B}$ and write: $B\left(i n_{1}, . ., i n_{u}\right)=$
$\left(\right.$ out $_{1}, . .$, out $\left._{v}\right)$. Blocks consist of connected gates and other sub-blocks. The size of block $B$, denoted by $|B|$, is the sum of the sizes of its sub-elements.

A programmable block is a block consisting of programmable gates or programmable blocks. It is programmed by programming each of its sub-elements.

A generalized Universal Circuit (GUC) $\mathrm{UC}_{k, u, v \times d}$ is a programmable block with $u$ inputs and $v$ outputs that can be programmed (denoted by $\mathrm{UC}_{k, u, v \times d}^{C}$ ) to simulate any circuit $C$ with up to $u$ inputs, $v$ outputs and $k$ gates with $d$ inputs, i.e., $\forall\left(i n_{1}, . ., i n_{u}\right) \in\{0,1\}^{u}: \mathrm{UC}_{k, u, v \times d}^{C}\left(i n_{1}, . ., i n_{u}\right)=C\left(i n_{1}, . ., i n_{u}\right)$.

We use programmable block constructions from [10] with the given number of in- and outputs and the following informal functionalities (see [10 for exact definitions and constructions): $Y$ switching block ( $Y_{1}^{2}$ programmable as left: out $_{1}=i n_{1}$ or right: out ${ }_{1}=i n_{2}$ ), $X$ switching block ( $X_{2}^{2}$ programmable as pass: $\left(\right.$ out $_{1}$, out $\left._{2}\right)=\left(i n_{1}, i n_{2}\right)$ or cross: $\left(\right.$ out $_{1}$, out $\left.\left._{2}\right)=\left(i n_{2}, i n_{1}\right)\right), S_{v}^{u}$ selection block (programmable to select for each of the $v$ outputs any of the $u$ inputs including duplicates). Their sizes and properties are summarized in Table 3 in $\$ B$.

Our constructions use different compositions of wires: A (single) wire has value either 0 or 1 and is drawn as thin arrow $(\longrightarrow)$.

A multi wire $W$ consist of $\omega$ wires with fixed ordering. The single wires can be indexed by $W[1], . ., W[\omega]$. The value of $W$ is the unsigned integer value $w=\sum_{i=1}^{\omega} 2^{i-1} W[i]$. Multi wires are drawn as filled thick arrows $(\longrightarrow)$.

A bundle consists of wires with irrelevant ordering and no duplicates (no two wires are the output of the same gate). $u \times d$ denotes $u$ bundles of $d$ wires each. Bundles are drawn as unfilled arrows ( $\longrightarrow$ ).

Exact calculations of the sizes of our constructions and building blocks with all intermediate steps are given in

## 3 Bundle Blocks for GUC Constructions

The main difference of our efficient GUC constructions compared to previous UC constructions is to switch bundles of $d$ wires instead of single wires only. To do this, we construct efficient bundle blocks that are used as the fundamental building blocks of the GUC constructions described in 4
$C_{\boldsymbol{d}}^{\boldsymbol{u}}$ Choice Block is a programmable block that can be programmed to choose from the $u$ inputs a bundle of $d$ distinct values as outputs (without recurrence) where the order of the outputs does not matter. More formally, given a subset $\Gamma \subseteq\{1, . ., u\},|\Gamma|=d$, the choice block computes $C\left(i n_{1}, . ., i n_{u}\right)=\left(i n_{\gamma_{1}}, . ., i n_{\gamma_{d}}\right)$ where $\Gamma=\left\{\gamma_{1}, . ., \gamma_{d}\right\}$ (note, the set equality implies irrelevance of ordering and no duplicate inputs). Of course the definition of a choice block makes only sense for $u \geq d$ as $\Gamma$ is undefined for $u<d$.

A simple implementation of a $C_{d}^{u}$ choice block, $C_{d}^{u, \text { simple }}$, is to use $d$ selection blocks $S_{1}^{u-d+1}$ in parallel: out ${ }_{i}=S_{1}^{u-d+1}\left(i n_{i}, . ., i n_{u-d+i}\right) ; i=1, . ., d$, i.e., the first selection block can be programmed to select any of the inputs $i n_{1}, . ., i n_{u-d+1}$, the second any of $i n_{2}, . ., i n_{u-d+2}$ and so on. This results in
$\left|C_{d}^{u, \text { simple }}\right|=d \cdot\left|S_{1}^{u-d+1}\right|=d u-d^{2}$. The equation also holds true for $u=d$ where the choice block consists of wires only and its size is 0 . The straight-forward programming algorithm works as follows: sort $\Gamma$ ascendingly; for $i=1, . ., d$ do: let $k$ be the smallest element in $\Gamma$; program the $i$-th selection block to select $i n_{k}$; remove $k$ from $\Gamma$; next $i$. Correctness and efficiency are easy to verify.

Alternatively, a choice block, $C_{d}^{u, \text { sublin }}$, which is much more efficient for larger $d$ can be derived as a special case of bundle permutation block described next.
$\boldsymbol{B P} \boldsymbol{P}_{\boldsymbol{v} \times \boldsymbol{d}}^{\boldsymbol{u} \geq \boldsymbol{d}}$ Bundle Permutation Block is a programmable block that can be programmed to permute the $u$ inputs to $v$ bundled outputs of $d$ wires each (without duplicates). We define bundle permutation blocks to have at least as many inputs as outputs: $u \geq v d$. More formally, let $S \subseteq 1, . ., u ;|S|=v d$ be the subset of inputs that are chosen as outputs and $\Phi=\left(\Phi_{1}, . ., \Phi_{v}\right)$ be an ordered partition of $S$ with $\bigcup_{i=1}^{v} \Phi_{i}=S ;\left|\Phi_{i}\right|=d$, the block computes $B P\left(i n_{1}, . ., i n_{u}\right)=$ $\left(i n_{\varphi_{1,1}}, . ., i n_{\varphi_{1, d}}, . ., i n_{\varphi_{v, 1}}, . ., i n_{\varphi_{v, d}}\right)$, with $\Phi_{i}=\left\{\varphi_{i, 1}, . ., \varphi_{i, d}\right\} ; i=1, . ., v$.

Our efficient implementation of $B P_{v \times d}^{u \geq v d}$ bundle permutation block is based on the truncated permutation block construction of [10]. A $T P_{v}^{u \geq v}$ truncated permutation block is a programmable block that can be programmed to permute its $u \geq v$ inputs to its $v$ outputs ( $u$ and $v$ need not be equal or powers of two as in [26]). Remaining $u-v$ inputs are discarded (truncated permutation).
$T P_{v}^{u}$ block is constructed recursively from two $T P_{v / 2}^{u / 2}$ sub-blocks and $u / 2-1$ $X$ switching blocks on top that distribute the inputs to the sub-blocks and $v / 2 X$ switching blocks on bottom to distribute their outputs to the outputs of $T P_{v}^{u}$ block as shown in Fig. 1(a). W.l.o.g. we assume $u$ and $v$ are even at each recursion step (otherwise we introduce an unused dummy input or output with small overhead). Note, that our construction is upside-down compared to the original construction of [10] to have the saved $X$ blocks on top instead (the programming algorithm remains the same using the inverse permutation instead). This modification implies that our GUC construction M3 in special case $d=2$ is more efficient than the original UC construction of [10].

To obtain an efficient $B P_{v \times d}^{u \geq v d}$ bundle permutation block, a $T P_{v d}^{u \geq v d}$ truncated permutation block is constructed without the lowest $\log d$ layers of $X$ switching blocks which can be replaced with wires as the order within each bundle of $d$ wires is irrelevant. Hence, $\left|B P_{v \times d}^{u}\right|=\left|T P_{v \bar{d}}^{u \geq v d}\right|-|X| \cdot \log d \cdot d v / 2=$ $(u+d v) \log v+(\log d+1) u-3 d v+2$. An efficient programming algorithm for bundle permutation blocks can easily be derived from the one given in [10.

Our efficient construction of the bundle permutation block is indeed a generalization of the truncated permutation block of [10] which is a special case for $d=1$ of our construction $\left(B P_{v \times 1}^{u \geq v} \equiv T P_{v}^{u \geq v}\right)$ with exactly the same size $\left|B P_{v \times 1}^{u \geq v}\right|=\left|T P_{v}^{u \geq v}\right|=(u+v) \log v+u-3 v+2$.

By fixing the other parameter $v=1$, we obtain a more efficient (sub-linear in the number of outputs $d$ ) construction for choice blocks, $C_{d}^{u, \text { sublin }}:=B P_{1 \times d}^{u \geq d} \equiv$ $C_{d}^{u}$, with size $\left|C_{d}^{u, \text { sublin }}\right|=\left|B P_{1 \times d}^{u \geq d}\right|=(\log d+1) u-3 d+2$.

## 4 Generalized Universal Circuits

A generalized universal circuit (GUC) $U C_{k, u, v \times d}$ is a boolean circuit that can be programmed to simulate any circuit with $u$ inputs, $v$ outputs and $k$ gates with $d \geq 2$ inputs each. Existing UC constructions 2510] can simulate gates with two inputs only and can be seen as a special case of our corresponding generalized constructions for $d=2 . U C_{k, u, v \times d}$ has exactly $u$ inputs and $v$ outputs.

Each $d$ input gate of the simulated circuit is simulated within a gate simulation block, i.e., a programmable block which can be programmed to simulate the functionality of the gate and has $d$ inputs and 1 output. Gates are simulated in topologic order which can be computed efficiently by topologic sorting in $O(k)$.

In the following, we assume that the order of the inputs of a gate simulation block is irrelevant and no inputs are duplicated. This is the case for gate simulation blocks implemented as a $d$ input programmable gate that is programmed with the same function table as the simulated $d$ input gate of exponential size $|G|=2^{d-2}$. The entries of the function table can be swapped according to an arbitrary input ordering and duplicate inputs can easily be eliminated. Also for gate simulation blocks that implement neurons as a circuit the order of inputs is irrelevant and duplicate inputs can be eliminated as we will explain in $\$ 5$. The irrelevance of the input ordering without duplicates is reflected by bundles of $d$ wires as input into a gate simulation block.

If the order of inputs of the simulated gates is relevant or duplicate inputs are needed, the following GUC constructions can be extended by replacing the bundle blocks from $\S 3$ with their corresponding non-bundled counterparts where the order of outputs does matter at the cost of a small overhead: $C_{d}^{u, \text { simple }} \mapsto S_{d}^{u}, C_{d}^{u, \text { sublin }} \mapsto S_{d}^{u \geq d}, B P_{v \times d}^{u \geq v d} \mapsto T P_{v \bar{d}}^{u \geq v d}$.

We stress that the sizes of all building blocks and the GUCs presented in the remainder of this section only depend on the parameters $u, v, k, d$ but neither on the input data nor the simulated circuit. Hence, dynamic choice of the smallest implementation for each building block in so called combined constructions respectively choosing smallest GUC construction reveals nothing about the input data nor the simulated circuit.

### 4.1 Iterative GUC Constructions

A simple GUC is constructed iteratively by choosing for the $i$-th gate simulation block $G_{i}$ any of the $u$ inputs of the circuit or the output of a previous gate simulation block $G_{1}, . ., G_{i-1}$ with $C_{d}^{u+i-1}$ choice block. The $v$ outputs of the GUC can be selected to be any of the outputs of the $k$ gate simulation blocks using $S_{v}^{k \geq v}$ selection block.
Using simple $C_{d}^{u \text {,simple }}$ choice blocks results in a total size of

$$
\begin{aligned}
\left|U C_{k, u, v \times d}^{\mathbf{I 1}}\right| & =0.5 d k^{2}+\left(d u-d^{2}-0.5 d+|G|+1\right) k+(k+3 v) \log v-4 v+3 \\
& \sim 0.5 d \cdot k^{2}+d \cdot u k .
\end{aligned}
$$

With sub-linear $C_{d}^{u, \text { sublin }}$ choice blocks instead the construction has size

$$
\begin{aligned}
\left|U C_{k, u, v \times d}^{\mathbf{I 2}}\right|=( & 0.5 \log d+0.5) k^{2}+(u \log d+u-0.5 \log d-3 d+|G|+2.5) k \\
& +(k+3 v) \log v-4 v+3 \sim 0.5 \log d \cdot k^{2}+\log d \cdot u k .
\end{aligned}
$$

A combination of both approaches chooses the smallest implementation for each choice block (simple or sub-linear) dynamically.

$$
\left|U C_{k, u, v \times d}^{\mathrm{I}}\right| \leq \min \left(\left|U C_{k, u, v \times d}^{\mathrm{I} 1}\right|,\left|U C_{k, u, v \times d}^{\mathbf{I} 2}\right|\right) .
$$

All iterative GUC constructions are only practical for a small number of simulated gates $k$ and few inputs $u$.

### 4.2 Modular GUC Constructions

Another approach to construct a GUC is to separate the inputs and the outputs from the simulation of the gates. This results in modular GUC constructions which are a generalization of the modular UC construction of [10]. The modular $G U C$ is composed out of three programmable blocks as shown in Fig. 1(b). The generalized Universal Block (UB), $U_{k \times d}$, simulates the $k$ gates, the input selection block chooses the corresponding inputs and the output selection block chooses the outputs of the simulated gates as outputs of the modular GUC. The construction has size

$$
\left|U C_{k, u, v \times d}^{\mathrm{Mi}}\right|=\left|S_{d k \geq u}^{u}\right|+\left|U_{k \times d}^{\mathrm{Mi}}\right|+\left|S_{v}^{k \geq v}\right| \sim 2 d k \log k+d k \log u+\left|U_{k \times d}^{\mathrm{Mi}}\right| .
$$

The overall complexity is determined by the complexity of the generalized UB $U_{k \times d}$ that only depends on the number of simulated gates $k$ and no longer on the number of inputs $u$ or outputs $v$. The generalized UB $U_{k \times d}$ has $k$ bundles of $d$ inputs where the $i$-th bundle, $i n_{d i}, . ., i n_{d i+d-1}$, can be switched to the $i$-th gate simulation block $G_{i}$; the output of $G_{i}$ is connected to out $t_{i}$ of $U_{k \times d}$.

Next, we give different constructions for generalized UBs that can be plugged into the modular GUC construction. The iterative constructions (generalized from [22, Section 5.3.1]) grow like $d \cdot k^{2}$ and $\log d \cdot k^{2}$ and the recursive construction (generalized from [10) grows like $d k \log ^{2} k$.

Iterative Generalized Universal Block Construction. An iterative construction for a generalized UB is similar to the iterative GUC construction described in $\$ 4.1$ but without the dependancy on the inputs that are handled efficiently by the input selection block of the modular GUC construction. For each gate simulation block $G_{i}$, a $C_{d}^{d+i-1}$ choice block can be programmed to choose any of the $d$ wires of the $i$-th input bundle of the generalized UB and the $i-1$ outputs of the previous gate simulation blocks $G_{1}, . ., G_{i-1}$ as input to $G_{i}$. Using simple $C_{d}^{u, \text { simple }}$ choice blocks this results in a total size of

$$
\left|U_{k \times d}^{\mathrm{M1}}\right|=0.5 d k^{2}-0.5 d k+k \cdot|G| \sim 0.5 d \cdot k^{2} .
$$



Fig. 1. Building blocks for GUCs

With sub-linear $C_{d}^{u, \text { sublin }}$ choice blocks instead the size is

$$
\begin{aligned}
\left|U_{k \times d}^{\mathrm{M} 2}\right| & =(0.5 \log d+0.5) k^{2}+(d \log d-0.5 \log d-2 d+|G|+1.5) k \\
& \sim 0.5 \log d \cdot k^{2}
\end{aligned}
$$

Both modular iterative constructions still grow like $k^{2}$ but are more efficient than the iterative GUC constructions from 4.1 for circuits with many inputs due to the efficient handling of inputs with the input selection block.

Recursive Generalized Universal Block Construction. A generalization of the recursive UB construction of [10 yields a generalized UB of size

$$
\begin{gathered}
\left|U_{k \times d}^{\mathrm{M} 3}\right|=(0.625 d+0.25) k \log ^{2} k+(0.5 d \log d-0.625 d-1.25) k \log k \\
+(|G|+3) k-3 \sim 0.625 d k \log ^{2} k
\end{gathered}
$$

For lack of space, the detailed description of the construction is in $\S \mathrm{C}$ Compared to the constructions presented before, the recursive construction grows like $k \log ^{2} k$ instead of $k^{2}$ which is clearly much slower for larger circuits.

Combined Generalized Universal Block Construction. A combination of these generalized UB constructions uses the smallest generalized UB implementation (M1, M2 or M3) dynamically. Dynamic Programming avoids recalculation of the smallest construction for given parameters by caching it in a table.

$$
\left|U_{k \times d}^{\mathrm{M}}\right| \leq \min \left(\left|U_{k \times d}^{\mathrm{M} 1}\right|,\left|U_{k \times d}^{\mathrm{M} 2}\right|,\left|U_{k \times d}^{\mathrm{M} 3}\right|\right)
$$

### 4.3 Universal Graph based GUC Construction

A generalization of Valiant's UC construction [25] which is based on Universal Graphs results in a GUC construction of size

$$
\left|U C_{k, u, v \times d}^{\mathrm{UG}}\right| \sim 4.75 d\left(2 k+u+\frac{v}{d-1}\right) \log k .
$$

This GUC construction is asymptotically better than those shown before for large circuits and is based on the following theorem.

Theorem 1. Each circuit of arbitrary fan-out with $v$ outputs and $k$ gates of $d$ inputs each can be converted into an equivalent circuit with fan-out $\leq d$ by adding at most $k+\frac{v}{d-1}$ gates.

For lack of space, the detailed description of this construction and the proof of the theorem are given in $\$$.

### 4.4 Comparison of GUC Constructions

The sizes of the different GUC constructions for several practical parameters are shown in Fig. 4 of $\$$ B Which construction is the smallest (and hence should be used for least overhead) depends on the parameters $d, k, u$, and $v$ only. Quadratic constructions ( $I 1, I 2, M 1, M 2$ ) are suitable for small, whereas recursive construction (M3) is better for mid-size and universal graph based construction $(U G)$ for large circuits.

## 5 Secure Evaluation of Private NNs with GUCs

### 5.1 Structure of NNs

A neural network (NN) is an acyclic directed graph of several neurons. The neurons are arranged in multiple layers (Fig. 2) in topologic order. Each neuron has $d$ input bits, one output bit, internal precision of $s$ bits and is structured as shown in Fig. 3(a). Each input bit $i n_{i}$ of a neuron is multiplied with a $s$ bit constant weight $w_{i}$. The $\Sigma$ block computes the sum $\sigma$ of these weighted inputs. The threshold function $\tau$ compares $\sigma$ with a threshold value $t$ and sets the output: if $\sigma:=\sum_{i=1}^{d} w_{i} \cdot i n_{i}<t$, then out $=0$, else out $=1$.

The input neurons of the first layer in the NN have only 1 input with multiple bits $m$. They can also be realized with a neuron as shown in Fig. 3(a) by setting the weights correspondingly.

Neurons fulfill the restrictions for $d$ input gates of 4 . Inputs can be permuted arbitrarily by permuting their weights in the same way. Duplicate inputs into a neuron with the same source can be merged into one input by adding their


Fig. 2. NN with 4 layers of $k=10$ neurons with degree $d=3$ including $u=3$ input neurons and $v=2$ outputs.
weights. Implementation of neurons presented next guarantees that the weights of neurons remain hidden from requester and hence these modifications are not detectable for him.

### 5.2 Circuit Implementation of Neurons

If the number of inputs $d$ is small, each neuron can be implemented as programmable $d$ input gate of size $\left|N_{d, s}^{\text {gate }}\right|=2^{d-2}$ (for arbitrary activation function), otherwise as programmable $d$ input block (Fig. 3(a). The neuron can be programmed depending on the weights $\omega_{1}, . ., \omega_{d}$ and the threshold value $t$.


Fig. 3. Circuit implementation of a neuron

As before we give directly the total size of the building blocks. Exact calculations with all intermediate steps are given in $\$$ E. 2 .

The $w$ block multiplies its input bit with constant $\omega$ of $s$ bits (Fig. 3(c)). The bits $\omega[i], i=1, . ., s$, determine the programming of the programmable gate $W_{i}$ : if $\omega[i]=0$, then $W_{i}=0$, else $W_{i}=i n$. The size is $\left|W_{i}\right|=0.5,|w|=s \cdot\left|W_{i}\right|=0.5 s$.

The $\Sigma$ block sums up the $d$ input values of $s$ bits each to a $s+\log d$ bit value by pairwise adding them in a tree (Fig. 3(b)). An adder $A D D_{s+1}$ to add two $s$ bit values to an $s+1$ bit value is composed as usual from a half adder $H A$ and $s-1$ full adders $F A$ (Fig. $3(\mathrm{~d})$. The size is $|H A|=2,|F A|=4$, $\left|A D D_{s+1}\right|=4(s+1)-6,|\Sigma|<4 d s+2 d-4 s+6$.

The $\tau$ block compares the $s+\log d$ bit input with an $s+\log d$ bit constant $t$ (Fig. $3(\mathrm{e})$ ). The carry $c_{i}=0, i=1, . ., s+\log d$, tells that the $i$ least significant bits of in are less than the $i$ least significant bits of $t$ : if $\left(\left(i n \bmod 2^{i}\right)<\left(t \bmod 2^{i}\right)\right)$, then $c_{i}=0$, else $c_{i}=1$. Depending on $t[1]$, the programmable half comparer block $H C$ is programmed: if $t[1]=0$, then $H C=1$, else $H C=i n_{1}$. The remaining bits $t[i], i=2, . ., s+\log d$, determine the program of the programmable full comparer blocks $F C_{i}$ : if $t[i]=0$, then $F C_{i}=i n_{i} \vee c_{i-1}$, else $F C_{i}=i n_{i} \wedge c_{i-1}$. The size is $|H C|=0.5,|F C|=1,|\tau|=s+\log d-0.5$.

The total size of a neuron implemented as programmable block is $\left|N_{d, s}^{\text {block }}\right|<4.5 d s+2 d-3 s+\log d+5.5$.

### 5.3 Protocol for Oblivious Evaluation of NNs using GUCs

Oblivious evaluation of NNs is reduced to SFE of GUCs similar to the reduction for PF-SFE [10. A GUC is programmed to simulate the structure of the NN. Each gate simulation block $G$ is instantiated with a programmable circuit for a neuron $N_{d, s}$ programmed with the coefficients of the neuron it simulates: $U C_{N N}=\left.U C_{k, u, v \times d}\right|_{G=N_{d, s}}$. Programmed GUC simulates the NN and entirely hides its structure (besides size and number of inputs and outputs): $\forall\left(i n_{1}, . ., i n_{u}\right) \in\{0,1\}^{u}: \mathrm{UC}_{N N}\left(i n_{1}, . ., i n_{u}\right)=N N\left(i n_{1}, . ., i n_{u}\right)$.

When requester evaluates the programmed $U C_{N N}$ with a SFE protocol, he learns no more about $N N$ than the maximal number of neurons $k$, maximal degree of neurons $d$, internal precision $s$, inputs $u$ and outputs $v$.

The protocol needs one round in the semi-honest model (using interactive OT such as [19118]) or is non-interactive (using non-interactive OT of [6]). Recall, the reduction from PF-SFE to SFE using UC (resp. GUC in our case) is noncryptographic and the security of the PF-SFE protocol is exactly that of the underlying SFE protocol which is provably secure against semi-honest adversaries (e.g., 28129 ). This can be extended to be provably secure against malicious adversaries by using correspondingly secure SFE protocols (e.g., 1413|5) which need more than one but still a constant small number of rounds.

### 5.4 Comparison with Previous Work

We compare our GUC constructions with existing UC constructions before comparing our protocol for secure evaluation of NNs with existing protocols. As example for both comparisons we use the practical NN to classify sonar targets from [4] for which performance results are given in [15]. However, we use threshold instead of sigmoid as activation function as performance of [15] is almost the same independent of the used activation function. Besides this we use exactly the same parameters for the neural network, namely $u=60$ inputs, $v=2$ outputs, $k=12$ hidden neurons with an internal resolution $s=20$ (corresponding to their quantization factor $Q=10^{-6}$ ). In order to obfuscate the structure of NN they propose to embed NN into a grid of 5 layers with 15 neurons each. Hence, the in-degree of each neuron is $d=15$ while the total number
of neurons is hidden to be less than $k=75$. This results in neurons of size $|N|=\left|N_{d=15, s=20}^{\text {block }}\right| \leq 1,330<\left|N_{d=15, s=20}^{\text {gate }}\right|=8,192$.

We compare three possible alternatives that protect the internal weights and thresholds of the neurons and incrementally protect the structure of the NN:
(A) Embed NN into a $5 \times 15$ grid to obfuscate its structure (same as [15]): evaluate $75 \cdot|N| \leq 99,750$ gates.
(B) Hide the structure of the NN entirely:
simulate NN of $k=12$ neurons $(12 \cdot|N| \leq 15,960$ gates) in GUC $\left(\left|U C_{k=12, u=60, v=2 \times d=15}^{\mathrm{M} 1}\right| \leq 2,304\right.$ gates, cf. Fig. $4(\mathrm{~g})$ in $\left.\S \mathrm{B}\right)$.
(C) Additionally hide the number of neurons to be less than 75 :
simulate NN of $k=75$ neurons ( $75 \cdot|N| \leq 99,750$ gates) in GUC
$\left(\left|U C_{k=75, u=60, v=2 \times d=15}^{\mathrm{M} 2}\right| \leq 27,371\right.$ gates, cf. Fig. 4(h) in $\left.\$ \mathrm{~B}\right)$.

Comparison of GUC Construction. As shown in Table 1 , the overhead of our GUC introduced in case (B) and (C) is very moderate compared to known UC constructions. Applying $U C$ directly results in an overhead which is by a factor of $>10^{3}$ times bigger, while using $\lceil d / 2\rceil=8$ parallel UCs still is by a factor of more than two times bigger than our solution.

Table 1. Comparison of UC overhead (in number of gates)

|  | UC |  | Parallel UCs |  | GUC |  |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
|  | $[25]$ | $[10]$ | $25]$ | $[10]$ | $\boxed{4}]$ |  |
| (B) | $k=12$ | $4,242,114$ | $5,556,431$ | 23,432 | 6,577 | 2,304 |
| $(\mathrm{C})$ | $k=75$ | $31,482,358$ | $47,478,158$ | 100,359 | 58,618 | 27,371 |

Comparison of Protocol for Secure Evaluation of NNs. We used Fairplay SFE system [14] implemented in Java as well without cut and choose step to evaluate a circuit with $u=60$ inputs, $v=2$ outputs and the given number of gates within two processes on a notebook with 2.16 GHz Intel Core 2 processor and 2 GB memory. As [15] do not specify the exact hardware used ("two midlevel notebooks, connected on a LAN network"), the results of the comparison shown in Table 2 are qualitative but not necessarily quantitative:

Unlike the protocol in [15], ours are provably secure and approaches (B)] and (C) hide the structure of NN better than just obfuscating it.

The total amount of communication overhead of [15] is by a factor of at least 10 times better than our solutions, however they need multiple rounds (for each layer of the NN) whereas our solutions need only one round in the semi-honest model or even no online-communication at all (using non-interactive OT). On analyzing the communication complexity separately for server and client we see that in our solutions the amount of data sent by the client is much smaller than that of the server and only depends on the number of inputs but not on the size
of NN. The amount of data sent by client is by a factor of five less than that in [15] (as the client in their symmetric protocol sends approximately half of the total data). This asymmetry in the communication exactly corresponds to modern communication networks such as mobile networks or the internet, where the upstream of the client is much slower than its downstream. Downloading the maximum amount of 5.4 MB in (C) realistic for today's mobile networks.

The total time for executing the protocol of (A) and (C) is almost the same as that of [15] while (B) is almost three times faster. While in [15] client has to do only $20 \%$ of the work, in our protocols, server and client need approximately the same amount of computation for creating and evaluating the garbled circuit. Online-computation of our server can be avoided almost completely by constructing the garbled circuit in advance while server is idle. This reduces total execution time of our protocols by half.

Using [9] as underlying SFE protocol or a high-speed SFE implementation such as 11 written in C with elliptic curve based OT would further improve communication and computation complexity of our protocols.

Table 2. Comparison of protocols for secure evaluation of NNs

| Protocol | (15] | (A) | (B) | (C) |
| :---: | :---: | :---: | :---: | :---: |
| Level of privacy | obfuscate | obfuscate | hide structure | hide structure\&size |
| Provably secure | no |  | yes |  |
| Communication (Total) | 76 kB | 4.2 MB | 0.79 MB | 5.4 MB |
| Server (send) | $\approx 38 \mathrm{kB}$ | 4.2 MB | 0.78 MB | 5.4 MB |
| Client (send) | $\approx 38 \mathrm{kB}$ |  | 7.5 kB |  |
| Rounds | 5 |  | 1 (0) |  |
| Computation (Total) | 11.7 s | 11.3 s | 4.0 s | 13.4 s |
| Server | 9.3 s | $\approx 5.7 \mathrm{~s}$ | $\approx 2.0 \mathrm{~s}$ | $\approx 6.7 \mathrm{~s}$ |
| Client | 2.4 s | $\approx 5.7 \mathrm{~s}$ | $\approx 2.0 \mathrm{~s}$ | $\approx 6.7 \mathrm{~s}$ |

Acknowledgments. This work has been supported by the European Union under the FP7-STREP project CACE. We thank Vladimir Kolesnikov and anonymous reviewers for their helpful comments.

## References

1. William Aiello, Yuval Ishai, and Omer Reingold. Priced oblivious transfer: How to sell digital goods. In Advances in Cryptology - EUROCRYPT 2001, volume 2045 of $L N C S$, pages 119-135. Springer, 2001.
2. Yan-Cheng Chang and Chi-Jen Lu. Oblivious polynomial evaluation and oblivious neural learning. In Advances in Cryptology - ASIACRYPT 2001, volume 2248 of $L N C S$, pages 369-384. Springer, 2001.
3. Rich Drewes. An artifical neural network spam classifier, August 2002. http://www.interstice.com/drewes/cs676/spam-nn/.
4. R. Paul Gorman and Terrence J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1(1):75-89, 1988.
5. Vipul Goyal, Payman Mohassel, and Adam Smith. Efficient two party and multi party computation against covert adversaries. In Advances in Cryptology - EUROCRYPT 2008, volume 4965 of LNCS, pages 289-306. Springer, 2008.
6. Vandana Gunupudi and Stephen R. Tate. Generalized non-interactive oblivious transfer using count-limited objects with applications to secure mobile agents. In Financial Cryptography and Data Security (FC '08), volume 5143 of LNCS, pages 98-112. Springer, 2008.
7. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.
8. Yuval Ishai, Joe Kilian, Kobbi Nissim, and Erez Petrank. Extending oblivious transfers efficiently. In CRYPTO 2003, volume 2729 of LNCS. Springer, 2003.
9. Vladimir Kolesnikov and Thomas Schneider. Improved garbled circuit: Free XOR gates and applications. In 35th Int. Colloquium on Automata, Languages and Programming (ICALP '08), volume 5126 of LNCS, pages 486-498. Springer, 2008.
10. Vladimir Kolesnikov and Thomas Schneider. A practical universal circuit construction and secure evaluation of private functions. In Financial Cryptography and Data Security (FC '08), volume 5143 of LNCS, pages 83-97. Springer, 2008. http://thomaschneider.de/FairplayPF
11. Y. Lindell, B. Pinkas, and N. Smart. Implementing two-party computation efficiently with security against malicious adversaries. In Security and Cryptography for Networks (SCN '08), volume 5229 of $L N C S$, pages 2-20. Springer, 2008.
12. Yehuda Lindell and Benny Pinkas. A proof of Yao's protocol for secure two-party computation. Cryptology ePrint Archive, Report 2004/175, 2004.
13. Yehuda Lindell and Benny Pinkas. An efficient protocol for secure two-party computation in the presence of malicious adversaries. In Advances in Cryptology EUROCRYPT 2007, volume 4515 of $L N C S$, pages 52-78. Springer-Verlag, 2007.
14. Dahlia Malkhi, Noam Nisan, Benny Pinkas, and Yaron Sella. Fairplay - a secure two-party computation system. In USENIX, 2004. http://www.cs.huji.ac.il/project/Fairplay/fairplay.html.
15. Claudio Orlandi, Alessandro Piva, and Mauro Barni. Oblivious neural network computing via homomorphic encryption. European Journal of Information Systems (EURASIP), 2007(1):1-10, 2007.
16. Benny Pinkas. Cryptographic techniques for privacy-preserving data mining. SIGKDD Explor. Newsl., 4(2):12-19, 2002.
17. Alessandro Piva, Michele Caini, Tiziano Bianchi, Claudio Orlandi, and Mauro Barni. Enhancing privacy in remote data classification. New Approaches for Security, Privacy and Trust in Complex Environments (SEC '08), 2008.
18. Vassilis P. Plagianakos and Michael N. Vrahatis. Parallel evolutionary training algorithms for "hardware-friendly" neural networks. Natural Computing, 1(2-3):307322, 2002.
19. Michael O. Rabin. How to exchange secrets with oblivious transfer. Technical report, Harward University, 1981. Available at Cryptology ePrint Archive, Report 2005/187.
20. Tomas Sander, Adam Young, and Moti Yung. Non-interactive cryptocomputing for $N C^{1}$. In Proc. 40th IEEE Symp. on Foundations of Comp. Science, FOCS 'g9, pages 554-566, New York, 1999. IEEE.
21. Kazuo Sato and Hiroomi Hikawa. Implementation of multilayer neural network with threshold neurons and its analysis. Artificial Life and Robotics, 3(3):170-175, 1999.
22. Thomas Schneider. Practical secure function evaluation. Master's thesis, University of Erlangen-Nuremberg, 2008. http://thomaschneider.de/theses/da/
23. StatSoft, Inc. STATISTICA Automated Neural Networks, 2008. http://www.statsoft.com/products/stat_nn.html.
24. Joe Tebelskis. Speech Recognition using Neural Networks. PhD thesis, School of Computer Science, Pittsburgh, 1995.
25. Leslie G. Valiant. Universal circuits (preliminary report). In Proc. 8th ACM Symp. on Theory of Computing, STOC '\%6, pages 196-203. ACM Press, 1976.
26. Abraham Waksman. A permutation network. J. ACM, 15(1):159-163, 1968.
27. Wolfram Research, Inc. Mathematica(c) NEURAL NETWORKS, 2008. http://www.wolfram.com/products/applications/neuralnetworks/
28. Andrew C. Yao. How to generate and exchange secrets. In Proc. 27th IEEE Symp. on Foundations of Comp. Science, FOCS '86, pages 162-167, Toronto, 1986. IEEE.

## A Universal Graph based GUC Construction

To construct an asymptotically better UC than the practical constructions presented before we generalize Valiant's UC construction [25]. Valiant shows how to construct a UC by embedding the given circuit into a universal graph $\Gamma_{d}\left(k^{\prime}\right)$ with $\left|\Gamma_{d}\left(k^{\prime}\right)\right| \sim 4.75 d k^{\prime} \log k^{\prime}$, where $d$ is the fan-in and fan-out of the simulated graph and $k^{\prime}$ is the number of simulated nodes. A circuit with $u$ inputs and $k$ gates having fan-in and fan-out maximum $d$ can be represented as such a simulateable graph with $k^{\prime}=k+u$ nodes and embedded into this universal graph. We generalize Valiant's construction (that uses $d=2$ and can simulate circuits with gates having 2 inputs only) to a GUC that simulates circuits with arbitrary fixed in-degree $d$ and arbitrary fan-out of size

$$
\left|U C_{k, u, v \times d}^{\mathbf{U G}}\right| \sim 4.75 d\left(2 k+u+\frac{v}{d-1}\right) \log k .
$$

The circuit is converted from arbitrary fan-out to a circuit with fan-out at most $d$ which can be embedded into the universal graph $\Gamma_{d}\left(k^{\prime}\right)$. This is done by replacing each gate with fan-out $x>d$ by a binary tree of $\left\lceil\frac{x}{d-1}\right\rceil+1$ gates with fan-out $\leq d$. At most $e=k+\frac{v}{d-1}$ extra gates are needed as described in A.1. Setting $k^{\prime}=k+u+e$ results in the stated complexity for the generalized Valiant's UC construction. By setting $d=2$ we obtain exactly the asymptotic complexity of Valiant's original UC construction.

## A. 1 Converting Circuit to Fan-out $\leq d$

As described in a circuit with $s$ gates of arbitrary fan-out and $m$ outputs can be converted into one having gates with fan-out $\leq d$ only, by replacing each gate with fan-out $x>d$ by a binary tree of $\left\lceil\frac{x}{d-1}\right\rceil+1$ gates with fan-out $\leq d$ each.

Theorem 1 in $\$ 4.3$ gives an upper bound for the maximal number of extra gates $e$ added which we prove similar to [25, Fact 3.1 and Corollary 3.1]:

Fact: Suppose that a graph $G$ with fan-in $d$ has among the nodes of in-degree zero, $n^{\prime}$ nodes of nonzero out-degree, and amongst the rest $v^{\prime}$ nodes of out-degree zero, $f_{i}$ of out-degree $i, i=1, . ., d-1$, and $g$ of out-degree greater than $d-1$. Suppose also that $e^{\prime}=\sum($ out - degree $-d)=\sum x$ over the set of nodes with out-degree greater than $d$. Then $e^{\prime} \leq \sum_{i=1}^{d-1}(d-i) f_{i}+d v^{\prime}-n^{\prime}$.

Proof. The total of the out-degrees must equal the total of the in-degrees. The former is $\geq n^{\prime}+\sum_{i=1}^{d-1}\left(i f_{i}\right)+d g+e^{\prime}$, and the latter is $\leq d\left(v^{\prime}+\sum_{i=1}^{d-1} f_{i}+g\right)$.

Corollary: $e \leq k+\frac{v}{d-1}$
Proof. Any gate with fan-out $x+d, x>0$, can be replaced by a binary tree with $\left\lceil\frac{x}{d-1}\right\rceil+1 \leq \frac{x}{d-1}+2$ gates. Hence for any circuit there is an equivalent one having fan-out $d$ and at most $e \leq \frac{e^{\prime}}{d-1}+g$ more gates. But the total number of gates $k=\sum_{i=1}^{d-1} f_{i}+g+v^{\prime}$ and in any minimal circuit $v^{\prime} \leq v$. Hence at most

$$
e \leq \frac{e^{\prime}}{d-1}+g \leq \frac{d}{d-1} v^{\prime}+\sum_{i=1}^{d-1} \frac{d-i}{d-1} f_{i}+g \leq s+\frac{d}{d-1} v^{\prime}-v^{\prime} \leq k+\frac{v}{d-1}
$$

extra gates are needed which completes the proof of the corollary and Theorem 1 .

## B Tables and Figures

Table 3. Programmable switching blocks

| Block |  | Name of Block | Size | $\begin{gathered} \text { Dupli- } \\ \text { cates } \end{gathered}$ | $\begin{aligned} & \mathrm{Or}- \\ & \text { der } \end{aligned}$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $S_{1}^{u}$ | (10) | Selection | u-1 | X | X |
| $S_{v}^{u}$ | 10] | Selection (simple) | $v(u-1)$ | X | X |
| $S_{v \geq u}^{u}$ | 10 | Selection (efficient) | $(u+v) \log u+2 v \log v-2 u-v+3$ | X | X |
| $S_{v}^{u \sum v}$ | 10 | Selection (efficient) | $(u+3 v) \log v+u-4 v+3$ | X | X |
| $S_{2 u}^{u}$ | 10 | Selection (improved) | $6 u \log u+3$ | X | X |
| $P_{u}^{u}$ | [26] | Permutation | $2 u \log u-2 u+2$ | - | X |
| $E P_{v \geq u}^{u}$ | 10 | Permutation (expanded) | $(u+v) \log u-2 u+2$ | - | X |
| $C_{d}^{u, \text { simple }}$ | (\$3) | Choice (simple) | $d u-d^{2}$ | - | - |
| $C_{d}^{u, \text { sublin }}$ | ( 83 | Choice (sub-linear) | $(\log d+1) u-3 d+2$ | - | - |
| $B P_{v \times d}^{u \geq v d}$ | ( 93 3) | Bundle Permutation | $(u+d v) \log v+(\log d+1) u-3 d v+2$ | - | - |
| $B S_{u \times d}^{u, \text { choice }}$ | ( $¢$ | Bundle Selection (choice) | u $\cdot\left\|C_{d}^{u}\right\|$ | X | - |
| $B S_{u \times d}^{u, \text { perm }}$ | ( C ( | Bundle Selection (perm) | $(2.5 d+1) u \log u+(d \log d-d-2) u+3$ | X | - |



Fig. 4. Comparison of GUC constructions for different parameters

## C Recursive Generalized Universal Block Construction

Before showing how a UB can be constructed recursively as generalization of the construction in [10], we introduce bundle selection block, a new bundle block.
$\boldsymbol{B} \boldsymbol{S}_{\boldsymbol{u \times d}}^{\boldsymbol{u}}$ Bundle Selection Block is a generalization of selection block. It is a programmable block that can be programmed to select for each of the $u$ bundles of outputs (with $d$ wires per bundle) any of the $u$ inputs (with duplicates between different bundles). More formally, when programmed with a tuple of subsets of the inputs $\Sigma=\left(\Sigma_{1}, . ., \Sigma_{u}\right)$ with $\Sigma_{i} \subseteq\{1, . ., u\},\left|\Sigma_{i}\right|=v ; i=1, . ., u$, bundle selection block computes $B S\left(i n_{1}, . ., i n_{u}\right)=\left(i n_{\sigma_{1,1}}, . ., i n_{\sigma_{1, d}}, . ., i n_{\sigma_{u, 1}}, . ., i n_{\sigma_{u, d}}\right)$, where $\Sigma_{i}=\left\{\sigma_{i, 1}, . ., \sigma_{i, d}\right\} ; i=1, . ., u$.

A simple implementation of a $B S_{u \times d}^{u}$ bundle selection block, $B S_{u \times d}^{u, \text { choice }}$, uses $u$ choice blocks $C_{d}^{u}$ in parallel to choose for each of the $u$ bundles any $d$ of the $u$ inputs. $\left|B S_{u \times d}^{u, \text { choice }}\right|=u \cdot\left|C_{d}^{u}\right|$.

An improved construction for a $B S_{u \times d}^{u}$ bundle selection block, $B S_{u \times d}^{u, \text { perm }}$ can be obtained analogous to the improved selection block $S_{2 u}^{u}$ construction of 10 , Sect. 4.2] from bundle permutation blocks as shown in Fig. 5. $\left|B S_{u \times d}^{u, \text { perm }}\right|=$ $\left|E P_{u d / 2}^{u}\right|+(u d-1) \cdot|Y|+\left|B P_{u \times d}^{u d}\right|=(2.5 d+1) u \log u+(d \log d-d-2) u+3$. The programming of this construction can be reduced to a similar box-packing problem of $u$ boxes with size $c_{i} \in\{0, \cdots, d u\} ; i=1, . ., u$ and $\sum_{i=1}^{u} c_{i}=d u$ in a rectangular $2 \times d u / 2$ grid. This can be efficiently solved with the box-packing algorithm [10, Algorithm 1]. The $S_{2 u}^{u}$ construction of [10, Sect. 4.2] including the optimization of [10, Sect. 4.3] is a special case of the general construction $B S_{u \times d}^{u}$ for $d=2$. Our construction is slightly more efficient due to smaller $B P_{u \times d}^{u d}$.


Fig. 5. Bundle selection block $B S_{u \times d}^{u d}$

Recursive Generalized Universal Block Construction. A generalized UB can be constructed similar to the recursive UB construction of [10, Sect. 3.1] as shown in Fig. 6. The $U_{k \times d}$ block to simulate $k$ gates is composed out of two $U_{k / 2 \times d}$ blocks to simulate $k / 2$ gates each. The upper $U_{k / 2 \times d}$ block's inputs and outputs are directly connected to the first half of the surrounding $U_{k \times d}$ block's interface - the outputs of the lower $U_{k / 2 \times d}$ block provide the second half of the outputs. This construction exactly corresponds to the definition of the interface for the generalized UB. Each input to the lower $U_{k / 2 \times d}$ block can now be either any output of the upper $U_{k / 2 \times d}$ block or the corresponding input from the second half of the outer $U_{k \times d}$ block's inputs. Which value is chosen depends on the topology of the circuit. Any combination can be realized by efficiently programming the $B S_{k / 2 \times d}^{k / 2}$ block and the $M_{k d}$ mixing block correspondingly (similar to [10, Sect. 3.1]).

This construction is applied recursively until $k=1$ where a gate simulation block is used as base of the recursion: $U_{1 \times d}=G$.

$$
\begin{aligned}
\left|U_{k \times d}^{\mathrm{M3}}\right|= & (0.625 d+0.25) k \log ^{2} k+(0.5 d \log d-0.625 d-1.25) k \log k+3 k \\
& -3+k \cdot|G| \sim 0.625 d k \log ^{2} k
\end{aligned}
$$



Fig. 6. Recursive generalized universal block construction

The $U_{k}$ construction of [10, Sect. 3.1] including the optimization of [10, Sect. $4.3]$ is a special case of our general construction $U_{k \times d}$ for $d=2$. Our general construction is slightly more efficient because of our slightly more efficient $B S_{k / 2 \times d}^{k / 2}$.

## D Formulae

Arithmetic series

$$
\begin{equation*}
\sum_{i=0}^{k} i=\frac{k(k+1)}{2} \tag{1}
\end{equation*}
$$

Geometric series

$$
\begin{align*}
\sum_{i=0}^{k} q^{i} & =\frac{1-q^{k+1}}{1-q}  \tag{2}\\
\Rightarrow \sum_{i=1}^{k}\left(\frac{1}{2}\right)^{i} & =-1+\frac{1-\left(\frac{1}{2}\right)^{k+1}}{1-\frac{1}{2}}=1-2^{-k} \tag{3}
\end{align*}
$$

Hypergeometric series

$$
\begin{align*}
S_{k} & =\sum_{i=1}^{k} \frac{i}{2^{i}}=\frac{1}{2}+\frac{1}{2} \sum_{i=1}^{k-1} \frac{i+1}{2^{i}}=\frac{1}{2}+\frac{1}{2} \sum_{i=1}^{k-1}\left(\frac{1}{2}\right)^{i}+\frac{1}{2} \sum_{i=1}^{k-1} \frac{i}{2^{i}} \stackrel{\sqrt[3]{2}}{=} 1-\frac{1}{2^{k}}+\frac{1}{2} S_{k-1} \\
& \stackrel{k \rightarrow \infty}{\Longrightarrow} S_{k} \tag{4}
\end{align*}
$$

## E Circuit Constructions and Sizes

## E. 1 Universal Circuit Constructions

Iterative GUC Constructions.

$$
\begin{aligned}
\left|U C_{k, u, v \times d}^{\mathbf{I 1}}\right| & =k \cdot|G|+\left|S_{v}^{k \geq v}\right|+\sum_{i=0}^{k-1}\left|C_{d}^{u+i, \text { simple }}\right| \\
& =k \cdot|G|+\left|S_{v}^{k \geq v}\right|+\sum_{i=0}^{k-1} d(u+i-d) \\
& \stackrel{\text { 1] }}{ } k \cdot|G|+((k+3 v) \log v+k-4 v+3)+\left(k u d-k d^{2}+d \frac{k(k-1)}{2}\right) \\
& =0.5 d k^{2}+\left(d u-d^{2}-0.5 d+1\right) k+(k+3 v) \log v-4 v+3+k \cdot|G|
\end{aligned}
$$

$$
\begin{aligned}
&\left|U C_{k, u, v \times d}^{\mathbf{1 2}}\right|= k \cdot|G|+\left|S_{v}^{k \geq v}\right|+\sum_{i=0}^{k-1}\left|C_{d}^{u+i, \text { sublin }}\right| \\
&= k \cdot|G|+\left|S_{v}^{k \geq v}\right|+\sum_{i=0}^{k-1}((\log d+1)(u+i)-3 d+2) \\
& \text { (1) } k \cdot|G|+((k+3 v) \log v+k-4 v+3) \\
& \quad+\left((u \log d+u-3 d+2) k+(\log d+1) \frac{k(k-1)}{2}\right) \\
&=(0.5 \log d+0.5) k^{2}+(u \log d+u-0.5 \log d-3 d+2.5) k \\
& \quad+(k+3 v) \log v-4 v+3+k \cdot|G|
\end{aligned}
$$

## Modular GUC Constructions.

$$
\begin{gathered}
\left|U C_{k, u, v \times d}^{\mathrm{Mi}}\right|=\left|S_{d k \geq u}^{u}\right|+\left|U_{k \times d}^{\mathrm{Mi}}\right|+\left|S_{v}^{k \geq v}\right| \\
=((u+d k) \log u+2 d k \log (d k)-2 u-d k+3)+\left|U_{k \times d}^{\mathrm{Mi}}\right| \\
\quad+((k+3 v) \log v+k-4 v+3) \\
=\begin{array}{c}
2 d k \log k+(2 d \log d-d+1) k+(d k+u) \log u \\
\\
\quad+(k+3 v) \log v-2 u-4 v+6+\left|U_{k \times d}^{\mathrm{Mi}}\right|
\end{array} \\
\left\lvert\, \begin{aligned}
&\left|U_{k \times d}^{\mathrm{M} 1}\right|=k \cdot|G|+\sum_{i=0}^{k-1}\left|C_{d}^{d+i, \text { simple }}\right| \\
&=k \cdot|G|+\sum_{i=0}^{k-1} d(d+i-d) \\
& \text { (1) } 0.5 d k^{2}-0.5 d k+k \cdot|G|
\end{aligned}\right.
\end{gathered}
$$

$$
\begin{aligned}
\left|U_{k \times d}^{\mathrm{M} 2}\right| & =k \cdot|G|+\sum_{i=0}^{k-1}\left|C_{d}^{d+i, \text { sublin }}\right| \\
& =k \cdot|G|+\sum_{i=0}^{k-1}((\log d+1)(d+i)-3 d+2) \\
& \stackrel{\text { 11 }}{=} k \cdot|G|+(d \log d-2 d+2) k+(\log d+1) \frac{k(k-1)}{2} \\
& =(0.5 \log d+0.5) k^{2}+(d \log d-0.5 \log d-2 d+1.5) k+k \cdot|G|
\end{aligned}
$$

$$
\begin{aligned}
& \left|U_{k \times d}^{\mathrm{M3}}\right|=2 \cdot\left|U_{k / 2 \times d}\right|+\left|B S_{k / 2 \times d}^{k / 2, \text { perm }}\right|+\left|M_{k d / 2}\right| \\
& =k \cdot|G|+\sum_{i=0}^{\log (k)-1} 2^{i}\left(\left|B S_{k / 2^{i+1} \times d}^{k / 2^{i+1}, \mathbf{p e r m}}\right|+\left|M_{k d / 2^{i+1}}\right|\right) \\
& =k \cdot|G|+\sum_{i=0}^{\log (k)-1} 2^{i}\left((2.5 d+1) \frac{k}{2^{i+1}} \log \frac{k}{2^{i+1}}+(d \log d-d-2) \frac{k}{2^{i+1}}\right. \\
& \left.+3+d \frac{k}{2^{i+1}}\right) \\
& =k \cdot|G|+\sum_{i=0}^{\log (k)-1}\left((2.5 d+1) \frac{k}{2} \log \frac{k}{2^{i+1}}+\frac{k}{2}(d \log d-2)+3 \cdot 2^{i}\right) \\
& \text { (2) } k \cdot|G|+(1.25 d+0.5) k \log ^{2} k+(0.5 d \log d-1) k \log k+3(k-1) \\
& -(1.25 d+0.5) k \sum_{i=1}^{\log (k)} i \\
& \text { (1) }(0.625 d+0.25) k \log ^{2} k+(0.5 d \log d-0.625 d-1.25) k \log k+3 k \\
& -3+k \cdot|G|
\end{aligned}
$$

## E. 2 Circuit Implementation of Neurons

$$
\begin{aligned}
& \left|A D D_{s+1}\right|=|H A|+(s-1) \cdot|F A|=2+4(s-1)=4(s+1)-6 \\
& |\tau|=|H C|+(s+\log d-1) \cdot|F C|=0.5+(s+\log d-1)=s+\log d-0.5 \\
& |\Sigma|=\sum_{i=1}^{\log d} \frac{d}{2^{i}} \cdot\left|A D D_{s+i}\right| \\
& =\sum_{i=1}^{\log d} \frac{d}{2^{i}}(4(s+i)-6) \\
& =(4 s-6) d \sum_{i=1}^{\log d}\left(\frac{1}{2}\right)^{i}+4 d \sum_{i=1}^{\log d} \frac{i}{2^{i}} \\
& \stackrel{\sqrt{3}, \sqrt[4]{4}}{<}(4 s-6) d\left(1-\frac{1}{d}\right)+4 d \cdot 2 \\
& =4 d s+2 d-4 s+6 \\
& \left|N_{d, s}^{\text {block }}\right|=d \cdot|w|+|\Sigma|+|\tau| \\
& <(0.5 d s)+(4 d s+2 d-4 s+6)+(s+\log d-0.5) \\
& =4.5 d s+2 d-3 s+\log d+5.5
\end{aligned}
$$


[^0]:    ${ }^{1}$ Clearly, evaluating $f$ on different inputs $x_{i}$ allows $R$ to learn some information on $f$. Thus, $P$ should restrict the maximum number of evaluations of $f$ by other means.

