
Polynomial Networks

Find out how high-degree polynomial expansions replace activation functions, how tensor decompositions make them tractable, and why they match state-of-the-art architectures on challenging benchmarks.

🔒 Encryption Compatible 📈 >81% ImageNet (MONet) 📐 Sample complexity results ⚡ No elementwise Activation Functions
Hero figure: the degree-\(k\) polynomial parameter tensor \(\cW^{[n]} \in \Real^{d \times d \times \cdots \times d}\) and its factorized evaluation \(f(\bz) = \bC \bigl(\had_i\, \bU_i\bz\bigr)\).


Section 1

Why Polynomial Networks?

Standard deep networks compose linear transformations with pointwise nonlinear activations. Polynomial networks offer a fundamentally different paradigm.

The Core Idea

A polynomial network is a function approximator whose output is an explicit, high-degree multivariate polynomial of the input. Instead of relying on nonlinear activation functions, they use the Hadamard (element-wise) product of linear projections to generate polynomial interactions.

"If we multiply two linear maps element-wise, we get a quadratic polynomial. Compose this idea recursively through \(k\) layers, and the output becomes a polynomial of degree \(k\), without a single activation function."

Building Up: How Degrees Share Structure

One of the most important intuitions in polynomial networks is how higher-degree models build upon and share structure with lower-degree ones. Think of it like a layered cake:

  • Degree 1 uses a single learnable factor \(\bU_{[1]}\). This gives a linear map: \(\bx_1 = \bU_{[1]}^T\bz\).
  • Degree 2 reuses the exact same \(\bU_{[1]}\) from degree 1 and adds one new factor \(\bU_{[2]}\). The second-degree term is formed by multiplying \(\bU_{[2]}^T\bz\) element-wise with the first-degree output \(\bx_1\).
  • Degree 3 reuses \(\bU_{[1]}\) and \(\bU_{[2]}\) from degree 2, adding only one new factor \(\bU_{[3]}\). Each new degree introduces just one new factor matrix.

In general, to go from degree \(N-1\) to degree \(N\), the only truly new parameters are a single new factor matrix \(\bU_{[N]}\) — all previous factors \(\bU_{[1]}, \ldots, \bU_{[N-1]}\) are shared. This recursive sharing means a degree-\(N\) model has only \(O(N)\) factor matrices, not the \(O(d^N)\) parameters of a naive polynomial.

Notice that this is only one way of sharing structure; several alternatives exist, and some of them have been explored in the literature.


Historical Roots

The concept of learnable polynomial relationships predates modern deep learning by decades. Ivakhnenko (1971) [1] introduced the Group Method of Data Handling (GMDH), often cited as one of the earliest deep learning architectures, which learned quadratic relationships between predefined input pairs. Shin & Ghosh (1991) [2] formalized this into the Pi-Sigma Network. Neither approach scaled to high-dimensional signals like images.

For more thorough historical reviews of polynomial networks, please see references [21] (sec. V, subsection F) and [22] (sec. 2). Additionally, Fan et al. [26] provide a detailed analysis of the parametric efficiency of quadratic networks, showing that a single quadratic neuron can represent functions that would require exponentially many standard neurons.

In this tutorial, we review more recent works, starting from Chrysos et al. (2020) [3], who observed that prominent architectures, such as StyleGAN [4], already implicitly compute polynomial operations. They formalized this connection and introduced Π-Nets, a general framework that links tensor decompositions to tractable polynomial architectures. Subsequent work established the first theoretical guarantees [5,6] and a state-of-the-art architecture that competes with transformers [7].

Four Key Advantages

Why would we want to use this class of functions?
🔒 Encrypted Computation (FHE)

Leveled Fully Homomorphic Encryption [8] supports only addition and multiplication. Standard networks with ReLU require costly polynomial approximations. Polynomial networks run natively under FHE without any modification.

📈 High-Frequency Learning

Standard networks have a well-documented spectral bias toward low-frequency functions. Polynomial networks provably learn high-frequency signals faster, as shown by their NTK analysis [6]. This explains their empirical superiority in image generation and face recognition, where fine-grained details matter.

🔍 Symbolic Interpretability

Because the output is an explicit polynomial, a trained polynomial network can be exploited in polynomial neural ODEs [23]. MONet [7] recovers the exact equations of a dynamic system (e.g., Lotka–Volterra) from data — to five decimal places.

🏆 Strong Performance

MONet (ICLR 2024) [7] achieves 81.3% top-1 accuracy on ImageNet-1K with no activation functions, matching DeiT-S at a similar scale.


Section 2

The Hadamard Product

The Hadamard product is the elementary operation that turns linear transformations into polynomial ones. Understanding it deeply is the key to everything that follows.

Definition — Hadamard Product

For two vectors \(\bu, \bv \in \Real^m\), the Hadamard product \(\bu \had \bv \in \Real^m\) is defined element-wise:

\[(\bu \had \bv)_i = u_i \cdot v_i, \quad i = 1, \ldots, m.\]

For matrices \(\bA, \bB \in \Real^{m \times n}\) of the same size:

\[(\bA \had \bB)_{ij} = a_{ij} \cdot b_{ij}.\]

The symbol \(\had\) denotes the Hadamard product throughout this tutorial, consistent with the notation in [3,9, 10, 22].

Figure 1. The Hadamard product \(\bu \had \bv\) multiplies corresponding entries. The output has the same shape as both inputs — no expansion of dimensions.

Intuition: Multiplying Creates Polynomials

Take an input \(\bz\) and apply two separate linear projections: \(\bu = \bU\bz\) and \(\bv = \bV\bz\). Each entry of \(\bu\) and \(\bv\) is a linear function of \(\bz\). Now multiply them element-wise:

\[(\bu \had \bv)_i \;=\; u_i \cdot v_i \;=\; \underbrace{(\text{linear in }\bz)}_{\text{degree 1}} \times \underbrace{(\text{linear in }\bz)}_{\text{degree 1}} \;=\; \text{degree 2 in }\bz.\]

Each entry is now quadratic in the input — without any activation function! Repeat with a third projection and you get degree 3, and so on.

How the Hadamard Product Creates Polynomial Interactions

This is the central insight of the entire tutorial. Let \(\bz \in \Real^d\) be the network input, and apply two independent linear projections: \(\bu = \bU\bz\) and \(\bv = \bV\bz\), where \(\bU, \bV \in \Real^{m \times d}\). The \(i\)-th entry of their Hadamard product is:

\[ (\bu \had \bv)_i = (\bU\bz)_i \cdot (\bV\bz)_i = \underbrace{\Bigl(\sum_{j=1}^d U_{ij} z_j\Bigr)}_{\text{linear in }\bz} \cdot \underbrace{\Bigl(\sum_{k=1}^d V_{ik} z_k\Bigr)}_{\text{linear in }\bz} = \sum_{j,k=1}^d U_{ij} V_{ik}\, z_j z_k. \]

Each entry is a sum of products of two input coordinates — exactly a degree-2 polynomial in \(\bz\), with no activation function involved. Iterating \(k\) times:

\[ \Bigl(\bU_1\bz \had \bU_2\bz \had \cdots \had \bU_k\bz\Bigr)_i = \prod_{\ell=1}^k (\bU_\ell \bz)_i = \sum_{j_1,\ldots,j_k} U_{1,ij_1} \cdots U_{k,ij_k}\, z_{j_1}\cdots z_{j_k}, \]

a degree-\(k\) polynomial in \(\bz\).

Key Insight

Each application of the Hadamard product between two linear maps increases the polynomial degree by one. After \(k\) applications, the output is a degree-\(k\) polynomial of the input.
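This key insight is easy to check numerically. Below is a minimal sketch (the helper name `hadamard_poly` is ours, for illustration) that forms the iterated Hadamard product of \(k\) linear projections and verifies the degree-\(k\) homogeneity \(f(c\bz) = c^k f(\bz)\):

```python
import torch

def hadamard_poly(z, Us):
    """Iterated Hadamard product of linear projections:
    (U_1 z) * (U_2 z) * ... * (U_k z), a degree-k polynomial of z."""
    out = Us[0] @ z
    for U in Us[1:]:
        out = out * (U @ z)  # each extra factor raises the degree by one
    return out

torch.manual_seed(0)
d, m, k = 5, 3, 4
Us = [torch.randn(m, d) for _ in range(k)]
z = torch.randn(d)

# A homogeneous degree-k polynomial satisfies f(c z) = c^k f(z).
c = 2.0
lhs = hadamard_poly(c * z, Us)
rhs = (c ** k) * hadamard_poly(z, Us)
```

Scaling the input by 2 scales the output by \(2^4 = 16\) exactly, confirming the degree-4 structure without any activation function.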

Hadamard Products in Modern Architectures

The polynomial structure created by Hadamard products is ubiquitous in deep learning, often hidden in plain sight. Recognizing these connections was a key contribution of the polynomial networks program [3,9,10, 22]:

| Architecture | Where \(\had\) Appears | Effect |
|---|---|---|
| StyleGAN [4] | AdaIN: \(\hat{h} = \gamma(\bz) \had \text{norm}(h) + \beta(\bz)\) | Style-conditional generation; each AdaIN is an NCP block |
| LSTM / GRU | Gating: \(h_t = f_t \had h_{t-1} + i_t \had \tilde{c}_t\) | Selective memory via multiplicative interactions |
| SENet / CBAM | Channel attention: \(\bx \had \sigma(\text{FC}(\text{pool}(\bx)))\) | Feature recalibration |
| Non-local blocks | Poly-NL [11]: approximated by a 3rd-degree polynomial | Captures long-range dependencies at linear complexity |
| Multiplicative interactions [12] | Second-degree: \((\bW_1 \bz) \had (\bW_2 \bz)\) | Enlarged hypothesis space; faster learning of certain function classes |

The key realization of [3] is that StyleGAN's AdaIN layers are exactly instances of the NCP model (§3): the output of each resolution block is a polynomial of the input latent code \(\bz\).

Relationship to Other Matrix Products (Advanced)

The Hadamard product is one member of a family of fundamental products, all of which appear in the theory of polynomial networks [5]:

| Symbol | Name | Definition | Output Shape |
|---|---|---|---|
| \(\bA \had \bB\) | Hadamard (element-wise) | \((\bA \had \bB)_{ij} = a_{ij} b_{ij}\) | \(m \times n\) |
| \(\bA \otimes \bB\) | Kronecker | Block matrix \([a_{ij}\bB]\) | \(mp \times nq\) |
| \(\bA \kr \bB\) | Khatri–Rao (column-wise) | Column \(j\): \(\mathbf{a}_j \otimes \mathbf{b}_j\) | \(mp \times n\) |
| \(\bA \fs \bB\) | Face-splitting | Row \(i\): \(\mathbf{a}_i^T \otimes \mathbf{b}_i^T\) | \(m \times n^2\) |

The mixed product property connecting Hadamard and Khatri–Rao products [13] is used in the generalization proofs of §6:

\[(\bA_1\bB_1) \had (\bA_2\bB_2) = (\bA_1 \fs \bA_2)(\bB_1 \kr \bB_2).\]

Brief Intro to Tensors

Already comfortable with tensors and CP decompositions? Jump to §3 — Polynomial Network Architectures.

What Is a Tensor?

A tensor is a multi-dimensional array of numbers — a generalization of vectors (1-D) and matrices (2-D) to arbitrary order:

  • 0️⃣ Scalar (order 0): a single number \(a \in \Real\)
  • 1️⃣ Vector (order 1): a 1-D array \(\mathbf{v} \in \Real^d\)
  • 2️⃣ Matrix (order 2): a 2-D array \(\mathbf{W} \in \Real^{m \times n}\)
  • 🔢 \(N\)th-order tensor: an \(N\)-D array \(\cX \in \Real^{I_1 \times I_2 \times \cdots \times I_N}\)
Figure 2. Tensors of increasing order. The polynomial parameter tensor \(\cW^{[n]}\) for a degree-\(n\) network is an \(n\)th-order tensor.

Why do we need Tensors?

A degree-\(N\) multivariate polynomial needs one learnable coefficient for every monomial \(z_{i_1} z_{i_2} \cdots z_{i_N}\); intuitively, we scale each monomial by a learnable parameter. Doesn't this create a lot of learnable parameters? Yes, it does: the count grows exponentially with the degree. This is why we need parameter-reduction techniques, e.g., parameter sharing. This is the tensor-factorization part, which we can skip for now; it suffices to know that it reduces the learnable parameters significantly.

Building Blocks: Parameter Sharing Across Degrees

A crucial design insight is parameter sharing. Consider building a degree-3 polynomial. The degree-1 term uses a projection matrix \(\bU_1\). The degree-2 term reuses \(\bU_1\) and adds a new \(\bU_2\). The degree-3 term reuses both and adds only \(\bU_3\).

To represent a degree-\(N\) multivariate polynomial \(G: \Real^d \to \Real^o\), we need one coefficient for every monomial \(z_{i_1} z_{i_2} \cdots z_{i_N}\), \(i_1,\ldots,i_N \in \{1,\ldots,d\}\). For each output, these coefficients naturally form an \(N\)th-order tensor; stacking all \(o\) outputs gives \(\cW^{[N]} \in \Real^{o \times d \times \cdots \times d}\):

\[ G(\bz) = \sum_{n=1}^{N} \cW^{[n]} \prod_{j=2}^{n+1} \times_j \bz + \bbeta, \quad \cW^{[n]} \in \Real^{o \times d^n}, \]

where \(\times_j\) denotes the mode-\(j\) tensor-vector product. For output \(x_j\), this expands to:

\[ x_j = G(\bz)_j = \beta_j + \bw_j^{[1]T}\bz + \bz^T \bW_j^{[2]} \bz + \cW_j^{[3]} \times_2 \bz \times_3 \bz + \cdots \]

The tensor \(\cW^{[n]}\) has \(d^n\) entries — exponential in \(n\). Tensor decompositions impose structured low-rank constraints to make this tractable.

Mode-m Product

The mode-\(m\) product of tensor \(\cX \in \Real^{I_1 \times \cdots \times I_N}\) with matrix \(\bU \in \Real^{J \times I_m}\) produces a tensor of shape \(I_1 \times \cdots \times I_{m-1} \times J \times I_{m+1} \times \cdots \times I_N\):

\[(\cX \times_m \bU)_{i_1 \cdots i_{m-1}\, j\, i_{m+1} \cdots i_N} = \sum_{i_m=1}^{I_m} x_{i_1 \cdots i_N}\, u_{j,i_m}.\]
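The mode-\(m\) product can be written in a few lines with `np.tensordot`; here is a minimal sketch (using 0-indexed modes, and the helper name `mode_product` is ours):

```python
import numpy as np

def mode_product(X, U, mode):
    """Mode product X x_m U with 0-indexed modes: contract the columns of
    U against axis `mode` of X, keeping the new axis in the same position."""
    out = np.tensordot(U, X, axes=(1, mode))  # contracted axis lands in front
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))   # I1 x I2 x I3
U = rng.standard_normal((5, 3))      # J x I2

Y = mode_product(X, U, 1)            # shape (2, 5, 4)
```

For a matrix (order-2 tensor), the mode-0 product reduces to the familiar matrix product \(\bU\bX\).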

Tensor Decompositions

Tensor decompositions impose structured low-rank factorizations that reduce the exponential parameter cost to polynomial — making polynomial networks practical.

CP Decomposition

The CANDECOMP/PARAFAC (CP) decomposition [14] expresses any tensor as a sum of rank-one components, each of which is the outer product of vectors:

CP Decomposition

A tensor \(\cX \in \Real^{I_1 \times \cdots \times I_N}\) with CP rank \(R\):

\[\cX = \sum_{r=1}^{R} \mathbf{a}_r^{(1)} \otimes \mathbf{a}_r^{(2)} \otimes \cdots \otimes \mathbf{a}_r^{(N)},\]

where \(\mathbf{a}_r^{(n)} \in \Real^{I_n}\). Element-wise: \(x_{i_1,\ldots,i_N} = \sum_{r=1}^R \prod_{n=1}^N a_{i_n,r}^{(n)}\).

Parameter count: \(R \sum_{n=1}^N I_n\) instead of \(\prod_{n=1}^N I_n\) — a dramatic reduction when \(R \ll \min_n I_n^{N-1}\).
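A short numeric sketch of the CP format for a 3rd-order tensor, checking the element-wise formula and the parameter count (variable names are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3

# Factor matrices of a rank-R CP decomposition of a 3rd-order tensor
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# Sum of R rank-one terms a_r (outer) b_r (outer) c_r, as one einsum
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Element-wise formula: x_{ijk} = sum_r A[i,r] * B[j,r] * C[k,r]
i, j, k = 1, 2, 3
entry = sum(A[i, r] * B[j, r] * C[k, r] for r in range(R))

cp_params = R * (I + J + K)  # 45 parameters in the factors
dense_params = I * J * K     # 120 entries in the dense tensor
```

Even at this toy scale, the factors hold 45 parameters versus 120 dense entries; the gap widens rapidly as the dimensions and order grow.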

Tucker and Tensor Train

Two other decompositions are relevant to polynomial networks:

Tucker Decomposition

\[\cX = \cG \times_1 \bA^{(1)} \times_2 \bA^{(2)} \times_3 \bA^{(3)},\] where \(\cG\) is a dense core tensor and \(\bA^{(n)}\) are factor matrices. CP is a special case where \(\cG\) is superdiagonal. Tucker is more expressive but requires more parameters.

Tensor Train (TT)

\[x_{i_1,\ldots,i_N} = G_1(i_1)\,G_2(i_2)\cdots G_N(i_N),\] where each \(G_n(i_n) \in \Real^{r_{n-1} \times r_n}\) is a matrix slice of a 3rd-order core. Parameters scale as \(O(N I r^2)\) vs. \(O(I^N)\). TT is the basis for TT-LSTM and sequential polynomial models.
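The TT evaluation rule is literally a chain of small matrix products. A minimal sketch (core shapes and names are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, I = 4, 3
ranks = [1, 2, 2, 2, 1]  # TT ranks, with r_0 = r_N = 1

# One 3rd-order core per mode: cores[n] has shape (r_{n-1}, I, r_n)
cores = [rng.standard_normal((ranks[n], I, ranks[n + 1])) for n in range(N)]

def tt_entry(cores, idx):
    """x_{i1...iN} = G_1(i1) G_2(i2) ... G_N(iN), a chain of matrix products."""
    out = np.eye(1)
    for G, i in zip(cores, idx):
        out = out @ G[:, i, :]  # multiply by the i-th matrix slice of core G
    return out.item()           # the chain collapses to a 1x1 matrix

val = tt_entry(cores, (0, 1, 2, 1))
```

Each entry costs only \(O(N r^2)\) operations, and no dense \(I^N\) tensor is ever materialized.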


Section 3

Polynomial Network Architectures

Multiple architectures emerge from different tensor decompositions of the polynomial parameter tensor.


Notation Summary

| Symbol | Space | Meaning |
|---|---|---|
| \(\bzeta\) | \(\Real^\delta\) | Raw input to the polynomial approximator |
| \(\bz = [\bzeta^T, 1]^T\) | \(\Real^d\), \(d=\delta+1\) | Augmented input (absorbs bias terms) |
| \(\cW^{[n]}\) | \(\Real^{o \times d^n}\) | Degree-\(n\) polynomial parameter tensor (calligraphic = tensor) |
| \(\bU_{[n]}\) | \(\Real^{d \times m}\) | Factor matrices (CP decomposition), \(m\) = hidden rank |
| \(\bC\) | \(\Real^{o \times m}\) | Output/readout matrix |
| \(\bx_n\) | \(\Real^m\) | Hidden state at degree level \(n\) |
| \(k\) | \(\mathbb{N}\) | Polynomial degree |
| \(\had\) | | Hadamard (element-wise) product |

The Full Polynomial (Starting Point)

As usual in machine learning, we aim to learn a function that maps a high-dimensional input to a (possibly high-dimensional) prediction. Concretely, we want to learn \(G: \Real^\delta \to \Real^o\) such that each output \(x_j\) is a polynomial of all inputs \(\zeta_i\) (following [3], eq. (3)):

\[ x_j = G(\bzeta)_j = \beta_j + \bw_j^{[1]T}\bzeta + \bzeta^T\bW_j^{[2]}\bzeta + \cW_j^{[3]} \times_2\bzeta \times_3\bzeta + \cdots \]

More compactly, vectorizing over all outputs (eq. (4) of [3]):

\[ \bx = G(\bzeta) = \sum_{n=1}^{N} \left(\cW^{[n]}\prod_{j=2}^{n+1}\times_j \bzeta\right) + \bbeta, \quad \cW^{[n]} \in \Real^{o \times \delta^n}. \]

This has \(O(\delta^n)\) parameters for the degree-\(n\) term alone. Tensor decompositions reduce this to practical sizes.

Key Lemma — CP Decomposition to Hadamard Products

The critical mathematical insight connecting CP decompositions to Hadamard-product architectures is the following (Lemma 1 in [3]):

You can convert a product of Khatri-Rao products into Hadamard products, enabling an implementation in most deep learning frameworks. Concretely, for the sets of matrices \(\{\mathbf{R}^{[\nu]} \in \Real^{I_{\nu} \times K} \}_{\nu=1}^N\) and \(\{\mathbf{Q}^{[\nu]} \in \Real^{I_{\nu} \times L} \}_{\nu=1}^N\), it holds that

\[ \left(\bigodot_{\nu=1}^N \mathbf{R}^{[\nu]}\right)^T \left(\bigodot_{\nu=1}^N \mathbf{Q}^{[\nu]}\right) = \bigl(\mathbf{R}^{[1]T}\mathbf{Q}^{[1]}\bigr) \had \bigl(\mathbf{R}^{[2]T}\mathbf{Q}^{[2]}\bigr) \had \cdots \had \bigl(\mathbf{R}^{[N]T}\mathbf{Q}^{[N]}\bigr), \]

where \(\bigodot\) denotes the Khatri-Rao product.
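The lemma is easy to sanity-check numerically. A minimal sketch for \(N = 2\), with a hand-rolled `khatri_rao` helper (our own, not a library call):

```python
import torch

def khatri_rao(R1, R2):
    """Column-wise Khatri-Rao product: column j is the Kronecker
    product of the j-th columns of R1 and R2."""
    I1, K = R1.shape
    I2, _ = R2.shape
    return torch.einsum('ik,jk->ijk', R1, R2).reshape(I1 * I2, K)

torch.manual_seed(0)
I1, I2, K, L = 3, 4, 5, 6
R1, R2 = torch.randn(I1, K), torch.randn(I2, K)
Q1, Q2 = torch.randn(I1, L), torch.randn(I2, L)

# Left side: one large (I1*I2 x K)^T (I1*I2 x L) product
lhs = khatri_rao(R1, R2).T @ khatri_rao(Q1, Q2)
# Right side: Hadamard product of the two small K x L products
rhs = (R1.T @ Q1) * (R2.T @ Q2)
```

The right side is what actually gets executed in a deep learning framework: two small matrix products and an element-wise multiplication, never the large Khatri-Rao matrices.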

CCP Model — Coupled CP Decomposition

The Coupled CP-Decomposition (CCP) model [3] is the most fundamental model: it applies a coupled CP decomposition across all polynomial degrees simultaneously, sharing factor matrices. After a bit of algebra, the recursive equations (eq. (6) in [3]) are:

Definition — CCP Model (Chrysos et al., CVPR 2020)
\[ \bx_1 = \bU_{[1]}^T\bzeta, \qquad \bx_n = \bigl(\bU_{[n]}^T\bzeta\bigr) \had \bx_{n-1} + \bx_{n-1}, \quad n = 2,\ldots,k, \] \[ f(\bzeta) = \bQ\bx_k + \bpsi, \]

where \(\bU_{[n]} \in \Real^{\delta \times m}\), \(\bQ \in \Real^{o \times m}\), and \(\bpsi \in \Real^o\) are learnable parameters. The hidden rank \(m\) controls the model capacity.

Expansion: The skip term \(+\bx_{n-1}\) ensures that all polynomial degrees from 1 to \(k\) are represented in the output. Without the skip, the model would be purely degree-\(k\).

Intuitively, instead of factorizing each parameter tensor individually, the coupled CP decomposition jointly factorizes all of them with a specific pattern of factor sharing: the factors corresponding to lower-degree terms are shared across all parameter tensors. Each new, higher-degree term is then built by multiplying the previous lower-degree output with a fresh projection of the input. By repeatedly applying this recursive step, the model constructs a very high-degree polynomial expansion while avoiding the combinatorial explosion of parameters that accommodating all higher-order correlations of the input would otherwise require.

NCP Model — Nested Coupled CP Decomposition

The Nested Coupled CP (NCP) model [3] uses a joint hierarchical decomposition that introduces learnable mixing matrices \(\bS_{[n]}\) between levels. This gives strictly richer polynomial interactions (eq. (9) in [3]):

Definition — NCP Model (Chrysos et al., CVPR 2020)
\[ \bx_1 = \bigl(\bA_{[1]}^T\bzeta\bigr) \had \bigl(\bB_{[1]}^T\bb_{[1]}\bigr), \] \[ \bx_n = \bigl(\bA_{[n]}^T\bzeta\bigr) \had \bigl(\bS_{[n]}^T\bx_{n-1} + \bB_{[n]}^T\bb_{[n]}\bigr), \quad n \geq 2, \] \[ f(\bzeta) = \bQ\bx_k + \bpsi, \]

with learnable \(\bA_{[n]} \in \Real^{\delta \times m}\), \(\bS_{[n]} \in \Real^{m \times m}\), \(\bB_{[n]} \in \Real^{\omega \times m}\), \(\bb_{[n]} \in \Real^\omega\), \(\bQ \in \Real^{o \times m}\), \(\bpsi \in \Real^o\). The NCP-Skip variant additionally adds \(\bV_{[n]}\bx_{n-1}\) as a residual connection.

Product of Polynomials (ProdPoly)

Instead of using a single high-degree polynomial, we can express the function as a product of lower-degree polynomials [3]. The product is implemented as successive second-degree polynomials where the output of the \(i\)-th polynomial is the input for the \((i+1)\)-th polynomial. Stacking \(N\) such polynomials (each of degree 2) yields an overall expansion of degree \(2^N\) — achieving exponentially high degrees with linear depth.

Key Benefit of ProdPoly

The product of polynomials has two main advantages: (a) each polynomial can use a different decomposition (CCP, NCP, or NCP-Skip) and expansion order, providing flexibility; (b) it requires far fewer parameters to achieve the same overall degree than a single polynomial.
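The degree-doubling effect of stacking can be sketched with two purely homogeneous degree-2 blocks (names hypothetical; real ProdPoly blocks also carry skip and bias terms):

```python
import torch

def quadratic_block(z, U1, U2):
    """A purely homogeneous degree-2 block: (U1 z) * (U2 z)."""
    return (U1 @ z) * (U2 @ z)

torch.manual_seed(0)
d, m = 4, 6
U1, U2 = torch.randn(m, d), torch.randn(m, d)
V1, V2 = torch.randn(m, m), torch.randn(m, m)
z = torch.randn(d)

# Feed the output of the first degree-2 polynomial into a second one
f = quadratic_block(quadratic_block(z, U1, U2), V1, V2)

# The composition is homogeneous of degree 2 * 2 = 4: f(2z) = 2^4 f(z)
g = quadratic_block(quadratic_block(2 * z, U1, U2), V1, V2)
```

Stacking \(N\) such blocks yields degree \(2^N\) with only \(O(N)\) factor matrices, which is the parameter saving the text describes.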

Beyond Single-Input Polynomials: Conditional Generation

The models described so far take a single input variable \(\bzeta\) and produce a single output. However, many real-world tasks require multiple inputs. For instance, in conditional image generation, a generator receives both a noise vector \(\bzeta_I\) (controlling stochastic variation) and a conditional variable \(\bzeta_{II}\) (e.g., a class label, a low-resolution image, or an edge map). Simply concatenating these inputs into one vector discards their distinct roles and spatial structure.

To address this, CoPE [9] introduced a framework that expresses the output as a multivariate polynomial of two (or more) input variables. CoPE captures both auto-correlations within each variable and cross-correlations between them, with a recursive formulation:

\[ \bx_n = \bx_{n-1} + \bigl(\bU_{[n,I]}^T \bzeta_I + \bU_{[n,II]}^T \bzeta_{II}\bigr) \had \bx_{n-1}, \]

where each input variable \(\bzeta_I, \bzeta_{II}\) is embedded separately, allowing different constraints (e.g., \(\bU_{[n,I]}\) as a dense layer for noise, \(\bU_{[n,II]}\) as a convolution for images).
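A minimal sketch of the CoPE recursion, assuming dense embeddings for both inputs (the `CoPEBlock` class and its names are ours, not the authors' code):

```python
import torch
import torch.nn as nn

class CoPEBlock(nn.Module):
    """Minimal sketch of the CoPE recursion:
    x_n = x_{n-1} + (U_{n,I}^T z_I + U_{n,II}^T z_II) * x_{n-1}."""
    def __init__(self, d1, d2, hidden, degree=3):
        super().__init__()
        self.embed_I = nn.ModuleList(
            [nn.Linear(d1, hidden, bias=False) for _ in range(degree)])
        self.embed_II = nn.ModuleList(
            [nn.Linear(d2, hidden, bias=False) for _ in range(degree)])

    def forward(self, z1, z2):
        # x_1: degree-1 in both input variables
        x = self.embed_I[0](z1) + self.embed_II[0](z2)
        for U, V in zip(self.embed_I[1:], self.embed_II[1:]):
            # Each step adds auto- and cross-correlations of z1 and z2
            x = x + (U(z1) + V(z2)) * x
        return x

block = CoPEBlock(d1=8, d2=5, hidden=16)
out = block(torch.randn(2, 8), torch.randn(2, 5))  # shape (2, 16)
```

In the full model, the dense layers for the conditional input would be swapped for convolutions, exactly as the text suggests.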

Decoupled Degrees: Augmenting Deep Classifiers

In the standard CCP and NCP models, all polynomial degrees share factor matrices — the same \(\bU_{[1]}\) contributes to the first-degree, second-degree, and all higher-degree terms. But what if different degrees should capture fundamentally different factors?

[10] explored this idea to allow each polynomial degree to have its own independent parameters, so that lower-degree terms can capture coarse, global structure while higher-degree terms independently learn fine-grained discriminative features. This decoupling enables each degree to specialize.

Beyond the Real Domain: Complex-Valued Polynomials

Are polynomial networks limited to real-valued inputs and outputs? The answer is no. In [25], the polynomial framework was extended to the complex domain. Audio signals are naturally represented in the frequency domain as complex-valued spectrograms. By defining polynomial expansions over complex-valued tensors, the same decomposition principles apply — complex-valued Hadamard products create polynomial interactions in the complex field. This demonstrates that the theoretical and architectural advantages of polynomial networks are not restricted to real-valued data and can be leveraged for signals in any field where multiplication is well-defined.

R-PolyNets & D-PolyNets — Regularized Polynomial Networks (CVPR 2023)

A critical challenge for polynomial networks in discriminative tasks (classification) is achieving performance comparable to standard deep networks like ResNet. [17] identified regularization as a bottleneck and introduced two new model families:

R-PolyNets (Regularized)

R-PolyNets include two regularization matrices \(\boldsymbol{\Phi}\) and \(\boldsymbol{\Psi}\) in the recursive formula, enabling different normalization schemes:

\[\mathbf{x}_n = \bigl(\boldsymbol{\Phi}\bU_{[n]}^T\bzeta\bigr) \had \bigl(\boldsymbol{\Psi}\bS_{[n]}^T\mathbf{x}_{n-1} + \bB_{[n]}^T\bb_{[n]}\bigr) + \mathbf{x}_{n-1}\]

Notice how R-PolyNets directly extend the NCP layer from above.

D-PolyNets (Dense)

D-PolyNets add dense connections across polynomials, inspired by DenseNet. The output of the \(i\)-th polynomial feeds into the Hadamard product of subsequent polynomials:

\[\mathbf{x}_n^{[i]} = \mathbf{x}_{n-1}^{[i]} + \bigl(\boldsymbol{\Phi}\bU_{[n]}^T\bzeta\bigr) \had \bigl(\boldsymbol{\Psi}\bS_{[n]}^T\mathbf{x}_{n-1} + \mathbf{K}_{[n]}^T\mathbf{k}_{[n]}\bigr) \had_{{\tau}=1}^{i-1}\mathbf{x}^{[\tau]}\]

Dense connections enable higher-degree expansions with fewer parameters and improve expressiveness over R-PolyNets.

MONet — Multilinear Operator Network (ICLR 2024)

Do polynomial networks perform on par with transformers or convolutional networks?

Before MONet, not quite.

To close this gap, the architecture was redesigned from the ground up as MONet, placing the Hadamard product at its core. This simple element-wise multiplication enables the network to efficiently capture complex interactions and rival the performance of state-of-the-art models.

MONet [7] departs from the vector-input paradigm and treats the input as a sequence of tokens \(\bX \in \Real^{d \times s}\), consistent with modern vision architectures. Its core layer is the Mu-Layer:

Definition — Mu-Layer (Cheng et al., ICLR 2024)
\[\bY = \bC\bigl[(\bA\bX) \had (\bB\bD\bX) + \bA\bX\bigr],\]

where \(\bA \in \Real^{m \times d}\), \(\bB \in \Real^{m \times l}\), \(\bC \in \Real^{o \times m}\), \(\bD \in \Real^{l \times d}\) are learnable.

Two branches: Branch 1 (\(\bA\bX\), rank \(m\)) captures the primary representation. Branch 2 (\(\bB\bD\bX\), rank \(l \ll m\)) provides a low-dimensional modulation. Their Hadamard product creates quadratic interactions; the skip connection \(+\bA\bX\) preserves linear terms.
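The two-branch structure can be sketched in a few lines (the `MuLayer` class and its initialization scale are our assumptions, not the official implementation):

```python
import torch
import torch.nn as nn

class MuLayer(nn.Module):
    """Sketch of the Mu-Layer: Y = C[(A X) * (B D X) + A X]."""
    def __init__(self, d, m, l, o):
        super().__init__()
        scale = 0.1  # small init, an arbitrary choice for this sketch
        self.A = nn.Parameter(torch.randn(m, d) * scale)
        self.B = nn.Parameter(torch.randn(m, l) * scale)
        self.C = nn.Parameter(torch.randn(o, m) * scale)
        self.D = nn.Parameter(torch.randn(l, d) * scale)

    def forward(self, X):               # X: (d, s) matrix of s tokens
        ax = self.A @ X                 # primary branch, rank m
        bx = self.B @ (self.D @ X)      # low-rank modulation branch, l << m
        return self.C @ (ax * bx + ax)  # quadratic interactions + linear skip

layer = MuLayer(d=16, m=32, l=4, o=16)
Y = layer(torch.randn(16, 10))  # shape (16, 10)
```

The low rank \(l \ll m\) keeps the modulation branch cheap while the Hadamard product still injects the quadratic interactions described above.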


Section 4

Theoretical Guarantees

Can we prove generalization and robustness guarantees for this model class or compare their properties with fully-connected networks?

Interesting properties and separations exist between polynomial networks and regular fully-connected networks, e.g., in the spectral bias [6] or in extrapolation [16]. In particular, for extrapolation it was shown that polynomial networks can better approximate more complex functions.

Rademacher Complexity

Definition — Empirical Rademacher Complexity

For dataset \(\{\bz_1,\ldots,\bz_n\}\) and i.i.d. Rademacher variables \(\{\sigma_j\} \in \{-1,+1\}^n\):

\[\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma\Bigl[\sup_{f \in \mathcal{F}} \tfrac{1}{n}\sum_{j=1}^n \sigma_j f(\bz_j)\Bigr].\]

Small Rademacher complexity implies good generalization.

Theorem 1 — Rademacher Complexity of CCP (Zhu et al., ICLR 2022 [5])

Let \(\mathcal{F}^k_{\mathrm{CCP}}\) be the class of degree-\(k\) CCP models with \(\|\bc\|_1 \leq \mu\) and \(\|\mathop{\fs}_{i=1}^k \bU_i\|_1 \leq \lambda\), evaluated on inputs with \(\|\bz_j\|_1 \leq 1\). Then:

\[\hat{\mathcal{R}}_n(\mathcal{F}^k_{\mathrm{CCP}}) \leq 2\mu\lambda\sqrt{\frac{2k\log d}{n}}.\]

This grows as \(O(\sqrt{k\log d})\) — dramatically slower than the \(d^k\) parameter count of the naïve polynomial.

Theorem 2 — Lipschitz Constant and Robustness (Zhu et al., ICLR 2022 [5])

The \(\ell_\infty\)-Lipschitz constant of the CCP model satisfies:

\[\mathrm{Lip}_\infty(f) \leq k\|\bC\|_1 \prod_{i=1}^k \|\bU_i\|_1.\]

The same quantities \(\|\bU_i\|_1\) that control generalization (Theorem 1) also control robustness. This motivates a single regularization strategy for both.

Spectral Bias & Extrapolation

Standard Networks: Low-Frequency Bias

Neural networks trained with gradient descent exhibit a spectral bias: they learn low-frequency functions first [15]. In the Neural Tangent Kernel (NTK) framework, this is explained by the eigenvalue decay rate of the NTK — high-frequency eigenfunctions have smaller eigenvalues and are therefore learned exponentially slower.

Theorem — Spectral Advantage of Π-Nets (Choraria et al., ICLR 2022 [6])

For the Π-Net NTK \(\kappa_\pi\), its eigenvalues \(\{\mu_{\pi,k}\}\) (with data uniform on the unit sphere) satisfy, for \(k \gg d\):

\[\mu_{\pi,k} = \Omega(k^{-d/2+2}), \quad \text{vs.}\quad \mu_k^{\mathrm{ReLU}} = \Omega(k^{-d}).\]

The slower decay of \(\mu_{\pi,k}\) implies a larger RKHS — more high-frequency functions are well-approximated. This gives polynomial networks a provable speed advantage for learning high-frequency signals.

Extrapolation

Standard NNs: Linear Extrapolation

For any \(N\)-layer ReLU network trained with squared loss in the NTK regime, the output along any extrapolation direction \(\mathbf{v}\) is a linear function of the extrapolation step \(h\) [16].

Polynomial NNs: Polynomial Extrapolation

For an \(N\)-degree Π-Net in the NTK regime, the output along any extrapolation direction is a polynomial of degree \(\leq N\) in \(h\). Under appropriate training set conditions, exact polynomial extrapolation is achievable (for degree 2) [16].


Section 5

PyTorch Implementations


CCP Layer

Python · PyTorch — CCP Model
import torch
import torch.nn as nn

class CCPLayer(nn.Module):
    """Coupled CP-Decomposition Polynomial Layer.

    Computes the CCP model (Chrysos et al., CVPR 2020):
        x_1   = U_1^T z
        x_n   = (U_n^T z) * x_{n-1} + x_{n-1},  n = 2,...,k
        f(z)  = Q @ x_k + psi

    * is the Hadamard (element-wise) product.
    The augmented input z = [zeta^T, 1]^T absorbs bias terms.
    """
    def __init__(self, in_dim: int, hidden_dim: int,
                 out_dim: int, degree: int = 4):
        super().__init__()
        # Projections act on the augmented input z = [zeta, 1],
        # hence the in_dim + 1 input features.
        self.projections = nn.ModuleList([
            nn.Linear(in_dim + 1, hidden_dim, bias=False)
            for _ in range(degree)
        ])
        self.readout = nn.Linear(hidden_dim, out_dim)
        self._init()

    def _init(self):
        for layer in self.projections:
            nn.init.xavier_normal_(layer.weight, gain=0.5)

    def forward(self, zeta: torch.Tensor) -> torch.Tensor:
        # Augment input: z = [zeta, 1]
        ones = torch.ones(zeta.shape[0], 1, device=zeta.device)
        z = torch.cat([zeta, ones], dim=1)

        x = self.projections[0](z)          # x_1 = U_1^T z
        for proj in self.projections[1:]:
            Uz = proj(z)                    # U_n^T z
            x = Uz * x + x               # ∘ (Hadamard) + skip

        return self.readout(x)             # Q @ x_k + bias

NCP Layer

Python · PyTorch — NCP Model
class NCPLayer(nn.Module):
    """Nested Coupled CP-Decomposition Polynomial Layer.

    Computes the NCP model (Chrysos et al., CVPR 2020):
        x_1 = (A_1^T z) * (B_1^T b_1)
        x_n = (A_n^T z) * (S_n^T x_{n-1} + B_n^T b_n), n >= 2
        f(z) = Q @ x_k + psi

    The mixing matrices S_n turn the affine update S x + b into a
    polynomial one by Hadamard-multiplying with a fresh projection of z.
    Each product B_n^T b_n is a learnable vector: b_1 is a free
    parameter, and for n >= 2 the bias of the Linear layer S_n plays
    that role.
    """
    def __init__(self, in_dim, hidden_dim, out_dim, degree=4):
        super().__init__()
        # Projections act on the augmented input z = [zeta, 1]
        self.A = nn.ModuleList([nn.Linear(in_dim + 1, hidden_dim, bias=False)
                                for _ in range(degree)])
        # S_n for n = 2..k; each Linear bias absorbs B_n^T b_n
        self.S = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim)
                                for _ in range(degree - 1)])
        self.b1 = nn.Parameter(torch.randn(hidden_dim))  # B_1^T b_1
        self.out = nn.Linear(hidden_dim, out_dim)        # Q, psi

    def forward(self, zeta):
        ones = torch.ones(zeta.shape[0], 1, device=zeta.device)
        z = torch.cat([zeta, ones], dim=1)
        x = self.A[0](z) * self.b1                 # x_1
        for n in range(1, len(self.A)):
            x = self.A[n](z) * self.S[n - 1](x)    # (A_n z) * (S_n x + b_n)
        return self.out(x)

References

  1. A.G. Ivakhnenko. "Polynomial theory of complex systems." IEEE Transactions on Systems, Man, and Cybernetics, 1971. SMC 1971
  2. Y. Shin and J. Ghosh. "The pi-sigma network: An efficient higher-order neural network for pattern classification and function approximation." In International Joint Conference on Neural Networks (IJCNN), 1991. IJCNN 1991
  3. G.G. Chrysos, S. Moschoglou, G. Bouritsas, Y. Panagakis, J. Deng, and S. Zafeiriou. "Π-nets: Deep polynomial neural networks." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Extended version: "Deep polynomial networks." IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2021. CVPR 2020 · T-PAMI 2021
    Code
  4. T. Karras, S. Laine, and T. Aila. "A style-based generator architecture for generative adversarial networks." In CVPR, 2019. CVPR 2019
    arXiv:1812.04948
  5. Z. Zhu, F. Latorre, G.G. Chrysos, and V. Cevher. "Controlling the complexity and Lipschitz constant improves polynomial nets." In International Conference on Learning Representations (ICLR), 2022. ICLR 2022
    OpenReview
  6. M. Choraria, L.T. Dadi, G.G. Chrysos, J. Mairal, and V. Cevher. "The spectral bias of polynomial neural networks." In International Conference on Learning Representations (ICLR), 2022. ICLR 2022
    OpenReview
  7. Y. Cheng, G.G. Chrysos, M. Georgopoulos, and V. Cevher. "Multilinear operator networks." In International Conference on Learning Representations (ICLR), 2024. ICLR 2024
    OpenReview
  8. Z. Brakerski, C. Gentry, and V. Vaikuntanathan. "(Leveled) fully homomorphic encryption without bootstrapping." ACM Transactions on Computation Theory (TOCT), 2014. TOCT 2014
  9. G.G. Chrysos, M. Georgopoulos, and Y. Panagakis. "Conditional generation using polynomial expansions." In Advances in Neural Information Processing Systems (NeurIPS), 2021. NeurIPS 2021
    Code
  10. G.G. Chrysos, M. Georgopoulos, J. Deng, J. Kossaifi, Y. Panagakis, and A. Anandkumar. "Augmenting deep classifiers with polynomial neural networks." In European Conference on Computer Vision (ECCV), 2022. ECCV 2022
  11. F. Babiloni, I. Marras, F. Kokkinos, J. Deng, G.G. Chrysos, and S. Zafeiriou. "Poly-NL: Linear complexity non-local layers with polynomials." In International Conference on Computer Vision (ICCV), 2021. ICCV 2021
  12. S.M. Jayakumar, W.M. Czarnecki, J. Menick, J. Schwarz, J. Rae, S. Osindero, Y.W. Teh, T. Harley, and R. Pascanu. "Multiplicative interactions and where to find them." In International Conference on Learning Representations (ICLR), 2020. ICLR 2020
    OpenReview
  13. V.I. Slyusar. "A family of face products of matrices and its properties." Cybernetics and Systems Analysis, 1999. Cybernetics 1999
  14. T.G. Kolda and B.W. Bader. "Tensor decompositions and applications." SIAM Review, 2009. SIAM Review 2009
  15. N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. "On the spectral bias of neural networks." In International Conference on Machine Learning (ICML), 2019. ICML 2019
    arXiv:1806.08734
  16. Y. Wu, Z. Zhu, F. Liu, G.G. Chrysos, and V. Cevher. "Extrapolation and spectral bias of neural nets with Hadamard product: a polynomial net study." In Advances in Neural Information Processing Systems (NeurIPS), 2022. NeurIPS 2022
  17. G.G. Chrysos, B. Wang, J. Deng, and V. Cevher. "Regularization of polynomial networks for image recognition." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. CVPR 2023
    Code
  18. J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. "Efficient projections onto the \(\ell_1\)-ball for learning in high dimensions." In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008. ICML 2008
  19. P.L. Bartlett and S. Mendelson. "Rademacher and Gaussian complexities: Risk bounds and structural results." Journal of Machine Learning Research, 2002. JMLR 2002
  20. A. Jacot, F. Gabriel, and C. Hongler. "Neural tangent kernel: Convergence and generalization in neural networks." In NeurIPS, 2018. NeurIPS 2018
    arXiv:1806.07572
  21. Y. Panagakis, J. Kossaifi, G.G. Chrysos, J. Oldfield, M.A. Nicolaou, A. Anandkumar, and S. Zafeiriou. "Tensor methods in computer vision and deep learning." Proceedings of the IEEE, 2021. IEEE 2021
  22. G.G. Chrysos, Y. Wu, R. Pascanu, P. Torr, and V. Cevher. "Hadamard product in deep learning: introduction, advances, and challenges." IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2025. TPAMI 2025
  23. C. Fronk and L. Petzold. "Interpretable polynomial neural ordinary differential equations." Chaos: An Interdisciplinary Journal of Nonlinear Science, 2023. Chaos 2023
  24. A. Dubey, F. Radenovic, and D. Mahajan. "Scalable Interpretability via Polynomials." In Advances in Neural Information Processing Systems (NeurIPS), 2022. NeurIPS 2022
  25. Y. Wu, G.G. Chrysos, and V. Cevher. "Adversarial audio synthesis with complex-valued polynomial networks." In ICML Workshop on New Frontiers in Adversarial Machine Learning, 2022. ICML Workshop 2022
  26. F. Fan, H. Dong, Z. Wu, L. Ruan, T. Zeng, Y. Cui, and J. Liao. "One neuron saved is one neuron earned: On parametric efficiency of quadratic networks." IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2025. TPAMI 2025
  27. K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. CVPR 2016
    arXiv:1512.03385