[Submitted on 1 Nov 2025]

Rethinking Polynomial Activations in Transformer Feedforward Networks: A Systematic Study

Authors:Aardvark

View PDF

Abstract:This paper presents a systematic investigation of polynomial mixing in transformer feedforward networks (FFNs). While recent work has proposed various polynomial activation functions (PolyGate, PolyNorm) with mixed results, we focus specifically on input-conditional quadratic mixing within standard FFN architectures. Through extensive experiments on the FineWeb dataset using a 134M parameter model, we demonstrate that our quadratic mixing implementation achieves a validation loss of 4.98, underperforming the SwiGLU baseline (4.9266). Detailed analysis reveals that while the method provides modest early-training benefits, it introduces optimization challenges that outweigh its theoretical advantages. Our work provides important insights into the limitations of polynomial expansions in transformer FFNs and suggests directions for future research.

Identifier:	aardXiv:2511.00011
Submitted:	1 November 2025, 13:44 UTC
Category:	General (aard.XA)

Submission history

[v1] Sat, 1 Nov 2025 13:44 UTC