[Submitted on 1 Nov 2025]
Rethinking Polynomial Activations in Transformer Feedforward Networks: A Systematic Study
Abstract: This paper presents a systematic investigation of polynomial mixing in transformer feedforward networks (FFNs). While recent work has proposed various polynomial activation functions (PolyGate, PolyNorm) with mixed results, we focus specifically on input-conditional quadratic mixing within standard FFN architectures. Through extensive experiments on the FineWeb dataset with a 134M-parameter model, we show that our quadratic-mixing implementation reaches a validation loss of 4.98, underperforming the SwiGLU baseline (4.9266). Detailed analysis reveals that, although the method provides modest early-training benefits, it introduces optimization challenges that outweigh its theoretical advantages. Our work offers insights into the limitations of polynomial expansions in transformer FFNs and suggests directions for future research.
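The abstract does not spell out the formulation, so the sketch below shows one plausible reading of "input-conditional quadratic mixing": a learned, input-dependent gate alpha(x) blends the FFN's hidden activations with their elementwise square, shown next to the SwiGLU baseline for comparison. The module names, the sigmoid gate parameterization, and the dimensions are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class QuadraticMixFFN(nn.Module):
    """Hypothetical input-conditional quadratic-mixing FFN.

    Assumed form: a per-channel gate alpha(x) mixes the linear hidden
    term h with its elementwise square h*h (a quadratic expansion).
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # produces alpha(x)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up(x)
        alpha = torch.sigmoid(self.gate(x))        # input-conditional mixing weights
        mixed = alpha * h + (1.0 - alpha) * h * h  # blend linear and quadratic terms
        return self.down(mixed)

class SwiGLUFFN(nn.Module):
    """Standard SwiGLU FFN used as the baseline (Shazeer, 2020)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

if __name__ == "__main__":
    x = torch.randn(2, 16, 512)  # (batch, seq, d_model)
    ffn = QuadraticMixFFN(d_model=512, d_hidden=2048)
    print(ffn(x).shape)          # torch.Size([2, 16, 512])

Note that the quadratic term h*h can grow faster than a gated linear path under the same learning rate, which is consistent with the optimization difficulties the abstract reports, though the exact cause in the paper may differ.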
Submission history
[v1] Sat, 1 Nov 2025 13:44 UTC