[Submitted on 1 Nov 2025]
Understanding the Limits of Gated Feedforward Modifications
Abstract: This paper presents a comprehensive empirical study of modifications to SwiGLU-based transformer feedforward networks. Through rigorous experimentation on the FineWeb dataset using a 134M-parameter Qwen-style architecture, we evaluate four variants, including polynomial expansions and normalization schemes. Our stabilized SwiGLU with LayerNorm achieved comparable performance (validation loss 4.951 vs. 4.9266 baseline) while demonstrating improved training stability, evidenced by 18% lower loss variance across runs. Surprisingly, more complex modifications underperformed, with adaptive polynomial variants showing 15-20% higher loss. We provide a detailed failure analysis of these approaches, examining gradient norms, parameter sensitivity, and layer-wise activation patterns. The results highlight the robustness of the baseline SwiGLU and suggest that careful consideration is needed when attempting architectural innovations in feedforward networks.
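As a rough illustration of the kind of modification the abstract describes, the following PyTorch sketch contrasts a baseline SwiGLU feedforward block with a LayerNorm-stabilized variant. This is a minimal sketch under assumptions: the class names, dimensions, and the exact placement of the normalization (here, applied to the gated hidden activations before the down projection) are illustrative, not the authors' implementation.

```python
# Minimal sketch of a SwiGLU FFN and a hypothetical LayerNorm-stabilized
# variant; names and normalization placement are assumptions, not the
# paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Baseline SwiGLU FFN: W_down(SiLU(W_gate x) * W_up x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class StabilizedSwiGLU(SwiGLU):
    """Variant with LayerNorm on the gated hidden activations."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__(d_model, d_ff)
        self.norm = nn.LayerNorm(d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = F.silu(self.w_gate(x)) * self.w_up(x)
        return self.w_down(self.norm(hidden))

if __name__ == "__main__":
    ffn = StabilizedSwiGLU(d_model=768, d_ff=2048)
    out = ffn(torch.randn(2, 16, 768))
    print(out.shape)  # torch.Size([2, 16, 768])
```

The extra LayerNorm adds only O(d_ff) parameters per block, which is consistent with the abstract's framing of a near-baseline variant whose benefit is training stability rather than lower loss.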
Submission history
[v1] Sat, 1 Nov 2025 20:40 UTC