[Submitted on 2 Nov 2025]
Systematic Evaluation of Gated Feedforward Architectures in Transformers
Abstract: This paper presents a comprehensive empirical evaluation of gated feedforward architectures in transformer models, focusing specifically on activation function choices within the gating mechanism. Through extensive ablation studies on the FineWeb dataset using a 134M-parameter Qwen-style transformer, we compare three architectural variants against the standard SwiGLU baseline. Our experiments include five independent runs per configuration, with detailed analysis of training dynamics, final performance, and computational efficiency. Results demonstrate that although complex gating mechanisms show theoretical promise, simpler GEGLU-style architectures achieve more reliable performance (validation loss 4.907 ± 0.012) and match the SwiGLU baseline (4.927 ± 0.015). We provide complete implementation details, hyperparameters, and failure analyses to support reproducible research in feedforward network design.
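For readers unfamiliar with the two gating styles named in the abstract, the following is a minimal sketch of a gated feedforward block in which only the gate activation changes. It assumes the commonly used definitions (SwiGLU gates with SiLU, GEGLU gates with exact GELU) and bias-free projections; the function names, weight layout, and toy dimensions are illustrative assumptions, not the paper's implementation.

```python
import math

def silu(x):
    # SiLU (swish): x * sigmoid(x); the gate activation assumed for SwiGLU.
    return x / (1.0 + math.exp(-x))

def gelu(x):
    # Exact GELU: x * Phi(x); the gate activation assumed for GEGLU.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(weights, x):
    # Multiply a row-major matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def gated_ffn(x, w_gate, w_up, w_down, act):
    # Hypothetical helper: out = W_down @ (act(W_gate @ x) * (W_up @ x)).
    gate = [act(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy weights chosen only to make the arithmetic easy to follow.
x = [1.0, 0.0]
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[2.0, 0.0], [0.0, 2.0]]
w_down = [[1.0, 0.0], [0.0, 1.0]]

swiglu_out = gated_ffn(x, w_gate, w_up, w_down, silu)
geglu_out = gated_ffn(x, w_gate, w_up, w_down, gelu)
```

Both variants share the same three projections and parameter count, so under these assumptions a comparison between them isolates the effect of the gate activation.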
Submission history
[v1] Sun, 2 Nov 2025 07:29 UTC