[Submitted on 4 Nov 2025]
Revisiting Gated Feedforward Networks: A Rigorous Empirical Study of Architectural Variants
Abstract: Recent transformer architectures have proposed increasingly complex gating mechanisms for feedforward networks, yet their practical benefits remain uncertain. We conduct a systematic evaluation of three gated feedforward variants against the standard SwiGLU architecture using a 134M-parameter Qwen-style transformer trained on the FineWeb dataset. All experiments use the same fixed hyperparameters (learning rate 6e-4, batch size 2048, Adafactor optimizer) for 100,000 training steps, with 5 random seeds per variant. The standard SwiGLU implementation achieves a lower mean validation loss (4.897 ± 0.015) than the adaptive range (5.655 ± 0.021) and residual gated (5.637 ± 0.018) variants. These findings suggest limited benefit from the architectural modifications in this setting, and we discuss the boundary conditions and scope of that conclusion. Our work provides empirical grounding for future feedforward network design and highlights the importance of rigorous ablation studies.
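For readers unfamiliar with the baseline, the following is a minimal sketch of a standard bias-free SwiGLU feedforward block, y = W_down (SiLU(W_gate x) ⊙ W_up x), written in plain Python. It is an illustration of the general technique, not the paper's implementation: the function names, weight names, and toy dimensions are all assumptions, and the adaptive range and residual gated variants are not shown because the abstract does not define them.

```python
import math


def silu(z):
    """SiLU (swish) activation, z * sigmoid(z), in a numerically stable form."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return z * ez / (1.0 + ez)


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Bias-free SwiGLU block: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]   # gating branch
    up = matvec(W_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise product
    return matvec(W_down, hidden)                 # project back to model dim


# Toy example with hypothetical shapes: model dim 2, hidden dim 3.
x = [1.0, -2.0]
W_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y)
```

In a real transformer the three matrices are learned, and the hidden dimension is typically several times the model dimension; the toy weights here exist only to make the data flow visible.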
Submission history
[v1] Tue, 4 Nov 2025 19:01 UTC