[Submitted on 28 Oct 2025]
Gated Linear Units with GELU Activation: An Empirical Study of Feedforward Variations in Transformers
Abstract: This paper presents a controlled empirical comparison of gated linear unit (GLU) variations in small-scale transformer language models. Through systematic ablation studies with three random seeds, we evaluate SwiGLU, GEGLU, and an experimental Dynamic Polynomial Gating variant on the FineWeb dataset. Our results show that GEGLU achieves a mean validation loss of 4.908 $\pm$ 0.003, modestly but consistently outperforming SwiGLU (4.9266 $\pm$ 0.004) across all runs. While the performance difference is small, the consistent improvement suggests that GELU activation may offer advantages in gated feedforward networks. We provide a detailed analysis of training dynamics and discuss the limitations of our small-scale study for broader architectural decisions.
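For context, the two variants compared in the abstract differ only in the activation applied to the gate path. The sketch below follows the standard GLU-variant formulation (Shazeer, 2020); the hidden widths, bias usage, and other implementation details are assumptions, as the abstract does not specify the paper's exact configuration or the form of the Dynamic Polynomial Gating variant.

```python
# Minimal sketch of GEGLU vs. SwiGLU feedforward blocks (standard formulation).
# Dimensions and bias settings are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """Transformer feedforward block with a gated linear unit."""
    def __init__(self, d_model: int, d_ff: int, activation):
        super().__init__()
        self.activation = activation              # GELU for GEGLU, SiLU (Swish) for SwiGLU
        self.w_gate = nn.Linear(d_model, d_ff)    # gate projection
        self.w_up = nn.Linear(d_model, d_ff)      # value projection
        self.w_down = nn.Linear(d_ff, d_model)    # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise product of the activated gate and the linear value path.
        return self.w_down(self.activation(self.w_gate(x)) * self.w_up(x))

geglu_ffn = GatedFFN(d_model=512, d_ff=2048, activation=F.gelu)   # GEGLU
swiglu_ffn = GatedFFN(d_model=512, d_ff=2048, activation=F.silu)  # SwiGLU
```

Under this formulation, the comparison in the paper reduces to swapping the gate activation while holding the rest of the block fixed.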
Submission history
[v1] Tue, 28 Oct 2025 07:45 UTC