[Submitted on 2 Nov 2025]
Improving Transformer Feedforward Layers with Temperature-Scaled GEGLU: An Empirical Study
Abstract: We present a systematic study of Temperature-Scaled GEGLU (TS-GEGLU), a variant of the Gated Linear Unit that incorporates learned temperature scaling and output range adaptation. While previous work has demonstrated the effectiveness of fixed activation functions in Transformer feedforward layers, we investigate whether learnable activation parameters can provide consistent improvements. Through extensive experiments on language modeling with the 134M parameter Qwen architecture on FineWeb, we find that TS-GEGLU achieves performance (validation loss 4.949) comparable to the SwiGLU baseline (4.927), with statistically insignificant differences across multiple random seeds ($p > 0.1$). Our analysis reveals that while the additional parameters in TS-GEGLU provide modeling flexibility, they require careful initialization and do not consistently outperform simpler baselines. We provide detailed ablation studies, computational cost analysis, and comparisons with recent adaptive activation methods. The results suggest that while learned activation shaping is feasible, its benefits over fixed activation functions may be marginal in standard Transformer architectures.
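The abstract does not specify the exact formulation of TS-GEGLU. As a rough illustration only, the sketch below shows one plausible reading: a GEGLU feedforward block with a learned temperature applied inside the GELU gate and a learned output-range scale on the gated product. The module name, parameter names (`log_tau`, `gamma`), and initialization are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSGEGLU(nn.Module):
    """Hypothetical sketch of a Temperature-Scaled GEGLU feedforward block.

    Assumes (not confirmed by the paper) a learned temperature tau inside the
    GELU gate and a learned output-range scale gamma on the gated product,
    both initialized so the block starts out equivalent to plain GEGLU.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        # Learned activation-shaping parameters (assumed form).
        self.log_tau = nn.Parameter(torch.zeros(1))  # tau = exp(0) = 1 at init
        self.gamma = nn.Parameter(torch.ones(1))     # output scale = 1 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tau = self.log_tau.exp()
        # Temperature-scaled GELU gate, multiplied elementwise with the up projection.
        gate = F.gelu(self.w_gate(x) / tau)
        return self.w_down(self.gamma * gate * self.w_up(x))
```

For comparison, a standard SwiGLU baseline uses `F.silu(self.w_gate(x))` as the gate with no learned temperature or output scale, which is consistent with the paper's finding that the extra parameters add flexibility but little measured benefit.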
Submission history
[v1] Sun, 2 Nov 2025 05:51 UTC