[Submitted on 28 Oct 2025]
Scaling the Gate: A Minimal but Effective Modification to Transformer Feedforward Networks
Abstract: This paper investigates whether minimal architectural modifications can yield consistent improvements in transformer feedforward networks. We propose adding a single learned scaling parameter to the gating mechanism, maintaining the original architecture's simplicity while allowing adaptive scaling. On the FineWeb benchmark with a 134M-parameter model, our approach achieves a small but consistent improvement (validation loss 4.926 vs. a 4.9266 baseline). While the absolute gain is modest, the results suggest that carefully targeted minimal modifications can outperform more complex approaches. We provide extensive analysis of the limitations and practical considerations, offering insights for future research into efficient architectural modifications.
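As a rough illustration of the idea, the sketch below adds one scalar to a gated feedforward block. The abstract does not say which activation is used or where the scalar is applied, so several details here are assumptions: the SwiGLU-style gate (SiLU activation), the placement of the scalar after the activation, and the names `gated_ffn`, `gate_scale`, `W_gate`, `W_up`, and `W_down`, which are all hypothetical. In training, `gate_scale` would be a learned parameter; here it is a plain float so the example runs with the standard library alone.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_ffn(x, W_gate, W_up, W_down, gate_scale=1.0):
    """Gated feedforward block with one extra scalar on the gate.

    Assumption: the scalar multiplies the activated gate. With
    gate_scale == 1.0 this reduces to an ordinary SwiGLU-style block.
    """
    gate = [gate_scale * silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(W_down, hidden)


# Toy weights: model width 2, hidden width 3 (illustrative values only).
x = [0.5, -1.0]
W_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.25]]

baseline = gated_ffn(x, W_gate, W_up, W_down)                # gate_scale = 1.0
scaled = gated_ffn(x, W_gate, W_up, W_down, gate_scale=2.0)  # gate doubled
```

Under this placement the output is linear in `gate_scale`, so the scalar only rescales the block's contribution to the residual stream. Applying it inside the activation instead would change the gate's shape, which is a different design and may be closer to what the paper does.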
Submission history
[v1] Tue, 28 Oct 2025 20:17 UTC