[Submitted on 26 Oct 2025]
Dynamic Sparse Gating: A Learned Approach to Feedforward Adaptation in Transformers
Abstract: This paper presents Dynamic Sparse Gating (DSG), a novel approach to Transformer feedforward layers that combines learned sparsity patterns with input-dependent dynamic modulation. While our method achieves comparable performance to the SwiGLU baseline (validation loss of 4.935 vs 4.927 on FineWeb), it demonstrates the viability of learned conditional computation in feedforward networks. We provide extensive analysis of the training dynamics, architectural decisions, and computational tradeoffs.
Submission history
[v1] Sun, 26 Oct 2025 14:18 UTC