[Submitted on 2 Nov 2025]
SparseGLU: A Study of Dynamic Neuron Selection in Transformer Feedforward Networks
Abstract: We present a comprehensive investigation of SparseGLU, an approach to feedforward networks that dynamically selects neurons through a learned predictor. While input-dependent sparsity is theoretically appealing for efficiency, our experiments reveal significant implementation challenges. On the FineWeb dataset with a 134M-parameter Qwen model, SparseGLU reached a validation loss of 5.02, compared to 4.9266 for the SwiGLU baseline. We analyze the failure modes, including gradient-flow issues caused by hard masking and the limitations of our predictor architecture. While not practically viable in its current form, this work provides insight into the difficulties of implementing sparse activation in feedforward networks and suggests directions for future research.
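As a rough illustration of the mechanism the abstract describes, the sketch below implements a SparseGLU-style feedforward block in plain numpy: a learned predictor scores the hidden neurons for each token, a hard top-k mask keeps only the highest-scoring ones, and the gated (SwiGLU-like) activation is computed under that mask. All names, shapes, and the top-k selection rule are assumptions for illustration, not the paper's actual implementation; the hard mask is also the non-differentiable step behind the gradient-flow issues the abstract mentions.

```python
import numpy as np

# Hypothetical minimal sketch of a SparseGLU-style feedforward block.
# Shapes, weight names, and the top-k rule are illustrative assumptions.
rng = np.random.default_rng(0)

d_model, d_ff, k = 16, 64, 8  # k = neurons kept per token (assumed)

W_gate = rng.standard_normal((d_model, d_ff)) * 0.1
W_up   = rng.standard_normal((d_model, d_ff)) * 0.1
W_down = rng.standard_normal((d_ff, d_model)) * 0.1
W_pred = rng.standard_normal((d_model, d_ff)) * 0.1  # learned sparsity predictor

def swish(x):
    # SiLU/Swish activation used in SwiGLU-style gates
    return x / (1.0 + np.exp(-x))

def sparseglu_forward(x):
    """x: (batch, d_model) -> (batch, d_model).

    The predictor scores all d_ff neurons; a hard top-k mask zeroes the
    rest. The thresholding step is non-differentiable, which is one
    source of the gradient-flow problems discussed in the paper.
    """
    scores = x @ W_pred                              # (batch, d_ff)
    thresh = np.sort(scores, axis=-1)[:, -k][:, None]
    mask = (scores >= thresh).astype(x.dtype)        # hard 0/1 mask
    h = swish(x @ W_gate) * (x @ W_up) * mask        # gated, masked hidden
    return h @ W_down

x = rng.standard_normal((4, d_model))
y = sparseglu_forward(x)
```

In an efficient implementation the masked matmuls would be computed only over the selected neurons; the dense multiply-then-mask form here just makes the selection logic explicit.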
Submission history
[v1] Sun, 2 Nov 2025 12:06 UTC