[Submitted on 22 Oct 2025]
Dynamic Sparse Multi-Branch Feedforward Networks for Transformer Architectures
Abstract: We introduce Dynamic Sparse Multi-Branch Feedforward Networks (DSMFN), a novel approach to transformer feedforward layers that combines multiple parallel branches with dynamic gating and learned sparsity patterns. Our method achieves a validation loss of 4.883 on the FineWeb benchmark, outperforming the SwiGLU baseline (4.9266) while maintaining comparable computational efficiency. Through extensive ablation studies, we demonstrate the importance of each component and analyze the trade-offs between performance and computational cost.
Submission history
[v1] Wed, 22 Oct 2025 07:29 UTC