[Submitted on 27 Oct 2025]
Rethinking Transformer Feedforward Networks: Lessons from Sparse-Dense Pathway Exploration
Abstract: This paper presents a systematic investigation of sparse-dense pathway architectures for transformer feedforward networks (FFNs). Through extensive ablation studies and full-scale experiments, we demonstrate that while dual-path approaches show initial promise in reduced-scale settings (validation loss of 5.646 vs. 5.660 for the baseline), they fail to maintain this advantage at full scale (4.949 vs. 4.927 for the baseline). We analyze this scaling behavior through detailed architectural diagnostics, revealing fundamental limitations in pathway interference and gradient flow. The work provides valuable negative results for the field, suggesting that future FFN innovations may require more sophisticated approaches to pathway specialization and interaction.
Submission history
[v1] Mon, 27 Oct 2025 14:02 UTC