[Submitted on 29 Oct 2025]
Multi-Scale Gated Feedforward Networks: Enhancing Transformer Feedforward Layers Through Parallel Pathways and Spatial Gating
Abstract: We present Multi-Scale Gated Feedforward Networks (MSG-FFN), an enhanced feedforward architecture for transformers that combines multi-scale processing over parallel pathways with a spatial gating mechanism. Through experiments, ablation studies, and statistical analysis, we demonstrate consistent improvements over standard SwiGLU feedforward networks, achieving a 0.134 reduction in validation loss (4.792 vs. 4.9266) on a 134M-parameter model. Our analysis shows that combining parallel pathways with spatial gating provides synergistic benefits, particularly in later training stages. The architecture requires approximately 30% more memory, which we argue is a reasonable tradeoff given the performance gains.
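The abstract names two ingredients, parallel pathways and spatial gating, layered on a SwiGLU baseline, but does not specify how they are wired together. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `swiglu_ffn` follows the standard SwiGLU form, while `multi_scale_gated_ffn` is a hypothetical combination in which the pathways are SwiGLU blocks of different hidden widths whose outputs are averaged, and the spatial gate is a causal token-mixing gate. The pathway widths, the averaging, the sigmoid gate, and every function and parameter name are assumptions made for illustration.

```python
import math
import random


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Standard SwiGLU feedforward block: (silu(x W_gate) * (x W_up)) W_down.

    x is (seq_len, d_model); w_gate and w_up are (d_model, d_hidden);
    w_down is (d_hidden, d_model).
    """
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    hidden = [[silu(g) * u for g, u in zip(g_row, u_row)]
              for g_row, u_row in zip(gate, up)]
    return matmul(hidden, w_down)


def multi_scale_gated_ffn(x, pathways, w_spatial):
    """HYPOTHETICAL multi-scale gated block (an assumption, not the paper's design).

    pathways: list of (w_gate, w_up, w_down) tuples with different hidden widths.
    w_spatial: (seq_len, seq_len) token-mixing weights, masked here to be causal.
    """
    seq_len, d_model = len(x), len(x[0])
    # Assumption: each pathway is a SwiGLU block at its own hidden width.
    outs = [swiglu_ffn(x, *p) for p in pathways]
    # Assumption: pathway outputs are combined by a plain average.
    mixed = [[sum(o[t][d] for o in outs) / len(outs) for d in range(d_model)]
             for t in range(seq_len)]
    # Assumption: the spatial gate mixes information across token positions,
    # with a lower-triangular mask so position i only sees positions j <= i.
    causal = [[w_spatial[i][j] if j <= i else 0.0 for j in range(seq_len)]
              for i in range(seq_len)]
    gate = matmul(causal, mixed)
    return [[m * sigmoid(g) for m, g in zip(m_row, g_row)]
            for m_row, g_row in zip(mixed, gate)]


# Toy usage with arbitrary sizes chosen only for illustration.
rng = random.Random(0)
seq_len, d_model = 4, 8
x = rand_matrix(seq_len, d_model, rng, scale=1.0)
pathways = [
    (rand_matrix(d_model, h, rng), rand_matrix(d_model, h, rng), rand_matrix(h, d_model, rng))
    for h in (8, 16, 32)  # three hypothetical "scales"
]
w_spatial = rand_matrix(seq_len, seq_len, rng)
y = multi_scale_gated_ffn(x, pathways, w_spatial)
print(len(y), len(y[0]))
```

In this toy layout the extra pathways and the gate's intermediate activations add memory on top of a single SwiGLU block, which is at least consistent in spirit with the overhead the abstract reports. The sketch makes no attempt to reproduce the reported 30% figure.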
Submission history
[v1] Wed, 29 Oct 2025 12:30 UTC