[Submitted on 20 Oct 2025]
Dual-Gated Feedforward Networks: Enhancing Transformer Feedforward Layers through Parallel Gating
Abstract: The feedforward layer is a critical component of Transformer architectures, yet its design has remained relatively unchanged since the introduction of Gated Linear Unit (GLU) variants. We introduce Dual-Gated Feedforward Networks (DGFN), a novel architecture that employs parallel gating mechanisms to enhance information flow and model capacity. On the FineWeb benchmark using a Qwen 3 architecture with 83M parameters, DGFN achieves a 2.7\% improvement in validation perplexity over a standard SwiGLU baseline, the strongest result among the feedforward designs we consider. Ablation studies indicate that the second gating path, the intermediate normalization, and a learned combination coefficient are all important. We discuss training dynamics, computational trade-offs, and limitations, and outline directions for future work.
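The abstract does not give the exact formulation, but the named components (two parallel gating paths, intermediate normalization, a learned combination coefficient) suggest a structure along the following lines. This is a minimal sketch, assuming SwiGLU-style gates on each path, RMSNorm as the intermediate normalization, and a single learned scalar alpha blending the two paths; all names and details here are assumptions, not the paper's definition.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DGFN(nn.Module):
    """Hypothetical dual-gated feedforward block (sketch, not the paper's exact design)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Path 1: SwiGLU-style gate and value projections.
        self.gate1 = nn.Linear(d_model, d_ff, bias=False)
        self.up1 = nn.Linear(d_model, d_ff, bias=False)
        # Path 2: a second, parallel gating path.
        self.gate2 = nn.Linear(d_model, d_ff, bias=False)
        self.up2 = nn.Linear(d_model, d_ff, bias=False)
        # Intermediate normalization on each path (assumed RMSNorm, requires PyTorch >= 2.4).
        self.norm1 = nn.RMSNorm(d_ff)
        self.norm2 = nn.RMSNorm(d_ff)
        # Learned scalar coefficient combining the two paths.
        self.alpha = nn.Parameter(torch.tensor(0.5))
        # Shared down-projection back to the model dimension.
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.norm1(F.silu(self.gate1(x)) * self.up1(x))
        h2 = self.norm2(F.silu(self.gate2(x)) * self.up2(x))
        # Convex-style blend of the two gated paths; the paper may combine them differently.
        return self.down(self.alpha * h1 + (1.0 - self.alpha) * h2)

Relative to a single SwiGLU block, this adds one extra gate/value projection pair and two normalizations, which is consistent with the computational trade-offs the abstract mentions.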
Submission history
[v1] Mon, 20 Oct 2025 04:06 UTC