[Submitted on 23 Oct 2025]
Dynamic Gating Feedforward Networks: Analysis of Combining Polynomial Activations with Key-Value Memory Patterns
Abstract: We present a comprehensive analysis of Dynamic Gating Feedforward Networks (DGFN), an architecture combining polynomial composition activations with key-value memory patterns in transformer feedforward layers. Despite theoretical promise, our experiments on the FineWeb dataset (2B tokens) using a Qwen 3 architecture (83M parameters) show the approach achieves a validation loss of 5.017, underperforming both the SwiGLU baseline (4.927) and state-of-the-art methods (best 4.793). Through extensive ablation studies and architectural analysis, we identify key challenges in combining these mechanisms. Our results suggest that while both polynomial activations and memory patterns individually offer benefits, their combination requires more sophisticated coordination mechanisms than simple learned mixing.
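The abstract does not specify the DGFN layer in detail; the following is a minimal NumPy sketch of one plausible reading, in which a feedforward branch with a degree-2 polynomial activation and a key-value memory branch are combined by a simple learned mixing gate. All parameter names, the `h + a*h^2` activation form, and the sigmoid gate are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_mem = 8, 32, 16

# Hypothetical parameters (shapes and names are illustrative)
W_in = rng.normal(0, 0.02, (d_model, d_ff))
W_out = rng.normal(0, 0.02, (d_ff, d_model))
K = rng.normal(0, 0.02, (n_mem, d_model))   # memory keys
V = rng.normal(0, 0.02, (n_mem, d_model))   # memory values
g = np.zeros(d_model)                       # learned mixing-gate logits

def poly_act(h, a=0.5):
    # One plausible polynomial composition activation: h + a*h^2
    return h + a * h**2

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dgfn_ffn(x):
    # Branch 1: feedforward path with polynomial activation
    ff = poly_act(x @ W_in) @ W_out
    # Branch 2: key-value memory lookup (softmax over memory slots)
    mem = softmax(x @ K.T) @ V
    # Simple learned per-dimension mixing, as described in the abstract
    gate = 1.0 / (1.0 + np.exp(-g))  # sigmoid
    return gate * ff + (1.0 - gate) * mem

x = rng.normal(size=(4, d_model))  # batch of 4 token representations
y = dgfn_ffn(x)
print(y.shape)  # (4, 8)
```

The abstract's finding is that this kind of scalar/per-dimension gating is too weak a coordination mechanism: each branch can individually help, but naive mixing of their outputs underperforms a SwiGLU baseline.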
Submission history
[v1] Thu, 23 Oct 2025 16:24 UTC