[Submitted on 2 Nov 2025]
Rethinking Feedforward Network Design: When Simplicity Meets Performance
Abstract: Recent transformer architectures increasingly employ complex gating mechanisms in their feedforward networks. We demonstrate that carefully designed simple architectures can achieve comparable performance. Through systematic experiments with a 134M-parameter model on the FineWeb dataset, we show that our simplified feedforward network reaches a validation loss of 4.940 versus 4.927 for SwiGLU (a gap of 0.013), while using 20% less memory and 15% fewer FLOPs. The key to our approach is the combination of optimized initialization schemes and learned residual scaling, which together compensate for the architectural simplicity. Our results suggest that, for many applications, the benefits of complex gating mechanisms may not justify their computational overhead.
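To make the comparison in the abstract concrete, the following is a minimal sketch of the two block types under stated assumptions; it is not the authors' implementation. The abstract does not specify the activation, the hidden widths, or how the residual scale is parameterized, so the GELU activation, the scalar `alpha`, and every function name here are hypothetical. The weight count only illustrates why a gated block needs a third projection at equal hidden width and does not reproduce the reported 20% and 15% figures.

```python
import math

def gelu(x):
    # tanh approximation of GELU (assumed activation; not stated in the abstract)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    # SiLU / swish, the activation used inside SwiGLU
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def plain_ffn(x, W_in, W_out, alpha):
    # Ungated block: x + alpha * W_out @ gelu(W_in @ x)
    # alpha stands in for a learned residual scale (hypothetical scalar form)
    h = [gelu(v) for v in matvec(W_in, x)]
    y = matvec(W_out, h)
    return [xi + alpha * yi for xi, yi in zip(x, y)]

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated block: x + W_down @ (silu(W_gate @ x) * (W_up @ x))
    g = [silu(v) for v in matvec(W_gate, x)]
    u = matvec(W_up, x)
    y = matvec(W_down, [gi * ui for gi, ui in zip(g, u)])
    return [xi + yi for xi, yi in zip(x, y)]

def ffn_weight_count(d_model, d_hidden, gated):
    # A gated block has three projections, an ungated block has two (biases ignored)
    return (3 if gated else 2) * d_model * d_hidden
```

With `alpha` set to zero the ungated block reduces to the identity, which is one reason a learned residual scale can ease optimization early in training; whether the paper initializes it that way is not stated in the abstract.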
Submission history
[v1] Sun, 2 Nov 2025 11:01 UTC