[Submitted on 30 Oct 2025]
Layer-Adaptive Feedforward Networks with Dynamic Scaling: A Systematic Study
Abstract: We present a systematic study of layer-adaptive feedforward networks in Transformers, examining three established techniques in combination: depth-dependent activations, input-dependent scaling, and learned sparsity. While each component has been explored individually in prior work, we provide the first comprehensive analysis of their combined effects. On the FineWeb benchmark using a 134M parameter Qwen 3 model, our approach shows a modest but consistent improvement (validation loss 4.910 vs. 4.927 baseline), with analysis suggesting these gains come primarily from the layer-adaptive components. We discuss the practical tradeoffs and limitations of this approach, particularly the diminishing returns relative to implementation complexity.
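To make the three components concrete, here is a minimal PyTorch sketch of one feedforward block that combines them. The specifics are illustrative assumptions, not the paper's design: the depth schedule (a SiLU-to-ReLU blend keyed to layer index), the form of the input-dependent gate (a per-token scalar), and the sparsity mechanism (soft-thresholding with a learned per-unit threshold) are all hypothetical choices standing in for whatever the paper actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerAdaptiveFFN(nn.Module):
    """Sketch of a feedforward block combining depth-dependent activation,
    input-dependent scaling, and learned sparsity. All specifics below are
    assumptions for illustration, not the paper's actual method."""

    def __init__(self, d_model: int, d_hidden: int, layer_idx: int, n_layers: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

        # Depth-dependent activation: a fixed blend coefficient derived
        # from layer depth (assumed schedule: 0 at the first layer, 1 at the last).
        self.alpha = layer_idx / max(n_layers - 1, 1)

        # Input-dependent scaling: a per-token scalar gate computed from
        # the block input (assumed form).
        self.gate = nn.Linear(d_model, 1)

        # Learned sparsity: a learnable per-unit threshold below which
        # hidden activations are shrunk to zero (assumed mechanism).
        self.threshold = nn.Parameter(torch.zeros(d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up(x)
        # Depth-dependent activation: early layers lean on SiLU,
        # later layers on ReLU, under this assumed schedule.
        h = (1.0 - self.alpha) * F.silu(h) + self.alpha * F.relu(h)
        # Learned sparsity: soft-threshold the hidden activations;
        # softplus keeps the effective threshold non-negative.
        h = torch.sign(h) * F.relu(h.abs() - F.softplus(self.threshold))
        out = self.down(h)
        # Input-dependent scaling of the block output.
        return torch.sigmoid(self.gate(x)) * out


if __name__ == "__main__":
    # Smoke test: one block at layer 3 of a hypothetical 12-layer model.
    ffn = LayerAdaptiveFFN(d_model=64, d_hidden=256, layer_idx=3, n_layers=12)
    y = ffn(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```

Under this sketch, dropping the gate and the threshold recovers a plain FFN with a depth-blended activation, which is one way to isolate the layer-adaptive components the abstract credits with most of the gain.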
Submission history
[v1] Thu, 30 Oct 2025 22:25 UTC