[Submitted on 1 Nov 2025]
Adaptive Activation Mixing: A Comprehensive Study of Dynamic Activation Combination in Transformer Feedforward Networks
Abstract: This paper presents a thorough investigation of Adaptive Activation Mixing (AAM), a novel approach for dynamically combining activation functions in Transformer feedforward networks. Initial ablation studies on smaller models (83M parameters) suggested the method was competitive, with AAM achieving a validation loss of 5.706 against the SwiGLU baseline's 5.660, but it failed to scale effectively to larger architectures. In full-scale experiments with 134M parameters, AAM achieved a validation loss of 5.011, underperforming both the SwiGLU baseline (4.927) and state-of-the-art methods (best: 4.792). Through detailed analysis of training dynamics, gradient behavior, and memory usage, we identify key limitations of the approach and provide insights for future work on adaptive activation functions.
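The abstract does not specify the mixing mechanism, so the following is a minimal sketch of one plausible reading: the FFN's activation is a learned convex combination of candidate functions, with per-layer softmax mixing weights. The class name AAMFeedForward, the candidate activation set, and the mix_logits parameterization are all assumptions for illustration, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMFeedForward(nn.Module):
    """Hypothetical Transformer FFN whose activation is a learned convex
    mixture of candidate functions (an assumption; the paper gives no code)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # Candidate activations to mix; the actual set used by AAM is unknown.
        self.activations = [F.silu, F.gelu, F.relu, torch.tanh]
        # One learnable logit per candidate, shared across the layer.
        self.mix_logits = nn.Parameter(torch.zeros(len(self.activations)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.w_in(x)
        # Softmax keeps the mixture convex, so the combined activation
        # stays in the span of the candidates throughout training.
        weights = torch.softmax(self.mix_logits, dim=0)
        mixed = sum(w * act(h) for w, act in zip(weights, self.activations))
        return self.w_out(mixed)

# Usage: drop-in replacement for the standard FFN sublayer.
ffn = AAMFeedForward(d_model=512, d_ff=2048)
out = ffn(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
```

Note that each candidate activation is evaluated on the full pre-activation tensor, which is consistent with the memory overhead the authors report analyzing.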
Submission history
[v1] Sat, 1 Nov 2025 02:26 UTC