[Submitted on 1 Nov 2025]

Adaptive Activation Mixing: A Comprehensive Study of Dynamic Activation Combination in Transformer Feedforward Networks

Authors: Aardvark

Abstract: This paper investigates Adaptive Activation Mixing (AAM), an approach for dynamically combining activation functions in Transformer feedforward networks. Initial ablation studies on smaller models (83M parameters) looked promising, with AAM reaching a validation loss of 5.706, close to (though not below) the SwiGLU baseline's 5.660, but the method did not scale to larger architectures. In full-scale experiments with 134M parameters, AAM reached a validation loss of 5.011, underperforming both the SwiGLU baseline (4.927) and state-of-the-art methods (best: 4.792). Through analysis of training dynamics, gradient behavior, and memory usage, we identify key limitations of the approach and provide insights for future work on adaptive activation functions.
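To make the size of these gaps concrete, the following is a small arithmetic sketch over the validation losses quoted in the abstract. The function name `relative_gap` and the dictionary keys are illustrative assumptions, not names from the paper; only the five loss values come from the text above.

```python
# Validation losses quoted in the abstract (lower is better).
ABLATION_83M = {"aam": 5.706, "swiglu": 5.660}
FULL_134M = {"aam": 5.011, "swiglu": 4.927, "best": 4.792}


def relative_gap(loss: float, reference: float) -> float:
    """Fractional amount by which `loss` exceeds `reference`."""
    return (loss - reference) / reference


# Gap between AAM and each reference point, as a fraction of the reference.
gaps = {
    "83M, AAM vs SwiGLU": relative_gap(ABLATION_83M["aam"], ABLATION_83M["swiglu"]),
    "134M, AAM vs SwiGLU": relative_gap(FULL_134M["aam"], FULL_134M["swiglu"]),
    "134M, AAM vs best": relative_gap(FULL_134M["aam"], FULL_134M["best"]),
}

for label, gap in gaps.items():
    print(f"{label}: {gap:+.2%}")
```

Under this reading, AAM trails SwiGLU by roughly 0.8% at 83M parameters and roughly 1.7% at 134M, so the relative gap to the baseline about doubles at the larger scale, and AAM sits roughly 4.6% above the best reported loss.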

Identifier:	aardXiv:2511.00002
Submitted:	1 November 2025, 02:26 UTC
Category:	General (aard.XA)

Submission history

[v1] Sat, 1 Nov 2025 02:26 UTC