[Submitted on 30 Oct 2025]
Analysis of Adaptive Frequency Scaling in Transformer Attention Mechanisms
Abstract: We present a comprehensive study of adaptive frequency scaling in transformer attention mechanisms, focusing on modifications to rotary positional embeddings (RoPE). Our method introduces learnable, input-dependent frequency scaling factors through a gating network while maintaining the computational efficiency of standard attention. Through extensive experiments on the FineWeb dataset using Qwen architectures, we demonstrate that this approach underperforms the baseline (validation loss 5.100 vs. 4.927). We provide a detailed analysis of the failure modes, including visualization of the learned scaling patterns and attention head behavior. While the approach is theoretically appealing, our results suggest that simple frequency adaptation may not be sufficient to improve upon standard RoPE, and we discuss implications for future work on dynamic positional encoding schemes.
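To make the idea concrete, here is a minimal sketch of input-dependent frequency scaling applied to RoPE. The paper's actual gating architecture is not specified in the abstract, so `gate_scales` (and its parameters `W`, `b`) is a hypothetical stand-in: it maps a pooled input summary through a sigmoid to produce one scaling factor per RoPE frequency, which then multiplies the standard inverse frequencies before the rotation is applied.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=None):
    """Rotation angles for RoPE; `scale` optionally rescales each frequency."""
    # Standard RoPE inverse frequencies, one per (even, odd) dimension pair
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    if scale is not None:
        inv_freq = inv_freq * scale                       # adaptive per-frequency scaling
    return np.outer(positions, inv_freq)                  # (seq, dim/2)

def apply_rope(x, angles):
    """Rotate each (even, odd) pair of x by its position-dependent angle."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def gate_scales(x, W, b):
    """Hypothetical gating network: pooled input -> sigmoid -> per-frequency scales."""
    h = x.mean(axis=0) @ W + b                            # (dim/2,)
    return 1.0 / (1.0 + np.exp(-h))                       # scales in (0, 1)

# Usage sketch: scale the RoPE frequencies of a query/key block
rng = np.random.default_rng(0)
seq_len, dim = 8, 16
x = rng.standard_normal((seq_len, dim))
W = 0.1 * rng.standard_normal((dim, dim // 2))
b = np.zeros(dim // 2)

scales = gate_scales(x, W, b)                             # input-dependent factors
angles = rope_angles(np.arange(seq_len), dim, scale=scales)
x_rotated = apply_rope(x, angles)
```

With `scale=None` (or all-ones scales) this reduces exactly to standard RoPE, so the gate can only modulate, not replace, the baseline encoding; since pairwise rotations are norm-preserving, the scaling changes relative phases between positions without altering vector magnitudes.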
Submission history
[v1] Thu, 30 Oct 2025 08:48 UTC