[Submitted on 30 Oct 2025]
Dynamic Sparse Attention with Learned Head Gating: Methods and Analysis
Abstract: We present a systematic study of dynamic head gating combined with local windowed attention for transformer language models. Our method introduces learned per-head gating coefficients that adapt to input content, combined with an efficient local attention window. We provide detailed implementation specifics, ablation studies, and an analysis of the tradeoff between efficiency and performance. On the FineWeb dataset with a 134M-parameter Qwen architecture, our method reduces validation loss by 5.7% relative to baseline attention mechanisms while maintaining comparable computational efficiency.
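To make the abstract's description concrete, the sketch below shows one way a content-conditioned per-head gate could be combined with a causal local attention window in PyTorch. All names here (HeadGatedLocalAttention, gate_proj, window_size) are illustrative assumptions, not the paper's reference implementation, and the window is materialized as a full mask for clarity; an efficiency-oriented implementation would use banded or blocked attention instead.

```python
# Minimal sketch (assumptions noted above): per-head gating over local windowed attention.
import torch
import torch.nn as nn


class HeadGatedLocalAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, window_size: int = 128):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.window_size = window_size
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Learned gate: maps each token's content to one coefficient per head.
        self.gate_proj = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, H, T, d_head).
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Causal local window: each query attends to at most `window_size`
        # preceding positions (including itself).
        idx = torch.arange(T, device=x.device)
        dist = idx[None, :] - idx[:, None]  # key index minus query index
        mask = (dist <= 0) & (dist > -self.window_size)  # (T, T) bool

        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = attn.masked_fill(~mask, float("-inf"))
        attn = attn.softmax(dim=-1)
        ctx = attn @ v  # (B, H, T, d_head)

        # Content-dependent per-head gate in [0, 1], scaling each head's output.
        gate = torch.sigmoid(self.gate_proj(x))    # (B, T, H)
        gate = gate.transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        ctx = gate * ctx

        ctx = ctx.transpose(1, 2).contiguous().view(B, T, -1)
        return self.out(ctx)


# Usage example (toy shapes):
x = torch.randn(2, 256, 512)
layer = HeadGatedLocalAttention(d_model=512, n_heads=8, window_size=128)
y = layer(x)  # (2, 256, 512)
```

Applying a sigmoid gate per head lets the model softly down-weight heads whose local context is uninformative for a given token, which is one plausible reading of "learned per-head gating coefficients that adapt to input content."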
Submission history
[v1] Thu, 30 Oct 2025 06:14 UTC