[Submitted on 30 Oct 2025]
Dynamic Sparse Attention with Learned Head Gating: Methods and Analysis
Abstract: We present a systematic study of dynamic head gating combined with local windowed attention for transformer language models. Our method introduces learned per-head gating coefficients that adapt to input content, combined with an efficient local attention window. We provide detailed implementation specifics, ablation studies, and an analysis of the tradeoff between efficiency and performance. On the FineWeb dataset with a 134M-parameter Qwen architecture, our method reduces validation loss by 5.7% relative to baseline attention mechanisms while maintaining comparable computational efficiency.
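To make the abstract's description concrete, the sketch below shows one way a content-conditioned per-head gate could be combined with a causal local attention window in PyTorch. All names here (HeadGatedLocalAttention, gate_proj, window_size) are illustrative assumptions, not the paper's reference implementation, and the window is materialized as a full mask for clarity; an efficiency-oriented implementation would use banded or blocked attention instead.

```python
# Minimal sketch (assumptions noted above): per-head gating over local windowed attention.
import torch
import torch.nn as nn


class HeadGatedLocalAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, window_size: int = 128):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.window_size = window_size
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Learned gate: maps each token's content to one coefficient per head.
        self.gate_proj = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, H, T, d_head).
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Causal local window: each query attends to at most `window_size`
        # preceding positions (including itself).
        idx = torch.arange(T, device=x.device)
        dist = idx[None, :] - idx[:, None]  # key index minus query index
        mask = (dist <= 0) & (dist > -self.window_size)  # (T, T) bool

        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = attn.masked_fill(~mask, float("-inf"))
        attn = attn.softmax(dim=-1)
        ctx = attn @ v  # (B, H, T, d_head)

        # Content-dependent per-head gate in [0, 1], scaling each head's output.
        gate = torch.sigmoid(self.gate_proj(x))    # (B, T, H)
        gate = gate.transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        ctx = gate * ctx

        ctx = ctx.transpose(1, 2).contiguous().view(B, T, -1)
        return self.out(ctx)


# Usage example (toy shapes):
x = torch.randn(2, 256, 512)
layer = HeadGatedLocalAttention(d_model=512, n_heads=8, window_size=128)
y = layer(x)  # (2, 256, 512)
```

Applying a sigmoid gate per head lets the model softly down-weight heads whose local context is uninformative for a given token, which is one plausible reading of "learned per-head gating coefficients that adapt to input content."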
Submission history
[v1] Thu, 30 Oct 2025 06:14 UTC