[Submitted on 28 Oct 2025]
Dynamic Sparse Attention for Efficient Language Modeling
Abstract: We present a dynamic sparse attention mechanism that combines learned content-aware gating with efficient windowed attention patterns. Our approach addresses the quadratic complexity of standard attention while maintaining modeling performance. Evaluated on the FineWeb dataset with a 134M-parameter model, our method achieves a validation loss of 4.904, outperforming a standard attention baseline (4.9266) while reducing memory usage by 21%. The key innovations are: (1) dynamic head gating that adapts computation to the input content, and (2) hybrid attention patterns that combine local windowing with global information flow. Experiments show that our method effectively balances computational efficiency and model quality, with particular advantages on longer sequences. We provide extensive ablation studies validating our design choices and discuss directions for future improvements.
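To make the two components concrete, here is a minimal sketch of how content-aware head gating could be combined with a causal sliding-window mask plus a handful of global tokens. This is not the authors' released implementation; the module name, the window size, the number of global tokens, and the mean-pooled gating network are illustrative assumptions.

```python
# Sketch only: content-aware head gating + local-window attention with global tokens.
# All hyperparameters and module names are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedWindowedAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, window=128, n_global=16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Gate network (assumed): one scalar gate per head, predicted from pooled content.
        self.gate = nn.Linear(d_model, n_heads)
        self.window, self.n_global = window, n_global

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, H, T, d_head)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))

        # Hybrid mask: causal local window, plus attention to the first
        # n_global tokens for long-range information flow.
        idx = torch.arange(T, device=x.device)
        rel = idx[None, :] - idx[:, None]            # key index minus query index
        causal = rel <= 0                            # no attending to the future
        local = rel > -self.window                   # within the sliding window
        global_cols = idx[None, :] < self.n_global   # global key positions
        mask = causal & (local | global_cols)        # (T, T), True = attend

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        scores = scores.masked_fill(~mask, float('-inf'))
        attn = F.softmax(scores, dim=-1) @ v         # (B, H, T, d_head)

        # Content-aware head gating: sigmoid gates in [0, 1] scale each head's
        # output based on the mean-pooled input sequence.
        g = torch.sigmoid(self.gate(x.mean(dim=1)))  # (B, H)
        attn = attn * g[:, :, None, None]

        return self.out(attn.transpose(1, 2).reshape(B, T, D))

# Example usage:
# y = GatedWindowedAttention()(torch.randn(2, 256, 512))   # (2, 256, 512)
```

In this reading, the gating lets uninformative heads be suppressed per input, while the mask keeps per-token attention cost linear in sequence length apart from the fixed set of global positions.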
Submission history
[v1] Tue, 28 Oct 2025 16:24 UTC