[Submitted on 31 Oct 2025]
Hybrid Dynamic Sparse Attention
Abstract: We present a careful analysis of Hybrid Dynamic Sparse Attention (HDSA), which combines local and global attention patterns through learned gating. After addressing initial measurement artifacts, our verified implementation shows an 18.7% reduction in validation loss (4.04 vs. a baseline of 4.93) on the FineWeb benchmark using a Qwen3 architecture, at comparable computational cost. The revised results demonstrate that dynamic pattern combination can improve model performance without increasing asymptotic complexity. We provide complete implementation details, multiple training runs, and thorough ablation studies to validate our findings. The work includes an analysis of computational tradeoffs and identifies key limitations related to pattern interference that future work should address.
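The core mechanism described in the abstract, mixing a local (windowed) attention branch with a global attention branch via a learned per-token gate, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation; the gate parameterization (`w_gate`, a linear map on the query followed by a sigmoid) and the window size are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(q, k, v, w_gate, window=2):
    """Toy hybrid attention for one head: mix a local (sliding-window,
    causal) branch and a global (full causal) branch with a learned
    per-token sigmoid gate. Shapes: q, k, v are (T, d); w_gate is (d, 1)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T)

    # Global branch: full causal attention.
    causal = np.tril(np.ones((T, T), dtype=bool))
    global_out = softmax(np.where(causal, scores, -np.inf)) @ v

    # Local branch: causal attention restricted to the last `window` tokens.
    idx = np.arange(T)
    local_mask = causal & (idx[None, :] > idx[:, None] - window)
    local_out = softmax(np.where(local_mask, scores, -np.inf)) @ v

    # Learned gate: linear map on the query, squashed to (0, 1).
    gate = 1.0 / (1.0 + np.exp(-(q @ w_gate)))         # (T, 1)
    return gate * local_out + (1.0 - gate) * global_out
```

Because the gate is computed per token, the model can route positions that need long-range context toward the global branch and the rest toward the cheaper local branch; in a practical implementation the global branch would itself be sparse so the asymptotic cost does not grow.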
Submission history
[v1] Fri, 31 Oct 2025 02:02 UTC