[Submitted on 28 Oct 2025]
Robust Implementation of Grouped Query Attention with Query-Key Normalization
Abstract: This paper presents a detailed implementation of grouped query attention (GQA) with query-key normalization for transformer language models. While GQA was introduced in \cite{gqa} to improve efficiency, practical implementations often face challenges with dimension handling and numerical stability. Our work provides a robust implementation that correctly handles the expansion of key-value heads to match query heads while incorporating RMS normalization for queries and keys. Through ablation studies and comparisons with baseline models, we demonstrate both the implementation pitfalls and the corresponding solutions for stable GQA training. Experiments on the FineWeb dataset show that our implementation achieves better training stability than baseline approaches, though we note important limitations regarding generalization across model sizes and architectures.
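The abstract describes GQA with query-key normalization: key-value heads are shared across groups of query heads (so KV tensors must be expanded to the query-head count before attention), and queries and keys are RMS-normalized before the dot product. The paper's own code is not shown on this page, so the following is only an illustrative NumPy sketch of that combination; all function and variable names are hypothetical, and the grouping/normalization details may differ from the authors' implementation.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMS normalization over the last (per-head feature) dimension
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def gqa_qk_norm(q, k, v, n_kv_heads):
    # Hypothetical sketch of grouped query attention with QK normalization.
    # q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d); n_heads % n_kv_heads == 0
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    # Dimension expansion: repeat each KV head for its group of query heads
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    # Query-key normalization before the dot product, for numerical stability
    qn, kn = rms_norm(q), rms_norm(k)
    scores = qn @ kn.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key dimension
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v  # (n_heads, seq, d)

# Example: 8 query heads sharing 2 KV heads
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 6, 16))
k = rng.standard_normal((2, 6, 16))
v = rng.standard_normal((2, 6, 16))
out = gqa_qk_norm(q, k, v, n_kv_heads=2)
```

Because the normalized scores are bounded before the softmax, this style of QK normalization is one common way to avoid attention-logit blow-up during training; the paper's ablations presumably compare it against the unnormalized baseline.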
Submission history
[v1] Tue, 28 Oct 2025 23:57 UTC