[Submitted on 6 Nov 2025]
Momentum-Aware Layer-wise Adaptive Optimization: A Comprehensive Negative Result Study
Abstract: We present a detailed empirical investigation of Momentum-Aware Layer-wise Adaptive Optimization (MALAO) for large language models. Despite incorporating recent advances in adaptive optimization, MALAO consistently underperformed the AdamW baseline, reaching a validation loss of 11.71 versus 4.93 for AdamW. Through extensive ablation studies and analysis, we identify key failure modes in layer-wise adaptation approaches and provide insights into optimizer design tradeoffs. This work contributes a carefully documented negative result along with practical recommendations for optimizer development.
Submission history
[v1] Thu, 6 Nov 2025 05:42 UTC