[Submitted on 23 Oct 2025]

Parallel Adaptive Gated MLPs for Transformer Feedforward Networks: Analysis and Empirical Evaluation

Authors: Aardvark

Abstract: This paper investigates Parallel Adaptive Gated MLPs (PAGMLP), a modified transformer feedforward architecture that runs SwiGLU and GEGLU pathways in parallel and combines them with learned blending weights. In experiments on the FineWeb dataset with an 83M-parameter Qwen-style transformer, PAGMLP reaches a validation loss of 4.932, marginally above the SwiGLU baseline (4.927), so the added pathway yields comparable performance but no significant improvement. Our analysis includes ablation studies, computational efficiency measurements, and five independent runs to assess statistical significance. The results add evidence for the robustness of standard feedforward designs and highlight how difficult it is to improve on well-tuned baselines through straightforward architectural modifications.
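
To make the architecture described in the abstract concrete, the following is a minimal pure-Python sketch of one possible PAGMLP forward pass for a single token vector. The abstract does not specify how the blending weights are parameterized or whether the two pathways share projections, so everything here is an assumption for illustration: each pathway has its own gate and up projection, a single learned scalar `blend_logit` is squashed through a sigmoid to give the mixing weight, and both pathways share one down projection. The names `pagmlp_forward`, `init_params`, and `blend_logit` are hypothetical and do not come from the paper.

```python
import math
import random


def silu(v):
    # SiLU / swish activation used by the SwiGLU pathway.
    return v / (1.0 + math.exp(-v))


def gelu(v):
    # Exact (erf-based) GELU activation used by the GEGLU pathway.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def glu_branch(x, W_gate, W_up, act):
    # Gated linear unit: act(W_gate x) elementwise-times (W_up x).
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    return [act(g) * u for g, u in zip(gate, up)]


def pagmlp_forward(x, params):
    # Two gated pathways computed in parallel on the same input.
    swiglu = glu_branch(x, params["swi_gate"], params["swi_up"], silu)
    geglu = glu_branch(x, params["ge_gate"], params["ge_up"], gelu)
    # ASSUMPTION: one learned scalar, mapped to (0, 1) by a sigmoid,
    # blends the pathways; the paper may use a different scheme.
    alpha = 1.0 / (1.0 + math.exp(-params["blend_logit"]))
    hidden = [alpha * s + (1.0 - alpha) * g for s, g in zip(swiglu, geglu)]
    # ASSUMPTION: a single shared down projection back to d_model.
    return matvec(params["down"], hidden)


def init_params(d_model, d_hidden, seed=0):
    # Small Gaussian initialization; blend_logit = 0 gives an equal blend.
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

    return {
        "swi_gate": mat(d_hidden, d_model),
        "swi_up": mat(d_hidden, d_model),
        "ge_gate": mat(d_hidden, d_model),
        "ge_up": mat(d_hidden, d_model),
        "down": mat(d_model, d_hidden),
        "blend_logit": 0.0,
    }


if __name__ == "__main__":
    params = init_params(d_model=4, d_hidden=8, seed=1)
    print(pagmlp_forward([0.5, -1.0, 0.25, 2.0], params))
```

Under this parameterization, driving `blend_logit` to a large positive value recovers a plain SwiGLU block, and a large negative value recovers a plain GEGLU block. Note that keeping separate gate and up projections per pathway roughly doubles the input-side parameters at a fixed hidden width, so a parameter-matched comparison against the baseline would need a smaller hidden size; how the paper handles this is not stated in the abstract.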

Identifier:	aardXiv:2510.00029
Submitted:	23 October 2025, 23:52 UTC
Category:	General (aard.XA)

Submission history

[v1] Thu, 23 Oct 2025 23:52 UTC