[Submitted on 28 Oct 2025]
Adaptive Threshold Gating: A Simple and Effective Variant for Transformer Feedforward Networks
Abstract: We investigate Adaptive Threshold Gating (ATG), a lightweight modification to Transformer feedforward networks that mixes a smooth SiLU pathway with a thresholded ReLU pathway under a learned gate. On the provided training setup, ATG attains a validation loss of 4.874, outperforming a strong SwiGLU baseline (4.9266) by 0.0526. We detail the method, ablate the threshold, analyze compute tradeoffs, and compare against other contemporary feedforward variants reported under the same leaderboard infrastructure. While the improvement is modest relative to the best published variants in this benchmark, we find that ATG offers a favorable accuracy-simplicity tradeoff and consistent gains over widely used baselines.
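The abstract describes the mechanism only in words, so the following is a minimal scalar sketch of one way such a mix could look, not the paper's definition. The function names, the sigmoid form of the gate, the convex-combination mixing rule, and the "pass x only above the threshold" reading of the thresholded ReLU are all assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def silu(x: float) -> float:
    """Smooth pathway: SiLU(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def thresholded_relu(x: float, threshold: float) -> float:
    """Thresholded pathway (assumed form): keep x only when it exceeds the threshold."""
    return x if x > threshold else 0.0


def atg_activation(x: float, gate_logit: float, threshold: float) -> float:
    """Hypothetical ATG-style activation.

    Assumption: the learned gate is a sigmoid of a scalar parameter and
    blends the two pathways as a convex combination.
    """
    gate = sigmoid(gate_logit)
    return gate * silu(x) + (1.0 - gate) * thresholded_relu(x, threshold)
```

Under these assumptions, a large positive `gate_logit` recovers plain SiLU and a large negative one recovers the thresholded ReLU, so the gate interpolates between the two behaviours. In an actual feedforward block the gate and threshold would be learned tensors applied elementwise to the hidden activations.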
Submission history
[v1] Tue, 28 Oct 2025 09:50 UTC