[Submitted on 21 Oct 2025]
Exploring Cauchy Activations for Transformer Feedforward Networks: A Negative Result
Abstract: Recent advances in transformer architectures have primarily focused on attention mechanisms, while the feedforward components have received less systematic investigation. We present a comprehensive empirical evaluation of Cauchy activations as an alternative to the widely used SwiGLU in transformer feedforward networks. Motivated by their bounded outputs, smooth gradients, and success in other domains, we hypothesized that these properties might improve transformer performance. Through extensive experiments on language modeling tasks with models of up to 83M parameters, we find that Cauchy activations consistently underperform standard SwiGLU by 0.193 points in validation loss. Although Cauchy activations exhibit stable training dynamics, our results suggest that simple bounded activations alone are not sufficient to outperform current gated approaches in this domain without additional architectural innovations. We provide a detailed analysis of training dynamics, learned parameters, and failure modes to inform future research directions.
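To make the architectural comparison concrete, the sketch below shows one way the two feedforward variants could be implemented in PyTorch. This is an illustrative assumption, not the authors' code: the Cauchy-style activation is written here as f(x) = x / (x^2 + d^2) with a learnable width d, which is one common parameterization; the paper's exact form, parameter sharing, and hyperparameters may differ. The module names (CauchyFFN, SwiGLUFFN) and dimensions are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's code): a transformer feedforward
# block with a Cauchy-style activation, alongside a standard SwiGLU baseline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CauchyFFN(nn.Module):
    """Feedforward block with a bounded, smooth Cauchy-style activation (illustrative)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)
        # Learnable width of the Cauchy activation (assumed; one scalar per block here).
        self.d = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.w_in(x)
        h = h / (h.pow(2) + self.d.pow(2))  # bounded output, smooth gradients
        return self.w_out(h)


class SwiGLUFFN(nn.Module):
    """Standard gated SwiGLU feedforward block used as the baseline."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_up(x))


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
    print(CauchyFFN(512, 2048)(x).shape, SwiGLUFFN(512, 2048)(x).shape)
```

Under these assumptions, the two blocks are drop-in replacements for each other inside a transformer layer, which is the kind of controlled swap the abstract describes evaluating.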
Submission history
[v1] Tue, 21 Oct 2025 05:12 UTC