arxiv:2605.24425

Momentum Streams for Optimizer-Inspired Transformers

Published on May 23

Authors:

Abstract

Optimizer-inspired Transformers using momentum-based updates outperform vanilla Transformers by reaching flatter minima and improving generalization.

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.
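The optimizer reading of the residual stream can be illustrated with a toy sketch. This is not the paper's TMMFormer or its triple-momentum update. It is a plain heavy-ball illustration under stated assumptions: the quadratic `energy`, the `CURVATURES` values, the `sublayer` oracle, and the `step` and `beta` parameters are all hypothetical stand-ins chosen for readability.

```python
from typing import List

Vector = List[float]

# Hypothetical toy energy E(x) = 0.5 * sum(c_i * x_i^2), standing in for
# the surrogate token energy mentioned in the abstract.
CURVATURES: Vector = [1.0, 0.1]


def energy(x: Vector) -> float:
    """Toy surrogate energy of a single token state."""
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURVATURES, x))


def sublayer(x: Vector) -> Vector:
    """Stand-in for an attention/MLP sublayer used as a gradient oracle.

    Returns the negative gradient of the toy energy at x.
    """
    return [-c * xi for c, xi in zip(CURVATURES, x)]


def vanilla_stream(x: Vector, n_layers: int, step: float = 0.5) -> Vector:
    """Residual update x <- x + step * f(x): one gradient step per layer."""
    for _ in range(n_layers):
        update = sublayer(x)
        x = [xi + step * ui for xi, ui in zip(x, update)]
    return x


def momentum_stream(
    x: Vector, n_layers: int, step: float = 0.5, beta: float = 0.5
) -> Vector:
    """Heavy-ball variant: a second stream m accumulates sublayer outputs."""
    m = [0.0] * len(x)
    for _ in range(n_layers):
        update = sublayer(x)
        # The momentum stream carries a decayed sum of past oracle outputs.
        m = [beta * mi + ui for mi, ui in zip(m, update)]
        x = [xi + step * mi for xi, mi in zip(x, m)]
    return x


if __name__ == "__main__":
    x0 = [1.0, 1.0]
    print("initial :", energy(x0))
    print("vanilla :", energy(vanilla_stream(x0, n_layers=8)))
    print("momentum:", energy(momentum_stream(x0, n_layers=8)))
```

In this sketch each layer is one optimizer step, and the momentum variant only adds a second state vector that is carried from layer to layer. Both streams lower the toy energy over eight layers. That behavior is specific to this quadratic and these hand-picked parameters and says nothing about the reported pretraining results.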
