Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing
Abstract
Expert Threshold routing dynamically allocates computation in MoE models by using exponential moving average thresholds to route tokens independently, achieving better performance than Token-choice MoE without auxiliary losses.
Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.
Community
Expert Threshold Routing
Dynamic Computation and Load Balancing for Autoregressive Language Models
The Routing Trilemma
Mixture-of-Experts (MoE) routing faces a three-way tradeoff. Token Choice (TC) lets each token pick its top experts, but everyone crowds the popular ones, requiring auxiliary losses to patch load imbalance. Expert Choice (EC) flips the direction and lets experts pick tokens, achieving perfect load balance and dynamic computation. But EC needs to see the entire batch to make selections, breaking causality for autoregressive generation.
| Routing | Dynamic Computation | Load Balance | Autoregressive |
|---|---|---|---|
| Token Choice | ❌ Fixed top-k | ❌ Needs aux loss | ✅ |
| Expert Choice | ✅ Variable | ✅ Perfect | ❌ |
| Expert Threshold | ✅ Variable | ✅ Near-perfect | ✅ |
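The two baseline routing directions in the table can be sketched as simple selection rules over a score matrix. This is a minimal illustration (not the paper's implementation): token choice selects top-k experts per row, expert choice selects a top-`capacity` set of tokens per column, which is why EC needs the whole batch.

```python
import numpy as np

def token_choice(scores: np.ndarray, k: int) -> np.ndarray:
    """Each token (row) picks its top-k experts, ignoring expert load."""
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of best experts per token
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=1)        # mark selected (token, expert) pairs
    return mask

def expert_choice(scores: np.ndarray, capacity: int) -> np.ndarray:
    """Each expert (column) picks its top-`capacity` tokens from the batch."""
    topc = np.argsort(-scores, axis=0)[:capacity, :]   # indices of best tokens per expert
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, topc, True, axis=0)
    return mask
```

Note that `token_choice` guarantees exactly k experts per token (so computation is fixed, load is not), while `expert_choice` guarantees exactly `capacity` tokens per expert (so load is fixed, per-token computation is not), but only by looking across all rows of the batch.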
Key Idea
Load balance only needs to hold in expectation over the data distribution, not strictly within each batch. Expert Threshold (ET) routing maintains an exponential moving average (EMA) of each expert's selection cutoff from historical batches. A token activates an expert whenever its router score exceeds that threshold. No dependence on the current batch, no causality violation.
Conceptually, ET is equivalent to doing Expert Choice over an infinitely large batch. As EC's batch size grows, its per-batch cutoff converges to a fixed quantile of the global score distribution, which is exactly what ET's EMA estimates.
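The mechanism described above can be sketched as follows. This is a hedged toy version, not the authors' code: the class name, the EMA momentum value, and the use of a per-expert batch quantile as the EMA target are all illustrative assumptions; the key property shown is that routing itself reads only the stored thresholds, never the current batch.

```python
import numpy as np

class ExpertThresholdRouter:
    """Toy ET router: each expert keeps an EMA estimate of the score cutoff
    that would admit a target fraction of tokens under the data distribution."""

    def __init__(self, num_experts: int, target_frac: float, momentum: float = 0.9):
        self.thresholds = np.zeros(num_experts)  # per-expert EMA cutoffs
        self.target_frac = target_frac           # desired fraction of tokens per expert
        self.momentum = momentum

    def route(self, scores: np.ndarray) -> np.ndarray:
        # Fully causal: each token is admitted by expert e iff its score
        # exceeds expert e's stored threshold; no batch statistics are used.
        return scores > self.thresholds

    def update(self, scores: np.ndarray) -> None:
        # During training, fold this batch's per-expert cutoff (the quantile
        # that admits target_frac of tokens) into the EMA.
        batch_cut = np.quantile(scores, 1.0 - self.target_frac, axis=0)
        self.thresholds = self.momentum * self.thresholds + (1 - self.momentum) * batch_cut
```

As the EMA converges, each threshold approaches a fixed quantile of the global score distribution, so the fraction of tokens each expert serves approaches `target_frac` in expectation, which is the "EC over an infinitely large batch" view described above.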
Results
In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than Token Choice, equivalent to reaching the same performance with 1.6× fewer training tokens. ET also matches or slightly outperforms the best Expert Choice configuration while being fully causal at both training and inference.
The model learns to allocate more computation to structurally important tokens (sentence boundaries, numerical results) and less to predictable ones, leading to sharper expert specialization.
Citation
@article{sun2026expertthresholdrouting,
title={Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing},
author={Sun, Ryan and Liu, Yixin and Wu, Yonghui and Sun, Lichao},
journal={arXiv preprint arXiv:2603.11535},
year={2026},
url={https://arxiv.org/abs/2603.11535}
}
The following papers were recommended by the Semantic Scholar API
- Improving MoE Compute Efficiency by Composing Weight and Data Sparsity (2026)
- DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks (2026)
- A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs (2026)
- MoE-Spec: Expert Budgeting for Efficient Speculative Decoding (2026)
- Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts (2026)
- Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds (2026)
- DirMoE: Dirichlet-routed Mixture of Experts (2026)