LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters • arXiv:2405.17604 • Published May 27, 2024
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models • arXiv:2512.02556 • Published Dec 2, 2025
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers • arXiv:2510.11370 • Published Oct 13, 2025
Min P Sampling: Balancing Creativity and Coherence at High Temperature • arXiv:2407.01082 • Published Jul 1, 2024
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free • arXiv:2505.06708 • Published May 10, 2025
Gated Delta Networks: Improving Mamba2 with Delta Rule • arXiv:2412.06464 • Published Dec 9, 2024
Approximating Two-Layer Feedforward Networks for Efficient Transformers • arXiv:2310.10837 • Published Oct 16, 2023
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention • arXiv:2006.16236 • Published Jun 29, 2020
Fast Inference from Transformers via Speculative Decoding • arXiv:2211.17192 • Published Nov 30, 2022
Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models • arXiv:2411.02083 • Published Nov 4, 2024
DAPO: An Open-Source LLM Reinforcement Learning System at Scale • arXiv:2503.14476 • Published Mar 18, 2025
Understanding R1-Zero-Like Training: A Critical Perspective • arXiv:2503.20783 • Published Mar 26, 2025
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale • arXiv:2010.11929 • Published Oct 22, 2020
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models • arXiv:2402.03300 • Published Feb 5, 2024
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations • arXiv:2312.08935 • Published Dec 14, 2023
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning • arXiv:2501.12948 • Published Jan 22, 2025
Direct Preference Optimization: Your Language Model is Secretly a Reward Model • arXiv:2305.18290 • Published May 29, 2023
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model • arXiv:2405.04434 • Published May 7, 2024
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models • arXiv:2401.06066 • Published Jan 11, 2024
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts • arXiv:2408.15664 • Published Aug 28, 2024
ST-MoE: Designing Stable and Transferable Sparse Expert Models • arXiv:2202.08906 • Published Feb 17, 2022
Gemma 2: Improving Open Language Models at a Practical Size • arXiv:2408.00118 • Published Jul 31, 2024
Effective Approaches to Attention-based Neural Machine Translation • arXiv:1508.04025 • Published Aug 17, 2015
Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning • arXiv:1702.03118 • Published Feb 10, 2017
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints • arXiv:2305.13245 • Published May 22, 2023
RoFormer: Enhanced Transformer with Rotary Position Embedding • arXiv:2104.09864 • Published Apr 20, 2021
YaRN: Efficient Context Window Extension of Large Language Models • arXiv:2309.00071 • Published Aug 31, 2023