Article: Ulysses Sequence Parallelism: Training with Million-Token Contexts • 7 days ago • 20
Article: FlashHead: Accelerating Language Model Inference (efficient drop-in replacement for the classification head) • 4 days ago • 1
Collection: Nemotron-Pre-Training-Datasets. Large-scale pre-training datasets used in the Nemotron family of models • 12 items • Updated 5 days ago • 121
Paper: Lost in Backpropagation: The LM Head is a Gradient Bottleneck • 2603.10145 • Published 5 days ago • 7
Collection: NVIDIA Nemotron v3. Open, production-ready enterprise models • 12 items • Updated 4 days ago • 200
Collection: MixtureVitae study models and datasets. Models and datasets related to MixtureVitae, an open and fully reproducible pretraining dataset built from permissive sources • 16 items • Updated Feb 13 • 1
Space: The Synthetic Data Playbook: Generating Trillions of the Finest Tokens 📝. Explore synthetic data experiments on a virtual bookshelf • 183
Article: Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens • 10 days ago • 4
Collection: 🤏 Smol-Data. Tried-and-tested mixes for strong pretraining, inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated 14 days ago • 12