Common Corpus Collection Largest multilingual pretraining data. • 1 item • Updated Nov 13, 2024 • 13
Common Models Collection The first generation of models pretrained on Common Corpus. • 5 items • Updated Dec 5, 2024 • 41
SYNTH Collection Fully generalist synthetic dataset and SOTA small reasoners • 3 items • Updated Nov 10, 2025 • 11
CleanComedy: Creating Friendly Humor through Generative Techniques Paper • 2412.09203 • Published Dec 12, 2024
Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family Paper • 2504.18225 • Published Apr 25, 2025 • 14
Bad Data Toolbox Collection PleIAs collection of models for processing challenging document and data sources. • 5 items • Updated Jul 18, 2024 • 19
Finance Commons Collection A large collection of multimodal financial documents in open data. • 7 items • Updated Jul 17, 2024 • 12