BigScience Workshop

non-profit

https://huggingface.co/proxy/bigscience.huggingface.co

bigscience-workshop

AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

lintang authored a paper 16 days ago

Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning

RTT1 authored a paper 18 days ago

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

christopher new activity 22 days ago

bigscience/bloom:[SPAM] Deleted

authored a paper 23 days ago

LLM2Vec-Gen: Generative Embeddings from Large Language Models

Paper • 2603.10913 • Published 23 days ago • 43

in bigscience/bloom 30 days ago

pretokenizer Regex issues?

#278 opened over 1 year ago by

in bigscience/bloom about 1 month ago

Test PR

#286 opened about 1 month ago by

Test discussion

#287 opened about 1 month ago by

Test discussion

#288 opened about 1 month ago by

authored a paper about 2 months ago

A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)

Paper • 2602.14696 • Published Feb 16

submitted a paper to Daily Papers about 2 months ago

Effective Reasoning Chains Reduce Intrinsic Dimensionality

Paper • 2602.09276 • Published Feb 9 • 11

authored 11 papers about 2 months ago

2 OLMo 2 Furious

Paper • 2501.00656 • Published Dec 31, 2024 • 22

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

Paper • 2502.10341 • Published Feb 14, 2025 • 3

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Paper • 2502.18443 • Published Feb 25, 2025 • 11

DataDecide: How to Predict Best Pretraining Data with Small Experiments

Paper • 2504.11393 • Published Apr 15, 2025 • 18

Teaching Models to Understand (but not Generate) High-risk Data

Paper • 2505.03052 • Published May 5, 2025 • 6

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5, 2025 • 60

FlexOlmo: Open Language Models for Flexible Data Use

Paper • 2507.07024 • Published Jul 9, 2025 • 10

olmOCR 2: Unit Test Rewards for Document OCR

Paper • 2510.19817 • Published Oct 22, 2025 • 16

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Paper • 2511.19399 • Published Nov 24, 2025 • 63

Olmo 3

Paper • 2512.13961 • Published Dec 15, 2025 • 31

Bolmo: Byteifying the Next Generation of Language Models

Paper • 2512.15586 • Published Dec 17, 2025 • 17