BigScience Workshop

non-profit

https://bigscience.huggingface.co

bigscienceW

bigscience-workshop

AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

julien-c submitted a paper 9 days ago

Shaping capabilities with token-level data filtering

pjox authored a paper 10 days ago

SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing

afaji authored a paper 12 days ago

PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

soldni

authored 11 papers 4 days ago

2 OLMo 2 Furious

Paper • 2501.00656 • Published Dec 31, 2024 • 22

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

Paper • 2502.10341 • Published Feb 14, 2025 • 3

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Paper • 2502.18443 • Published Feb 25, 2025 • 9

DataDecide: How to Predict Best Pretraining Data with Small Experiments

Paper • 2504.11393 • Published Apr 15, 2025 • 18

Teaching Models to Understand (but not Generate) High-risk Data

Paper • 2505.03052 • Published May 5, 2025 • 6

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5, 2025 • 60

FlexOlmo: Open Language Models for Flexible Data Use

Paper • 2507.07024 • Published Jul 9, 2025 • 9

olmOCR 2: Unit Test Rewards for Document OCR

Paper • 2510.19817 • Published Oct 22, 2025 • 16

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Paper • 2511.19399 • Published Nov 24, 2025 • 61

Olmo 3

Paper • 2512.13961 • Published Dec 15, 2025 • 28

Bolmo: Byteifying the Next Generation of Language Models

Paper • 2512.15586 • Published Dec 17, 2025 • 17

gentaiscool

authored a paper 12 days ago

PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

Paper • 2601.17277 • Published 16 days ago • 5

yjernite

authored a paper 13 days ago

INTIMA: A Benchmark for Human-AI Companionship Behavior

Paper • 2508.09998 • Published Aug 4, 2025 • 11

armanc

authored a paper 23 days ago

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Paper • 2601.09876 • Published 25 days ago • 6

shubhamagarwal92

authored a paper about 1 month ago

BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Paper • 2511.10338 • Published Nov 13, 2025

christopher

in bigscience/bloomz-560m 2 months ago

Fails to load with transformers v4.57+

#14 opened 2 months ago by

nihalnayak

authored a paper 2 months ago

Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Paper • 2511.21692 • Published Nov 26, 2025 • 15

shannons

authored 3 papers 3 months ago

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

Paper • 2406.07835 • Published Jun 10, 2024 • 2

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Paper • 2510.09541 • Published Oct 10, 2025 • 17

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Paper • 2511.19399 • Published Nov 24, 2025 • 61