2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions.
CoRR, 2024

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order.
CoRR, 2024

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning.
CoRR, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

2022
BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing.
CoRR, 2022

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022