2025
MMTEB: Massive Multilingual Text Embedding Benchmark.
CoRR, February, 2025

SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Bridging the Data Provenance Gap Across Text, Speech, and Video.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs.
CoRR, 2024

Bridging the Data Provenance Gap Across Text, Speech and Video.
CoRR, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.
CoRR, 2024

StarCoder 2 and The Stack v2: The Next Generation.
CoRR, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

2023
StarCoder: may the source be with you!
Trans. Mach. Learn. Res., 2023

SantaCoder: don't reach for the stars!
CoRR, 2023

2022
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.
CoRR, 2022

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Multitask Prompted Training Enables Zero-Shot Task Generalization.
Proceedings of the Tenth International Conference on Learning Representations, 2022

How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

2021
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP.
CoRR, 2021

Multitask Prompted Training Enables Zero-Shot Task Generalization.
CoRR, 2021

Evaluating Gender Bias in Natural Language Inference.
CoRR, 2021

2020
Assessing Viewer's Mental Health by Detecting Depression in YouTube Videos.
CoRR, 2020