Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

Bridging the Data Provenance Gap Across Text, Speech, and Video.

Shayne Longpre

Nikhil Singh

et al.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs.

CoRR, 2024

Bridging the Data Provenance Gap Across Text, Speech and Video.

CoRR, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.

CoRR, 2024

StarCoder 2 and The Stack v2: The Next Generation.

CoRR, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

2023

StarCoder: may the source be with you!

Evgenii Zheltonozhskii

Logesh Kumar Umapathi

Urvashi Bhattacharyya

Carolyn Jane Anderson

Carlos Muñoz Ferrandis

Trans. Mach. Learn. Res., 2023

SantaCoder: don't reach for the stars!

CoRR, 2023

2022

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.

CoRR, 2022

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset.

Albert Villanova del Moral

Teven Le Scao

Leandro von Werra

Chenghao Mou

Eduardo González Ponferrada

Angelina McMillan-Major

David Ifeoluwa Adelani

Alexandra Sasha Luccioni

Yacine Jernite

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Multitask Prompted Training Enables Zero-Shot Task Generalization.

Proceedings of the Tenth International Conference on Learning Representations, 2022

How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts.

Shanya Sharma

Manan Dey

Koustuv Sinha

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

2021

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP.

CoRR, 2021

Multitask Prompted Training Enables Zero-Shot Task Generalization.

CoRR, 2021

Evaluating Gender Bias in Natural Language Inference.

Shanya Sharma

Manan Dey

Koustuv Sinha

CoRR, 2021

2020

Assessing Viewer's Mental Health by Detecting Depression in YouTube Videos.

Shanya Sharma

Manan Dey

CoRR, 2020