2024
A Survey on Data Selection for Language Models.
Trans. Mach. Learn. Res., 2024

A large-scale audit of dataset licensing and attribution in AI.
Nat. Mac. Intell., 2024

Bridging the Data Provenance Gap Across Text, Speech and Video.
CoRR, 2024

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation.
CoRR, 2024

Scaling Laws for Precision.
CoRR, 2024

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
CoRR, 2024

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models.
CoRR, 2024

OLMoE: Open Mixture-of-Experts Language Models.
CoRR, 2024

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents.
CoRR, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.
CoRR, 2024

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.
CoRR, 2024

BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval.
CoRR, 2024

RegMix: Data Mixture as Regression for Language Model Pre-training.
CoRR, 2024

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions.
CoRR, 2024

DataComp-LM: In search of the next generation of training sets for language models.
CoRR, 2024

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages.
CoRR, 2024

Lessons from the Trenches on Reproducible Evaluation of Language Models.
CoRR, 2024

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence.
CoRR, 2024

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order.
CoRR, 2024

Language models scale reliably with over-training and on downstream tasks.
CoRR, 2024

StarCoder 2 and The Stack v2: The Next Generation.
CoRR, 2024

KMMLU: Measuring Massive Multitask Language Understanding in Korean.
CoRR, 2024

Generative Representational Instruction Tuning.
CoRR, 2024

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning.
CoRR, 2024

KTO: Model Alignment as Prospect Theoretic Optimization.
CoRR, 2024

OLMo: Accelerating the Science of Language Models.
CoRR, 2024

Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models.
CoRR, 2024

C-Pack: Packed Resources For General Chinese Embeddings.
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

DataComp-LM: In search of the next generation of training sets for language models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Model Alignment as Prospect Theoretic Optimization.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

OctoPack: Instruction Tuning Code Large Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

OLMo: Accelerating the Science of Language Models.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.
Trans. Mach. Learn. Res., 2023

StarCoder: may the source be with you!
Trans. Mach. Learn. Res., 2023

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI.
CoRR, 2023

C-Pack: Packaged Resources To Advance General Chinese Embedding.
CoRR, 2023

SantaCoder: don't reach for the stars!
CoRR, 2023

Scaling Data-Constrained Language Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

FinGPT: Large Generative Models for a Small Language.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

MTEB: Massive Text Embedding Benchmark.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Crosslingual Generalization through Multitask Finetuning.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.
CoRR, 2022

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.
CoRR, 2022

What Language Model to Train if You Have One Million GPU Hours?
CoRR, 2022

SGPT: GPT Sentence Embeddings for Semantic Search.
CoRR, 2022

What Language Model to Train if You Have One Million GPU Hours?
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

2021
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation.
CoRR, 2021

Diagnosing the Impact of AI on Radiology in China.
CoRR, 2021

2020
Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes.
CoRR, 2020

The Hateful Memes Challenge: Competition Report.
Proceedings of the NeurIPS 2020 Competition and Demonstration Track, 2020