A Survey on Data Selection for Language Models.

Trans. Mach. Learn. Res., 2024

A large-scale audit of dataset licensing and attribution in AI.

Nat. Mach. Intell., 2024

Bridging the Data Provenance Gap Across Text, Speech and Video.

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation.

Scaling Laws for Precision.

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models.

OLMoE: Open Mixture-of-Experts Language Models.

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents.

Consent in Crisis: The Rapid Decline of the AI Data Commons.

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.

BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval.

RegMix: Data Mixture as Regression for Language Model Pre-training.

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions.

DataComp-LM: In search of the next generation of training sets for language models.

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages.

Lessons from the Trenches on Reproducible Evaluation of Language Models.

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence.

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order.

Language models scale reliably with over-training and on downstream tasks.

StarCoder 2 and The Stack v2: The Next Generation.

KMMLU: Measuring Massive Multitask Language Understanding in Korean.

Generative Representational Instruction Tuning.

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning.

KTO: Model Alignment as Prospect Theoretic Optimization.

OLMo: Accelerating the Science of Language Models.

Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models.

C-Pack: Packed Resources For General Chinese Embeddings.

Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

DataComp-LM: In search of the next generation of training sets for language models.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

Model Alignment as Prospect Theoretic Optimization.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

OctoPack: Instruction Tuning Code Large Language Models.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

OLMo: Accelerating the Science of Language Models.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.

Trans. Mach. Learn. Res., 2023

StarCoder: may the source be with you!

Trans. Mach. Learn. Res., 2023

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI.

C-Pack: Packaged Resources To Advance General Chinese Embedding.

SantaCoder: don't reach for the stars!

Scaling Data-Constrained Language Models.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

FinGPT: Large Generative Models for a Small Language.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

MTEB: Massive Text Embedding Benchmark.

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Crosslingual Generalization through Multitask Finetuning.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.

What Language Model to Train if You Have One Million GPU Hours?

SGPT: GPT Sentence Embeddings for Semantic Search.

What Language Model to Train if You Have One Million GPU Hours?

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation.

Diagnosing the Impact of AI on Radiology in China.

Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes.

The Hateful Memes Challenge: Competition Report.

Proceedings of the NeurIPS 2020 Competition and Demonstration Track, 2020