A Survey on Data Selection for Language Models.
Trans. Mach. Learn. Res., 2024
A large-scale audit of dataset licensing and attribution in AI.
Nat. Mach. Intell., 2024
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation.
CoRR, 2024
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
CoRR, 2024
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents.
CoRR, 2024
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.
CoRR, 2024
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval.
CoRR, 2024
RegMix: Data Mixture as Regression for Language Model Pre-training.
CoRR, 2024
KMMLU: Measuring Massive Multitask Language Understanding in Korean.
CoRR, 2024
Generative Representational Instruction Tuning.
CoRR, 2024
KTO: Model Alignment as Prospect Theoretic Optimization.
CoRR, 2024
Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models.
CoRR, 2024
C-Pack: Packed Resources For General Chinese Embeddings.
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
Consent in Crisis: The Rapid Decline of the AI Data Commons.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
DataComp-LM: In search of the next generation of training sets for language models.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
Model Alignment as Prospect Theoretic Optimization.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
OctoPack: Instruction Tuning Code Large Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
OLMo: Accelerating the Science of Language Models.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI.
CoRR, 2023
C-Pack: Packaged Resources To Advance General Chinese Embedding.
CoRR, 2023
Scaling Data-Constrained Language Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
FinGPT: Large Generative Models for a Small Language.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
MTEB: Massive Text Embedding Benchmark.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
Crosslingual Generalization through Multitask Finetuning.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
SGPT: GPT Sentence Embeddings for Semantic Search.
CoRR, 2022
What Language Model to Train if You Have One Million GPU Hours?
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022
Diagnosing the Impact of AI on Radiology in China.
CoRR, 2021
Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes.
CoRR, 2020
The Hateful Memes Challenge: Competition Report.
Proceedings of the NeurIPS 2020 Competition and Demonstration Track, 2020