2025
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection.
CoRR, June, 2025

EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition.
CoRR, May, 2025

Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs.
CoRR, February, 2025

2024
LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps.
CoRR, 2024

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming.
CoRR, 2024

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order.
CoRR, 2024

RedPajama: an Open Dataset for Training Large Language Models.
Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

2023
OpenAssistant Conversations - Democratizing Large Language Model Alignment.
CoRR, 2023

SantaCoder: don't reach for the stars!
CoRR, 2023

OpenAssistant Conversations - Democratizing Large Language Model Alignment.
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.
CoRR, 2022

Data Governance in the Age of Large-Scale Data-Driven Language Technology.
CoRR, 2022

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset.
Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Data Governance in the Age of Large-Scale Data-Driven Language Technology.
Proceedings of FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, June 21, 2022

2021
Improving Transformer-Based Neural Machine Translation with Prior Alignments.
Complex., 2021

Sublemma-Based Neural Machine Translation.
Complex., 2021

2020
Projecting dependency syntax labels from English into Vietnamese in English-Vietnamese bilingual corpus.
Int. J. Intell. Inf. Database Syst., 2020

Mixed-Level Neural Machine Translation.
Comput. Intell. Neurosci., 2020

2019
Preordering for Chinese-Vietnamese Statistical Machine Translation.
IEICE Trans. Inf. Syst., 2019