ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance.
CoRR, January, 2025
To Err Is AI: A Case Study Informing LLM Flaw Reporting Practices.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024
What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023