Hannah Kirk
ORCID: 0000-0002-7419-5993
According to our database, Hannah Kirk authored at least 35 papers between 2021 and 2024.
Bibliography
2024
Correction to: The AI community building the future? A quantitative analysis of development activity on Hugging Face Hub.
J. Comput. Soc. Sci., October, 2024
The AI community building the future? A quantitative analysis of development activity on Hugging Face Hub.
J. Comput. Soc. Sci., October, 2024
The benefits, risks and bounds of personalizing the alignment of large language models to individuals.
Nat. Mac. Intell., 2024
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages.
CoRR, 2024
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.
CoRR, 2024
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024
Proceedings of the 2024 International Conference on Information Technology for Social Good, 2024
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation.
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
2023
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models.
CoRR, 2023
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models.
CoRR, 2023
Casteist but Not Racist? Quantifying Disparities in Large Language Model Bias between India and the West.
CoRR, 2023
DoDo Learning: DOmain-DemOgraphic Transfer in Language Models for Detecting Abuse Targeted at Public Figures.
CoRR, 2023
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets.
CoRR, 2023
Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models.
CoRR, 2023
Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback.
CoRR, 2023
Proceedings of the 17th International Workshop on Semantic Evaluation, 2023
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
2022
Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning.
CoRR, 2022
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning.
CoRR, 2022
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-Based Hate.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning.
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, 2022
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022
2021
Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset.
CoRR, 2021
Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021