The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.

[DOI]

Hannah Rose Kirk

Alexander Whitefield

Paul Röttger

CoRR, 2024

Introducing v0.5 of the AI Safety Benchmark from MLCommons.

[DOI]

Borhane Blili-Hamelin

Kurt D. Bollacker

Rishi Bommasani

Marisa Ferrara Boston

Zacharie Delpierre Coudert

Joseph Marvin Imperial

Dinesh Jinenhally Naganna

Forough Poursabzi-Sangdeh

Alice Schoenauer Sebag

Elizabeth Anne Watkins

CoRR, 2024

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety.

[DOI]

CoRR, 2024

The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.

[DOI]

Rafael Mosquera Gómez

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.

[DOI]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

Position: TrustLLM: Trustworthiness in Large Language Models.

[DOI]

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI.

[DOI]

Proceedings of the Forty-first International Conference on Machine Learning, 2024

2023

FinanceBench: A New Benchmark for Financial Question Answering.

[DOI]

CoRR, 2023

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models.

[DOI]

CoRR, 2023

The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models.

[DOI]

CoRR, 2023

Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback.

[DOI]

CoRR, 2023

SemEval-2023 Task 10: Explainable Detection of Online Sexism.

[DOI]

Proceedings of the 17th International Workshop on Semantic Evaluation, 2023

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values.

[DOI]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore.

[DOI]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Editorial for Special Issue on Detecting, Understanding and Countering Online Harms.

[DOI]

Online Soc. Networks Media, 2022

Tackling racial bias in automated online hate detection: Towards fair and accurate detection of hateful users with geometric deep learning.

[DOI]

Zo Ahmed

Bertie Vidgen

Scott A. Hale

EPJ Data Sci., 2022

How can we combat online misinformation? A systematic overview of current interventions and their efficacy.

[DOI]

CoRR, 2022

Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning.

[DOI]

Hannah Rose Kirk

Bertie Vidgen

Scott A. Hale

CoRR, 2022

Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models.

[DOI]

CoRR, 2022

Handling and Presenting Harmful Text.

[DOI]

CoRR, 2022

Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks.

[DOI]

Paul Röttger

Bertie Vidgen

Dirk Hovy

Janet B. Pierrehumbert

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-Based Hate.

[DOI]

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Handling and Presenting Harmful Text in NLP Research.

[DOI]

Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

2021

An influencer-based approach to understanding radical right viral tweets.

[DOI]

CoRR, 2021

Tackling Racial Bias in Automated Online Hate Detection: Towards Fair and Accurate Classification of Hateful Online Users Using Geometric Deep Learning.

[DOI]

Zo Ahmed

Bertie Vidgen

Scott A. Hale

CoRR, 2021

Introducing CAD: the Contextual Abuse Dataset.

[DOI]

Bertie Vidgen

Dong Nguyen

Helen Z. Margetts

Patrícia G. C. Rossini

Rebekah Tromble

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Dynabench: Rethinking Benchmarking in NLP.

[DOI]

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

An Expert Annotated Dataset for the Detection of Online Misogyny.

[DOI]

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021

Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection.

[DOI]

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

HateCheck: Functional Tests for Hate Speech Detection Models.

[DOI]

Janet B. Pierrehumbert

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

Deciphering Implicit Hate: Evaluating Automated Detection Algorithms for Multimodal Hate.

[DOI]

Austin Botelho

Scott A. Hale

Bertie Vidgen

Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, 2021

2020

Directions in Abusive Language Training Data: Garbage In, Garbage Out.

[DOI]

Bertie Vidgen

Leon Derczynski

CoRR, 2020