The benefits, risks and bounds of personalizing the alignment of large language models to individuals.
Nat. Mach. Intell., 2024
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests.
CoRR, 2024
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting.
CoRR, 2024
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.
CoRR, 2024
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety.
CoRR, 2024
The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
FinanceBench: A New Benchmark for Financial Question Answering.
CoRR, 2023
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models.
CoRR, 2023
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models.
CoRR, 2023
Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback.
CoRR, 2023
SemEval-2023 Task 10: Explainable Detection of Online Sexism.
Proceedings of the 17th International Workshop on Semantic Evaluation, 2023
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
Editorial for Special Issue on Detecting, Understanding and Countering Online Harms.
Online Soc. Networks Media, 2022
Tackling racial bias in automated online hate detection: Towards fair and accurate detection of hateful users with geometric deep learning.
EPJ Data Sci., 2022
How can we combat online misinformation? A systematic overview of current interventions and their efficacy.
CoRR, 2022
Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning.
CoRR, 2022
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models.
CoRR, 2022
Handling and Presenting Harmful Text.
CoRR, 2022
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-Based Hate.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
Handling and Presenting Harmful Text in NLP Research.
Findings of the Association for Computational Linguistics: EMNLP 2022, 2022
An influencer-based approach to understanding radical right viral tweets.
CoRR, 2021
Tackling Racial Bias in Automated Online Hate Detection: Towards Fair and Accurate Classification of Hateful Online Users Using Geometric Deep Learning.
CoRR, 2021
Introducing CAD: the Contextual Abuse Dataset.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
Dynabench: Rethinking Benchmarking in NLP.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
An Expert Annotated Dataset for the Detection of Online Misogyny.
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
HateCheck: Functional Tests for Hate Speech Detection Models.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
Deciphering Implicit Hate: Evaluating Automated Detection Algorithms for Multimodal Hate.
Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, 2021
Directions in Abusive Language Training Data: Garbage In, Garbage Out.
CoRR, 2020
Detecting East Asian Prejudice on Social Media.
Proceedings of the Fourth Workshop on Online Abuse and Harms, 2020
Online Abuse and Human Rights: WOAH Satellite Session at RightsCon 2020.
Proceedings of the Fourth Workshop on Online Abuse and Harms, 2020
Trajectories of Islamophobic hate amongst far right actors on Twitter.
CoRR, 2019
What, When and Where of petitions submitted to the UK Government during a time of chaos.
CoRR, 2019
Detecting weak and strong Islamophobic hate speech on social media.
CoRR, 2018
P-values: misunderstood and misused.
CoRR, 2016