Paul Röttger
According to our database, Paul Röttger authored at least 32 papers between 2021 and 2024.
Bibliography
2024
The benefits, risks and bounds of personalizing the alignment of large language models to individuals.
Nat. Mac. Intell., 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation.
CoRR, 2024
CoRR, 2024
From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets.
CoRR, 2024
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.
CoRR, 2024
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think.
CoRR, 2024
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety.
CoRR, 2024
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts.
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications, 2024
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Compromesso! Italian Many-Shot Jailbreaks undermine the safety of Large Language Models.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
2023
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models.
CoRR, 2023
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models.
CoRR, 2023
The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics.
CoRR, 2023
Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback.
CoRR, 2023
Proceedings of the 17th International Workshop on Semantic Evaluation, 2023
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023
Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
2022
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models.
CoRR, 2022
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-Based Hate.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
2021
Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, 2021
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021