Stephen Casper

ORCID: 0000-0003-0084-1937

According to our database, Stephen Casper authored at least 32 papers between 2019 and 2024.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2024

Obfuscated Activations Bypass LLM Latent-Space Defenses.

,

,

Abhay Sheshadri

,

Mikhail Seleznyov

,

,

,

,

,

Carlos Guestrin

,

CoRR, 2024

International Scientific Report on the Safety of Advanced AI (Interim Report).

CoRR, 2024

The Reality of AI and Biorisk.

,

,

,

,

,

,

,

,

,

,

Rishi Bommasani

,

,

CoRR, 2024

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks.

Nathalie Maria Kirch

,

,

CoRR, 2024

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience.

,

Jascha Achterberg

,

,

,

,

,

,

Ilia Sucholutsky

,

,

,

,

,

,

,

,

Grace W. Lindsay

CoRR, 2024

The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence.

,

Alexander K. Saeri

,

Emily A. C. Grundy

,

,

,

,

,

,

,

CoRR, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

Abhay Sheshadri

,

,

,

,

,

,

,

Asa Cooper Stickland

,

,

Dylan Hadfield-Menell

,

CoRR, 2024

Open Problems in Technical AI Governance.

CoRR, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.

CoRR, 2024

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability.

,

,

,

,

,

,

,

,

,

,

,

,

,

Jessica Rumbelow

,

Hieu Minh Nguyen

,

Dylan Hadfield-Menell

CoRR, 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training.

,

Lennart Schulze

,

,

Dylan Hadfield-Menell

CoRR, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs.

,

,

,

,

Dylan Hadfield-Menell

CoRR, 2024

Rethinking Machine Unlearning for Large Language Models.

,

,

,

,

Nathalie Baracaldo

,

,

,

,

,

Kush R. Varshney

,

,

,

CoRR, 2024

Black-Box Access is Insufficient for Rigorous AI Audits.

Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024

2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.

Trans. Mach. Learn. Res., 2023

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation.

,

Quentin Feuillade-Montixi

,

,

,

,

CoRR, 2023

Measuring the Success of Diffusion Models at Imitating Human Artists.

,

,

Shreya Mogulothu

,

Zachary Marinov

,

Chinmay Deshpande

,

,

,

Dylan Hadfield-Menell

CoRR, 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch.

,

,

,

,

Dylan Hadfield-Menell

CoRR, 2023

Benchmarking Interpretability Tools for Deep Neural Networks.

,

,

,

,

,

Dylan Hadfield-Menell

CoRR, 2023

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks.

,

,

,

Dylan Hadfield-Menell

Proceedings of the 2023 IEEE Conference on Secure and Trustworthy Machine Learning, 2023

Red Teaming Deep Neural Networks with Feature Synthesis Tools.

,

,

,

,

,

Kaivalya Hariharan

,

Dylan Hadfield-Menell

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

,

,

Dylan Hadfield-Menell

,

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

White-Box Adversarial Policies in Deep Reinforcement Learning.

,

Dylan Hadfield-Menell

,

Gabriel Kreiman

Proceedings of the Workshop on Artificial Intelligence Safety 2023 (SafeAI 2023) co-located with the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), 2023

2022

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks.

,

Kaivalya Hariharan

,

Dylan Hadfield-Menell

CoRR, 2022

Robust Feature-Level Adversaries are Interpretability Tools.

,

,

Dylan Hadfield-Menell

,

Gabriel Kreiman

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

2021

Detecting Modularity in Deep Neural Networks.

,

,

,

,

,

CoRR, 2021

One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features.

,

,

Gabriel Kreiman

CoRR, 2021

Clusterability in Neural Networks.

,

,

,

,

,

CoRR, 2021

Frivolous Units: Wider Networks Are Not Really That Wide.

,

,

Vanessa D'Amario

,

,

Martin Schrimpf

,

,

Gabriel Kreiman

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020

The Achilles Heel Hypothesis: Pitfalls for AI Systems via Decision Theoretic Adversaries.

CoRR, 2020

Probing Neural Dialog Models for Conversational Understanding.

Abdelrhman Saleh

,

,

,

Yonatan Belinkov

,

Stuart M. Shieber

CoRR, 2020

2019

Removable and/or Repeated Units Emerge in Overparametrized Deep Neural Networks.

,

,

Vanessa D'Amario

,

,

Martin Schrimpf

,

,

Gabriel Kreiman

CoRR, 2019
