Stephen Casper

ORCID: 0000-0003-0084-1937

According to our database, Stephen Casper authored at least 28 papers between 2019 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience.
CoRR, 2024

The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence.
CoRR, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.
CoRR, 2024

Open Problems in Technical AI Governance.
CoRR, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
CoRR, 2024

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability.
CoRR, 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training.
CoRR, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs.
CoRR, 2024

Rethinking Machine Unlearning for Large Language Models.
CoRR, 2024


2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation.
CoRR, 2023

Measuring the Success of Diffusion Models at Imitating Human Artists.
CoRR, 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch.
CoRR, 2023

Benchmarking Interpretability Tools for Deep Neural Networks.
CoRR, 2023

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks.
Proceedings of the 2023 IEEE Conference on Secure and Trustworthy Machine Learning, 2023

Red Teaming Deep Neural Networks with Feature Synthesis Tools.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

White-Box Adversarial Policies in Deep Reinforcement Learning.
Proceedings of the Workshop on Artificial Intelligence Safety 2023 (SafeAI 2023) co-located with the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), 2023

2022
Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks.
CoRR, 2022

Robust Feature-Level Adversaries are Interpretability Tools.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

2021
Detecting Modularity in Deep Neural Networks.
CoRR, 2021

One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features.
CoRR, 2021

Clusterability in Neural Networks.
CoRR, 2021

Frivolous Units: Wider Networks Are Not Really That Wide.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020
The Achilles Heel Hypothesis: Pitfalls for AI Systems via Decision Theoretic Adversaries.
CoRR, 2020

Probing Neural Dialog Models for Conversational Understanding.
CoRR, 2020

2019
Removable and/or Repeated Units Emerge in Overparametrized Deep Neural Networks.
CoRR, 2019

