Stephen Casper
Orcid: 0000-0003-0084-1937
According to our database, Stephen Casper authored at least 28 papers between 2019 and 2024.
Bibliography
2024
Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience.
CoRR, 2024
The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence.
CoRR, 2024
Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.
CoRR, 2024
CoRR, 2024
The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability.
CoRR, 2024
CoRR, 2024
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024
2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation.
CoRR, 2023
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks.
Proceedings of the 2023 IEEE Conference on Secure and Trustworthy Machine Learning, 2023
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
Proceedings of the Workshop on Artificial Intelligence Safety 2023 (SafeAI 2023) co-located with the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), 2023
2022
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
2021
One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features.
CoRR, 2021
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
2020
The Achilles Heel Hypothesis: Pitfalls for AI Systems via Decision Theoretic Adversaries.
CoRR, 2020
2019
CoRR, 2019