2025
Reasoning Models Don't Always Say What They Think.
CoRR, 2025
Rethinking machine unlearning for large language models.
Nat. Mac. Intell., 2025
Teaching Models to Balance Resisting and Accepting Persuasion.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025
System 1.x: Learning to Balance Fast and Slow Planning with Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
2024
Interpretable and Controllable Language Models.
PhD thesis, 2024
INSPIRE: Incorporating Diverse Feature Preferences in Recourse.
Trans. Mach. Learn. Res., 2024
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation.
Trans. Mach. Learn. Res., 2024
Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
Trans. Mach. Learn. Res., 2024
Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
Trans. Mach. Learn. Res., 2024
Are language models rational? The case of coherence norms and belief revision.
CoRR, 2024
LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models.
CoRR, 2024
Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
CoRR, 2024
Rethinking Machine Unlearning for Large Language Models.
CoRR, 2024
LACIE: Listener-Aware Finetuning for Calibration in Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023
Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Theory of Mind.
CoRR, 2023
Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Can Language Models Teach? Teacher Explanations Improve Student Performance via Personalization.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
2022
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
2021
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs.
CoRR, 2021
Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions.
CoRR, 2021
Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals.
CoRR, 2021
When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data.
CoRR, 2021
The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021
FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021
2020
Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
2019
Interpretable Image Recognition with Hierarchical Prototypes.
Proceedings of the Seventh AAAI Conference on Human Computation and Crowdsourcing, 2019
2018
Shall I Compare Thee to a Machine-Written Sonnet? An Approach to Algorithmic Sonnet Generation.
CoRR, 2018
1997
An User Adaptive Navigation Metaphor to Connect and Rate the Coherence of Terms and Complex Objects.
Proceedings of Hypertext '97, 1997