2025
Reasoning Models Don't Always Say What They Think.
CoRR, 2025
Rethinking machine unlearning for large language models.
Nat. Mac. Intell., 2025
Teaching Models to Balance Resisting and Accepting Persuasion.
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025
System 1.x: Learning to Balance Fast and Slow Planning with Language Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
2024
Interpretable and Controllable Language Models.
PhD thesis, 2024
INSPIRE: Incorporating Diverse Feature Preferences in Recourse.
Trans. Mach. Learn. Res., 2024
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation.
Trans. Mach. Learn. Res., 2024
Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
Trans. Mach. Learn. Res., 2024
Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
Trans. Mach. Learn. Res., 2024
Are language models rational? The case of coherence norms and belief revision.
CoRR, 2024
LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models.
CoRR, 2024
Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
CoRR, 2024
Rethinking Machine Unlearning for Large Language Models.
CoRR, 2024
LACIE: Listener-Aware Finetuning for Calibration in Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023
Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Theory of Mind.
CoRR, 2023
Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Can Language Models Teach? Teacher Explanations Improve Student Performance via Personalization.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
2022
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
2021
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs.
CoRR, 2021
Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions.
CoRR, 2021
Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals.
CoRR, 2021
When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data.
CoRR, 2021
The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021
FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021
2020
Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
2019
Interpretable Image Recognition with Hierarchical Prototypes.
Proceedings of the Seventh AAAI Conference on Human Computation and Crowdsourcing, 2019
2018
Shall I Compare Thee to a Machine-Written Sonnet? An Approach to Algorithmic Sonnet Generation.
CoRR, 2018
1997
An User Adaptive Navigation Metaphor to Connect and Rate the Coherence of Terms and Complex Objects.
Proceedings of Hypertext '97, 1997