2025

Mechanistic Anomaly Detection for "Quirky" Language Models.

David Johnston

Arkajyoti Chakraborty

Nora Belrose

CoRR, April, 2025

Examining Two Hop Reasoning Through Information Content Scaling.

David Johnston

Nora Belrose

CoRR, February, 2025

Slowing Learning by Erasing Simple Features.

Lucia Quirke

Nora Belrose

CoRR, February, 2025

Converting MLPs into Polynomials in Closed Form.

Nora Belrose

Alice Rigg

CoRR, February, 2025

Partially Rewriting a Transformer in Natural Language.

Gonçalo Paulo

Nora Belrose

CoRR, January, 2025

Transcoders Beat Sparse Autoencoders for Interpretability.

Gonçalo Paulo

Stepan Shabalin

Nora Belrose

CoRR, January, 2025

Estimating the Probability of Sampling a Trained Neural Network at Random.

Adam Scherlis

Nora Belrose

CoRR, January, 2025

Sparse Autoencoders Trained on the Same Data Learn Different Features.

Gonçalo Paulo

Nora Belrose

CoRR, January, 2025

Do Transformer Interpretability Methods Transfer to RNNs?

Gonçalo Paulo

Thomas Marshall

Nora Belrose

Proceedings of AAAI-25, sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024

Understanding Gradient Descent through the Training Jacobian.

Nora Belrose

Adam Scherlis

CoRR, 2024

Refusal in LLMs is an Affine Function.

Thomas Marshall

Adam Scherlis

Nora Belrose

CoRR, 2024

Automatically Interpreting Millions of Features in Large Language Models.

CoRR, 2024

Balancing Label Quantity and Quality for Scalable Elicitation.

Alex Mallen

Nora Belrose

CoRR, 2024

Does Transformer Interpretability Transfer to RNNs?

Gonçalo Paulo

Thomas Marshall

Nora Belrose

CoRR, 2024

Neural Networks Learn Statistics of Increasing Complexity.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

2023

Eliciting Latent Knowledge from Quirky Language Models.

Alex Mallen

Nora Belrose

CoRR, 2023

Eliciting Latent Predictions from Transformers with the Tuned Lens.

CoRR, 2023

LEACE: Perfect linear concept erasure in closed form.

Nora Belrose

David Schneider-Joseph

Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Adversarial Policies Beat Superhuman Go AIs.

Proceedings of the International Conference on Machine Learning, 2023

2022

imitation: Clean Imitation Learning Implementations.

CoRR, 2022

Adversarial Policies Beat Professional-Level Go AIs.

CoRR, 2022