Neel Nanda

According to our database, Neel Nanda authored at least 37 papers between 2021 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Timeline

[Bar chart: annual publication counts, 2021–2024, broken down by type (Book, In proceedings, Article, PhD thesis, Dataset, Other). Chart omitted.]

Bibliography

2024
Universal Neurons in GPT2 Language Models.
Trans. Mach. Learn. Res., 2024

BatchTopK Sparse Autoencoders.
CoRR, 2024

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks.
CoRR, 2024

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models.
CoRR, 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.
CoRR, 2024

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders.
CoRR, 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders.
CoRR, 2024

Refusal in Language Models Is Mediated by a Single Direction.
CoRR, 2024

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control.
CoRR, 2024

Improving Dictionary Learning with Gated Sparse Autoencoders.
CoRR, 2024

How to use and interpret activation patching.
CoRR, 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components.
CoRR, 2024

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs.
CoRR, 2024

Confidence Regulation Neurons in Language Models.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Transcoders find interpretable LLM feature circuits.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Refusal in Language Models Is Mediated by a Single Direction.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Explorations of Self-Repair in Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing.
Trans. Mach. Learn. Res., 2023

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023

Training Dynamics of Contextual N-Grams in Language Models.
CoRR, 2023

Linear Representations of Sentiment in Large Language Models.
CoRR, 2023

Copy Suppression: Comprehensively Understanding an Attention Head.
CoRR, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the International Conference on Machine Learning, 2023

Progress measures for grokking via mechanistic interpretability.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models.
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023

2022
Fully General Online Imitation Learning.
J. Mach. Learn. Res., 2022

In-context Learning and Induction Heads.
CoRR, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022

Predictability and Surprise in Large Generative Models.
CoRR, 2022
2021
An Empirical Investigation of Learning from Biased Toxicity Labels.
CoRR, 2021
