Neel Nanda

According to our database, Neel Nanda authored at least 37 papers between 2021 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Timeline

[Bar chart: annual publication counts, 2021–2024, broken down by type (Book, In proceedings, Article, PhD thesis, Dataset, Other). Chart omitted.]

Bibliography

2024
Universal Neurons in GPT2 Language Models.
Trans. Mach. Learn. Res., 2024

BatchTopK Sparse Autoencoders.
CoRR, 2024

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks.
CoRR, 2024

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models.
CoRR, 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.
CoRR, 2024

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders.
CoRR, 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders.
CoRR, 2024

Refusal in Language Models Is Mediated by a Single Direction.
CoRR, 2024

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control.
CoRR, 2024

Improving Dictionary Learning with Gated Sparse Autoencoders.
CoRR, 2024

How to use and interpret activation patching.
CoRR, 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components.
CoRR, 2024

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs.
CoRR, 2024

Confidence Regulation Neurons in Language Models.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Transcoders find interpretable LLM feature circuits.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Refusal in Language Models Is Mediated by a Single Direction.
Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

Explorations of Self-Repair in Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing.
Trans. Mach. Learn. Res., 2023

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023

Training Dynamics of Contextual N-Grams in Language Models.
CoRR, 2023

Linear Representations of Sentiment in Large Language Models.
CoRR, 2023

Copy Suppression: Comprehensively Understanding an Attention Head.
CoRR, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the International Conference on Machine Learning, 2023

Progress measures for grokking via mechanistic interpretability.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models.
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023

2022
Fully General Online Imitation Learning.
J. Mach. Learn. Res., 2022

In-context Learning and Induction Heads.
CoRR, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022

Predictability and Surprise in Large Generative Models.
CoRR, 2022
2021
An Empirical Investigation of Learning from Biased Toxicity Labels.
CoRR, 2021
