Neel Nanda
According to our database, Neel Nanda authored at least 37 papers between 2021 and 2024.
Timeline
[Chart: publications per year, 2021 to 2024, by publication type]
Legend: Book, In proceedings, Article, PhD thesis, Dataset, Other
Bibliography
2024
CoRR, 2024
CoRR, 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control.
CoRR, 2024
CoRR, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
2023
Trans. Mach. Learn. Res., 2023
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching.
CoRR, 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla.
CoRR, 2023
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023
A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations.
Proceedings of the International Conference on Machine Learning, 2023
Proceedings of the Eleventh International Conference on Learning Representations, 2023
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023
2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
CoRR, 2022
Proceedings of the FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, June 21, 2022
2021