Improving Steering Vectors by Targeting Sparse Autoencoder Features.
CoRR, 2024
Applying sparse autoencoders to unlearn knowledge in language models.
CoRR, 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.
CoRR, 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders.
CoRR, 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders.
CoRR, 2024
Improving Dictionary Learning with Gated Sparse Autoencoders.
CoRR, 2024
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Stealing part of a production language model.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Attribution Patching Outperforms Automated Circuit Discovery.
CoRR, 2023
Copy Suppression: Comprehensively Understanding an Attention Head.
CoRR, 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
StyleGAN-Induced Data-Driven Regularization for Inverse Problems.
Proceedings of the IEEE International Conference on Acoustics, 2022