Physics of Skill Learning.
CoRR, January 2025
The Geometry of Concepts: Sparse Autoencoder Feature Structure.
CoRR, 2024
Efficient Dictionary Learning with Switch Sparse Autoencoders.
CoRR, 2024
Survival of the Fittest Representation: A Case Study with Modular Addition.
CoRR, 2024
Not All Language Model Features Are Linear.
CoRR, 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
CoRR, 2024
Opening the AI black box: program synthesis via mechanistic interpretability.
CoRR, 2024
Precision Machine Learning.
Entropy, January 2023
The Quantization Model of Neural Scaling.
Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023
Omnigrok: Grokking Beyond Algorithmic Data.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
Towards Understanding Grokking: An Effective Theory of Representation Learning.
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
Examining the Causal Structures of Deep Neural Networks Using Information Theory.
Entropy, 2020
Understanding Learned Reward Functions.
CoRR, 2020
Examining the causal structures of deep neural networks using information theory.
CoRR, 2020