Parameterized Synthetic Text Generation with SimpleStories.
CoRR, April, 2025
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition.
CoRR, January, 2025
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks.
CoRR, 2024
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability.
CoRR, 2024
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Interpreting Neural Networks through the Polytope Lens.
CoRR, 2022
Construction and Elicitation of a Black Box Model in the Game of Bridge.
CoRR, 2020