Mechanistic Anomaly Detection for "Quirky" Language Models.
CoRR, April, 2025
Examining Two Hop Reasoning Through Information Content Scaling.
CoRR, February, 2025
Slowing Learning by Erasing Simple Features.
CoRR, February, 2025
Converting MLPs into Polynomials in Closed Form.
CoRR, February, 2025
Partially Rewriting a Transformer in Natural Language.
CoRR, January, 2025
Transcoders Beat Sparse Autoencoders for Interpretability.
CoRR, January, 2025
Estimating the Probability of Sampling a Trained Neural Network at Random.
CoRR, January, 2025
Sparse Autoencoders Trained on the Same Data Learn Different Features.
CoRR, January, 2025
Do Transformer Interpretability Methods Transfer to RNNs?
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
Understanding Gradient Descent through the Training Jacobian.
CoRR, 2024
Refusal in LLMs is an Affine Function.
CoRR, 2024
Automatically Interpreting Millions of Features in Large Language Models.
CoRR, 2024
Balancing Label Quantity and Quality for Scalable Elicitation.
CoRR, 2024
Does Transformer Interpretability Transfer to RNNs?
CoRR, 2024
Neural Networks Learn Statistics of Increasing Complexity.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Eliciting Latent Knowledge from Quirky Language Models.
CoRR, 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens.
CoRR, 2023
LEACE: Perfect linear concept erasure in closed form.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Adversarial Policies Beat Superhuman Go AIs.
Proceedings of the International Conference on Machine Learning, 2023
imitation: Clean Imitation Learning Implementations.
CoRR, 2022
Adversarial Policies Beat Professional-Level Go AIs.
CoRR, 2022