SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March 2025
Sparse Autoencoders Do Not Find Canonical Units of Analysis.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion.
Transactions on Machine Learning Research, 2024
LLM Circuit Analyses Are Consistent Across Training and Scale.
Advances in Neural Information Processing Systems 38 (NeurIPS 2024), 2024
Linear Representations of Sentiment in Large Language Models.
CoRR, 2023
Can Transformers Learn to Solve Problems Recursively?
CoRR, 2023