SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability.
CoRR, March 2025
Sparse Autoencoders Do Not Find Canonical Units of Analysis.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion.
Transactions on Machine Learning Research, 2024
LLM Circuit Analyses Are Consistent Across Training and Scale.
Advances in Neural Information Processing Systems 38 (NeurIPS 2024), 2024
Linear Representations of Sentiment in Large Language Models.
CoRR, 2023
Can Transformers Learn to Solve Problems Recursively?
CoRR, 2023