2025

Parameterized Synthetic Text Generation with SimpleStories.

Lennart Finke

Thomas Dooms

CoRR, April, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition.

CoRR, January, 2025

2024

Towards evaluations-based safety cases for AI scheming.

Nicholas Goldowsky-Dill

CoRR, 2024

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks.

Lucius Bushnaq

Stefan Heimersheim

Nicholas Goldowsky-Dill

CoRR, 2024

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability.

Nicholas Goldowsky-Dill

Kaarel Hänni

Cindy Wu

Marius Hobbhahn

CoRR, 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning.

Dan Braun

Jordan Taylor

Nicholas Goldowsky-Dill

Lee Sharkey

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

2022

Interpreting Neural Networks through the Polytope Lens.

CoRR, 2022

2020

Construction and Elicitation of a Black Box Model in the Game of Bridge.

Véronique Ventos

Daniel Braun

Colin Deheeger

Jean Pierre Desmoulins

CoRR, 2020

2013

Structural learning.

Dan Braun

Scholarpedia, 2013