Scaling Laws For Scalable Oversight.
CoRR, April, 2025
Towards Understanding Distilled Reasoning Models: A Representational Approach.
CoRR, March, 2025
Harmonic Loss Trains Interpretable AI Models.
CoRR, February, 2025
The Geometry of Concepts: Sparse Autoencoder Feature Structure.
CoRR, 2024
Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning.
CoRR, 2024
GenEFT: Understanding Statics and Dynamics of Model Generalization via Effective Theory.
CoRR, 2024