Nora Belrose

According to our database, Nora Belrose authored at least 19 papers between 2022 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Examining Two Hop Reasoning Through Information Content Scaling.
CoRR, February, 2025

Slowing Learning by Erasing Simple Features.
CoRR, February, 2025

Converting MLPs into Polynomials in Closed Form.
CoRR, February, 2025

Partially Rewriting a Transformer in Natural Language.
CoRR, January, 2025

Transcoders Beat Sparse Autoencoders for Interpretability.
CoRR, January, 2025

Estimating the Probability of Sampling a Trained Neural Network at Random.
CoRR, January, 2025

Sparse Autoencoders Trained on the Same Data Learn Different Features.
CoRR, January, 2025

2024
Understanding Gradient Descent through the Training Jacobian.
CoRR, 2024

Refusal in LLMs is an Affine Function.
CoRR, 2024

Automatically Interpreting Millions of Features in Large Language Models.
CoRR, 2024

Balancing Label Quantity and Quality for Scalable Elicitation.
CoRR, 2024

Does Transformer Interpretability Transfer to RNNs?
CoRR, 2024

Neural Networks Learn Statistics of Increasing Complexity.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

2023
Eliciting Latent Knowledge from Quirky Language Models.
CoRR, 2023

Eliciting Latent Predictions from Transformers with the Tuned Lens.
CoRR, 2023

LEACE: Perfect linear concept erasure in closed form.
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Adversarial Policies Beat Superhuman Go AIs.
Proceedings of the International Conference on Machine Learning, 2023

2022
imitation: Clean Imitation Learning Implementations.
CoRR, 2022

Adversarial Policies Beat Professional-Level Go AIs.
CoRR, 2022

