Thomas Kwa
According to our database, Thomas Kwa authored at least 4 papers between 2020 and 2024.
Bibliography
2024
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification.
CoRR, 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques.
CoRR, 2024
2020