Thomas Kwa

According to our database, Thomas Kwa authored at least 4 papers between 2020 and 2024.

Collaborative distances:
  • Dijkstra number of five.
  • Erdős number of four.

Bibliography

2024
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification.
CoRR, 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques.
CoRR, 2024

Compact Proofs of Model Performance via Mechanistic Interpretability.
CoRR, 2024

2020
Securing Smart Home Edge Devices against Compromised Cloud Servers.
CoRR, 2020

