Evan Hubinger

According to our database, Evan Hubinger authored at least 14 papers between 2019 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Sabotage Evaluations for Frontier Models.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant.
CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
CoRR, 2024

Steering Llama 2 via Contrastive Activation Addition.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023
Studying Large Language Model Generalization with Influence Functions.
CoRR, 2023

Measuring Faithfulness in Chain-of-Thought Reasoning.
CoRR, 2023

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning.
CoRR, 2023

Conditioning Predictive Models: Risks and Strategies.
CoRR, 2023


2022
Discovering Language Model Behaviors with Model-Written Evaluations.
CoRR, 2022

Engineering Monosemanticity in Toy Models.
CoRR, 2022

2020
An overview of 11 proposals for building safe advanced AI.
CoRR, 2020

2019
Risks from Learned Optimization in Advanced Machine Learning Systems.
CoRR, 2019
