Mikita Balesni
According to our database, Mikita Balesni authored at least 8 papers between 2023 and 2024.
Collaborative distances:
Timeline
[Chart: publications per year, 2023 to 2024. Legend: Book, In proceedings, Article, PhD thesis, Dataset, Other]
Bibliography
2024
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack.
CoRR, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Twelfth International Conference on Learning Representations, 2024
2023
Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure.
CoRR, 2023