Mikita Balesni

According to our database, Mikita Balesni authored at least 8 papers between 2023 and 2024.

Collaborative distances:

Dijkstra number of five.
Erdős number of four.

Bibliography

2024

Frontier Models are Capable of In-context Scheming.

CoRR, 2024

The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A->C.

Mikita Balesni

Tomasz Korbak

Owain Evans

CoRR, 2024

Towards evaluations-based safety cases for AI scheming.

Nicholas Goldowsky-Dill

CoRR, 2024

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack.

Leo McKee-Reid

Christoph Sträter

Maria Angelica Martinez

Joe Needham

Mikita Balesni

CoRR, 2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A".

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure.

Jérémy Scheurer

Mikita Balesni

Marius Hobbhahn

CoRR, 2023

Taken out of context: On measuring situational awareness in LLMs.

CoRR, 2023
