Ethan Perez

According to our database¹, Ethan Perez authored at least 55 papers between 2016 and 2024.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Timeline

[Chart: papers per year, 2016 to 2024, by publication type (book, in proceedings, article, PhD thesis, dataset, other).]

Bibliography

2024

Learning from Natural Language Feedback.

…, Jérémy Scheurer, Jon Ander Campos, …, Samuel R. Bowman, …

Trans. Mach. Learn. Res., 2024

Alignment faking in large language models.

CoRR, 2024

Best-of-N Jailbreaking.

…, Rylan Schaeffer, …

CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.

…, Rylan Schaeffer, Rajashree Agrawal, …

CoRR, 2024

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats.

…, Ansh Radhakrishnan, …

CoRR, 2024

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems.

Caspar Oesterheld, …, Linh Chi Nguyen, …

CoRR, 2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples.

CoRR, 2024

Sabotage Evaluations for Frontier Models.

…, Eric Christiansen, …, Holden Karnofsky, …, Samuel R. Bowman, …

CoRR, 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection.

Felix J. Binder, …

CoRR, 2024

Language Models Learn to Mislead Humans via RLHF.

…, Jacob Steinhardt, …, Samuel R. Bowman, …

CoRR, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

Abhay Sheshadri, …, Asa Cooper Stickland, …, Dylan Hadfield-Menell, …

CoRR, 2024

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Rylan Schaeffer, …, Cristóbal Eyzaguirre, …, Rajashree Agrawal, …

CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.

…, Monte MacDiarmid, …, Nicholas Schiefer, …, Samuel R. Bowman, …

CoRR, 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought.

…, Samuel R. Bowman, …

CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.

CoRR, 2024

Many-shot Jailbreaking.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Debating with More Persuasive LLMs Leads to More Truthful Answers.

…, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, …

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Towards Understanding Sycophancy in Language Models.

…, Samuel R. Bowman, …, Zac Hatfield-Dodds, Scott R. Johnston, …, Timothy Maxwell, …, Nicholas Schiefer, …

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.

…, Victoriano Montesinos, …

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

Inverse Scaling: When Bigger Isn't Better.

Trans. Mach. Learn. Res., 2023

Towards Evaluating AI Systems for Moral Status Using Self-Reports.

CoRR, 2023

Specific versus General Principles for Constitutional AI.

CoRR, 2023

Towards Understanding Sycophancy in Language Models.

…, Samuel R. Bowman, …, Zac Hatfield-Dodds, Scott R. Johnston, …, Timothy Maxwell, …, Nicholas Schiefer, …

CoRR, 2023

Studying Large Language Model Generalization with Influence Functions.

Roger B. Grosse, …, Amirhossein Tajdini, …, Kamile Lukosiute, …, Nicholas Joseph, …, Samuel R. Bowman

CoRR, 2023

Measuring Faithfulness in Chain-of-Thought Reasoning.

CoRR, 2023

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning.

CoRR, 2023

Training Language Models with Language Feedback at Scale.

Jérémy Scheurer, Jon Ander Campos, …

CoRR, 2023

Improving Code Generation by Training with Natural Language Feedback.

…, Jérémy Scheurer, …, Jon Ander Campos, …, Samuel R. Bowman, …

CoRR, 2023

The Capacity for Moral Self-Correction in Large Language Models.

CoRR, 2023

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.

…, Samuel R. Bowman

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Pretraining Language Models with Human Preferences.

…, Rasika Vinayak Bhalerao, Christopher L. Buckley, …, Samuel R. Bowman, …

Proceedings of the International Conference on Machine Learning, 2023

Discovering Language Model Behaviors with Model-Written Evaluations.

…, Kamile Lukosiute, …, Catherine Olsson, …, Saurav Kadavath, …, Cameron McKinnon, Christopher Olah, …, Eli Tran-Johnson, …, Jackson Kernion, …, Landon Goldberg, …, Michael Sellitto, …, Neerav Kingsland, …, Nicholas Joseph, …, Timothy Telleen-Lawton, …, Zac Hatfield-Dodds, …, Samuel R. Bowman, …, Danny Hernandez, …, Nicholas Schiefer, …

Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

Few-shot Adaptation Works with UnpredicTable Data.

…, Jérémy Scheurer, …

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022

Finding and Fixing Undesirable Behaviors in Pretrained Language Models.

PhD thesis, 2022

Discovering Language Model Behaviors with Model-Written Evaluations.

…, Kamile Lukosiute, …, Catherine Olsson, …, Saurav Kadavath, …, Cameron McKinnon, Christopher Olah, …, Eli Tran-Johnson, …, Jackson Kernion, …, Landon Goldberg, …, Michael Sellitto, …, Neerav Kingsland, …, Nicholas Joseph, …, Timothy Telleen-Lawton, …, Zac Hatfield-Dodds, …, Samuel R. Bowman, …, Danny Hernandez, …, Nicholas Schiefer, …

CoRR, 2022

Constitutional AI: Harmlessness from AI Feedback.

CoRR, 2022

Measuring Progress on Scalable Oversight for Large Language Models.

CoRR, 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.

CoRR, 2022

Language Models (Mostly) Know What They Know.

CoRR, 2022

Learning from Natural Language Feedback.

Jérémy Scheurer, Jon Ander Campos, …

CoRR, 2022

Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions.

…, Samuel R. Bowman

CoRR, 2022

Red Teaming Language Models with Language Models.

…, H. Francis Song, …, Geoffrey Irving

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

RL with KL penalties is better viewed as Bayesian inference.

…, Christopher L. Buckley

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

2021

True Few-Shot Learning with Language Models.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Rissanen Data Analysis: Examining Dataset Characteristics via Description Length.

Proceedings of the 38th International Conference on Machine Learning, 2021

Case-based Reasoning for Natural Language Queries over Knowledge Bases.

…, Lazaros Polymenakos, Andrew McCallum

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

2020

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

Patrick S. H. Lewis, …, Aleksandra Piktus, …, Vladimir Karpukhin, …, Heinrich Küttler, …, Tim Rocktäschel, Sebastian Riedel, …

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Unsupervised Question Decomposition for Question Answering.

…, Patrick S. H. Lewis, …

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

2019

Finding Generalizable Evidence by Learning to Convince Q&A Models.

…, Siddharth Karamcheti, …

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

ELI5: Long Form Question Answering.

Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019

2018

HoME: a Household Multimodal Environment.

…, Hugo Larochelle, Aaron C. Courville

Proceedings of the 6th International Conference on Learning Representations, 2018

Visual Reasoning with Multi-hop Feature Modulation.

…, Aaron C. Courville, Olivier Pietquin

Proceedings of the Computer Vision - ECCV 2018, 2018

FiLM: Visual Reasoning with a General Conditioning Layer.

…, Vincent Dumoulin, Aaron C. Courville

Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

2017

Learning Visual Reasoning Without Strong Priors.

…, Vincent Dumoulin, Aaron C. Courville

CoRR, 2017

2016

Semi-Supervised Learning with the Deep Rendering Mixture Model.

Minh Tan Nguyen, …, Richard G. Baraniuk, …

CoRR, 2016
