2025
HCAST: Human-Calibrated Autonomy Software Tasks.
CoRR, March 2025

Measuring AI Ability to Complete Long Tasks.
CoRR, March 2025

2024
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy.
CoRR, 2024

2023
GPQA: A Graduate-Level Google-Proof Q&A Benchmark.
CoRR, 2023

Debate Helps Supervise Unreliable Experts.
CoRR, 2023

2021
Drusen segmentation with sparse volumetric SD-OCT sampling.
Proceedings of the Medical Imaging 2021: Image Processing, Online, February 15-19, 2021

Classification with Strategically Withheld Data.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020
Detecting age-related macular degeneration (AMD) biomarker images using MFCC and texture features.
Proceedings of the Medical Imaging 2020: Computer-Aided Diagnosis, 2020