BEARCUBS: A benchmark for computer-using web agents.
CoRR, March 2025
CLIPPER: Compression enables long-context synthetic data generation.
CoRR, February 2025
FABLES: Evaluating faithfulness and content selection in book-length summarization.
CoRR, 2024
BooookScore: A systematic exploration of book-length summarization in the era of LLMs.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
PostMark: A Robust Blackbox Watermark for Large Language Models.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
RankGen: Improving Text Generation with Large Ranking Models.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
RELiC: Retrieving Evidence for Literary Claims.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022