Andy Zou

According to our database, Andy Zou authored at least 22 papers between 2021 and 2024.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2024

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents.

Maksym Andriushchenko

CoRR, 2024

Tamper-Resistant Safeguards for Open-Weight LLMs.

CoRR, 2024

Improving Alignment and Robustness with Circuit Breakers.

Maksym Andriushchenko

CoRR, 2024

Lessons from the Trenches on Reproducible Evaluation of Language Models.

CoRR, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning.

Ann-Kathrin Dombrowski

Justin Tienken-Harder

Kallol Krishna Karmakar

Steven Basart

Stephen Fitz

Mindy Levine

Ponnurangam Kumaraguru

CoRR, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.

Bartlomiej Bojanowski

Christopher D. Manning

Daniel Moseguí González

Eunice Engefu Manyasi

Evgenii Zheltonozhskii

Fanyue Xia

Fatemeh Siar

Fernando Martínez-Plumed

Giambattista Parascandolo

Giorgio Mariani

Gloria Wang

Gonzalo Jaimovitch-López

Jaime Fernández Fisac

Jascha Sohl-Dickstein

José Hernández-Orallo

Karthik Gopalakrishnan

Lidia Contreras Ochando

Louis-Philippe Morency

María José Ramírez-Quintana

Michael I. Ivanitskiy

Neta Gur-Ari Krakover

Nitish Shirish Keskar

Pablo Antonio Moreno Casares

Pegah Alipoormolabashi

Shyamolima (Shammie) Debnath

Sneha Priscilla Makini

Yadollah Yaghoobzadeh

Trans. Mach. Learn. Res., 2023

Representation Engineering: A Top-Down Approach to AI Transparency.

CoRR, 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models.

CoRR, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.

CoRR, 2023

Papaya: Federated Learning, but Fully Decentralized.

CoRR, 2023

Scaling in Depth: Unlocking Robustness Certification on ImageNet.

CoRR, 2023

Unlocking Deterministic Robustness Certification on ImageNet.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark.

Proceedings of the International Conference on Machine Learning, 2023

2022

Forecasting Future World Events With Neural Networks.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Scaling Out-of-Distribution Detection for Real-World Settings.

Mohammadreza Mostajabi

Jacob Steinhardt

Dawn Song

Proceedings of the International Conference on Machine Learning, 2022

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

The Trojan Detection Challenge.

Proceedings of the NeurIPS 2022 Competition Track, 2021

What Would Jiminy Cricket Do? Towards Agents That Behave Morally.

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

Measuring Massive Multitask Language Understanding.

Proceedings of the 9th International Conference on Learning Representations, 2021
