Andy Zou

According to our database, Andy Zou authored at least 22 papers between 2021 and 2024.

Bibliography

2024
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents.
CoRR, 2024

Tamper-Resistant Safeguards for Open-Weight LLMs.
CoRR, 2024

Improving Alignment and Robustness with Circuit Breakers.
CoRR, 2024

Lessons from the Trenches on Reproducible Evaluation of Language Models.
CoRR, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning.
CoRR, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.
Proceedings of the Forty-first International Conference on Machine Learning, 2024


2023
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.
Trans. Mach. Learn. Res., 2023

Representation Engineering: A Top-Down Approach to AI Transparency.
CoRR, 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models.
CoRR, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
CoRR, 2023

Papaya: Federated Learning, but Fully Decentralized.
CoRR, 2023

Scaling in Depth: Unlocking Robustness Certification on ImageNet.
CoRR, 2023

Unlocking Deterministic Robustness Certification on ImageNet.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark.
Proceedings of the International Conference on Machine Learning, 2023

2022
Forecasting Future World Events With Neural Networks.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Scaling Out-of-Distribution Detection for Real-World Settings.
Proceedings of the International Conference on Machine Learning, 2022

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
What Would Jiminy Cricket Do? Towards Agents That Behave Morally.
Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

Measuring Massive Multitask Language Understanding.
Proceedings of the 9th International Conference on Learning Representations, 2021

