Jacob Steinhardt

CoRR, 2024

Monitoring Latent World States in Language Models with Propositional Probes.

Jiahai Feng

Stuart Russell

CoRR, 2024

Adversaries Can Misuse Combinations of Safe Models.

Erik Jones

Anca D. Dragan

CoRR, 2024

Interpreting the Second-Order Effects of Neurons in CLIP.

Yossi Gandelsman

Alexei A. Efros

CoRR, 2024

Approaching Human-Level Forecasting with Language Models.

CoRR, 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Overthinking the Truth: Understanding how Language Models Process False Demonstrations.

Danny Halawi

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Interpreting CLIP's Image Representation via Text-Based Decomposition.

Yossi Gandelsman

Alexei A. Efros

Proceedings of the Twelfth International Conference on Learning Representations, 2024

How do Language Models Bind Entities in Context?

Jiahai Feng

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Describing Differences in Image Sets with Natural Language.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Learning Equilibria in Matching Markets with Bandit Feedback.

J. ACM, June, 2023

Incentivizing High-Quality Content in Online Recommender Systems.

CoRR, 2023

Eliciting Latent Predictions from Transformers with the Tuned Lens.

CoRR, 2023

Goal Driven Discovery of Distributional Differences via Language Descriptions.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Mass-Producing Failures of Multimodal Systems with Language Models.

Shengbang Tong

Erik Jones

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Supply-Side Equilibria in Recommender Systems.

Meena Jagadeesan

Nikhil Garg

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei

Nika Haghtalab

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations.

Yongyi Yang

Wei Hu

Proceedings of the International Conference on Machine Learning, 2023

Automatically Auditing Large Language Models via Discrete Optimization.

Proceedings of the International Conference on Machine Learning, 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Progress measures for grokking via mechanistic interpretability.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Discovering Latent Knowledge in Language Models Without Supervision.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws.

Kush Bhatia

Wenshuo Guo

Proceedings of the International Conference on Artificial Intelligence and Statistics, 2023

2022

Stronger data poisoning attacks break data sanitization defenses.

Pang Wei Koh

Mach. Learn., 2022

Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior.

CoRR, 2022

Summarizing Differences between Text Distributions with Natural Language.

CoRR, 2022

Forecasting Future World Events With Neural Networks.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Capturing Failures of Large Language Models via Human Cognitive Biases.

Erik Jones

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Describing Differences between Text Distributions with Natural Language.

Proceedings of the International Conference on Machine Learning, 2022

Predicting Out-of-Distribution Error with the Projection Norm.

Proceedings of the International Conference on Machine Learning, 2022

Scaling Out-of-Distribution Detection for Real-World Settings.

Mohammadreza Mostajabi

Dawn Song

Proceedings of the International Conference on Machine Learning, 2022

More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize.

Alexander Wei

Wei Hu

Proceedings of the International Conference on Machine Learning, 2022

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models.

Alexander Pan

Kush Bhatia

Proceedings of the Tenth International Conference on Learning Representations, 2022

A3D: Studying Pretrained Representations with Programmable Datasets.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

The Effect of Model Size on Worst-Group Generalization.

CoRR, 2021

Unsolved Problems in ML Safety.

CoRR, 2021

Grounding Representation Similarity with Statistical Testing.

Frances Ding

CoRR, 2021

Understanding Generalization in Adversarial Training via the Bias-Variance Decomposition.

CoRR, 2021

Approximating How Single Head Attention Learns.

CoRR, 2021

Technical perspective: Robust statistics tackle new problems.

Commun. ACM, 2021

Learning Equilibria in Matching Markets from Bandit Feedback.

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

What Would Jiminy Cricket Do? Towards Agents That Behave Morally.

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

Measuring Coding Challenge Competence With APPS.

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

Measuring Mathematical Problem Solving With the MATH Dataset.

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021

Grounding Representation Similarity Through Statistical Testing.

Frances Ding

Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Agnostic Learning with Unknown Utilities.

Proceedings of the 12th Innovations in Theoretical Computer Science Conference, 2021

Measuring Massive Multitask Language Understanding.

Proceedings of the 9th International Conference on Learning Representations, 2021

Aligning AI With Shared Human Values.

Proceedings of the 9th International Conference on Learning Representations, 2021

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Natural Adversarial Examples.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Limitations of Post-Hoc Feature Alignment for Robustness.

Collin Burns

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level.

Proceedings of the Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, 2021

2020

Robust estimation via generalized quasi-gradients.

Banghua Zhu

Jiantao Jiao

CoRR, 2020

Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming.

Sumanth Dathathri

Krishnamurthy Dvijotham

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

When does the Tukey Median work?

Banghua Zhu

Jiantao Jiao

Proceedings of the IEEE International Symposium on Information Theory, 2020

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks.

Proceedings of the 37th International Conference on Machine Learning, 2020

Identifying Statistical Bias in Dataset Replication.

Proceedings of the 37th International Conference on Machine Learning, 2020

2019

Troubling Trends in Machine Learning Scholarship.

Zachary C. Lipton

ACM Queue, 2019

FrAngel: component-based synthesis with control structures.

Kensen Shi

Proc. ACM Program. Lang., 2019

A Benchmark for Anomaly Segmentation.

Dan Hendrycks

Steven Basart

Mantas Mazeika

Mohammadreza Mostajabi

Dawn Song

CoRR, 2019

Generalized Resilience and Robust Statistics.

Banghua Zhu

Jiantao Jiao

CoRR, 2019

Testing Robustness Against Unforeseen Adversaries.

CoRR, 2019

Transfer of Adversarial Robustness Between Perturbation Types.

CoRR, 2019

Research for practice: troubling trends in machine-learning scholarship.

Zachary C. Lipton

Commun. ACM, 2019

Sever: A Robust Meta-Algorithm for Stochastic Optimization.

Proceedings of the 36th International Conference on Machine Learning, 2019

2018

Robust learning: information theory and algorithms.

PhD thesis, 2018

The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation.

CoRR, 2018

Robust moment estimation and improved clustering via sum of squares.

Pravesh K. Kothari

David Steurer

Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018

Semidefinite relaxations for certifying robustness to adversarial examples.

Aditi Raghunathan

Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers.

Moses Charikar

Proceedings of the 9th Innovations in Theoretical Computer Science Conference, 2018

Certified Defenses against Adversarial Examples.

Aditi Raghunathan

Proceedings of the 6th International Conference on Learning Representations, 2018

2017

Does robustness imply tractability? A lower bound for planted clique in the semi-random model.

Electron. Colloquium Comput. Complex., 2017

Better Agnostic Clustering Via Relaxed Tensor Norms.

Pravesh K. Kothari

CoRR, 2017

Learning from untrusted data.

Moses Charikar

Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017

Certified Defenses for Data Poisoning Attacks.

Pang Wei Koh

Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017

2016

Concrete Problems in AI Safety.

CoRR, 2016

Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction.

Moses Charikar

Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016

Unsupervised Risk Estimation Using Only Conditional Independence Structure.

Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016

2015

Memory, Communication, and Statistical Queries.

Stefan Wager

Electron. Colloquium Comput. Complex., 2015

Learning with Relaxed Supervision.

Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, 2015

Learning Fast-Mixing Models for Structured Prediction.

Proceedings of the 32nd International Conference on Machine Learning, 2015

Reified Context Models.

Proceedings of the 32nd International Conference on Machine Learning, 2015

Minimax rates for memory-bounded sparse linear regression.

John C. Duchi

Proceedings of The 28th Conference on Learning Theory, 2015

Learning Where to Sample in Structured Prediction.

Tianlin Shi

Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015

2014

The Statistics of Streaming Sparse Regression.

Stefan Wager

CoRR, 2014

Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm.

Proceedings of the 31st International Conference on Machine Learning, 2014

Filtering with Abstract Particles.

Proceedings of the 31st International Conference on Machine Learning, 2014

2012

Flexible Martingale Priors for Deep Hierarchies.

Zoubin Ghahramani

Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012

Finite-time regional verification of stochastic non-linear systems.

Russ Tedrake

Int. J. Robotics Res., 2012

2011

Finite-Time Regional Verification of Stochastic Nonlinear Systems.

Russ Tedrake

Proceedings of the Robotics: Science and Systems VII, 2011

2010

Permutations with Ascending and Descending Blocks.

Electron. J. Comb., 2010

2009

On Coloring the Odd-Distance Graph.
