2025
Trading Inference-Time Compute for Adversarial Robustness.
CoRR, 2025
2024
Deliberative Alignment: Reasoning Enables Safer Language Models.
CoRR, 2024
Predicting Emergent Capabilities by Finetuning.
CoRR, 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.
CoRR, 2024
Unfamiliar Finetuning Examples Control How Language Models Hallucinate.
CoRR, 2024
Privacy Side Channels in Machine Learning Systems.
Proceedings of the 33rd USENIX Security Symposium, 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Stealing part of a production language model.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
The False Promise of Imitating Proprietary Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
What Evidence Do Language Models Find Convincing?
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
2023
Scalable Extraction of Training Data from (Production) Language Models.
CoRR, 2023
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore.
CoRR, 2023
The False Promise of Imitating Proprietary LLMs.
CoRR, 2023
Extracting Training Data from Diffusion Models.
Proceedings of the 32nd USENIX Security Symposium, 2023
Poisoning Language Models During Instruction Tuning.
Proceedings of the 40th International Conference on Machine Learning, 2023
Large Language Models Struggle to Learn Long-Tail Knowledge.
Proceedings of the 40th International Conference on Machine Learning, 2023
Measuring Forgetting of Memorized Training Examples.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
InCoder: A Generative Model for Code Infilling and Synthesis.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
2022
Deduplicating Training Data Mitigates Privacy Risks in Language Models.
Proceedings of the 39th International Conference on Machine Learning, 2022
Analyzing Dynamic Adversarial Training Data in the Limit.
Findings of the Association for Computational Linguistics: ACL 2022, 2022
Automated Crossword Solving.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022
Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models.
Findings of the Association for Computational Linguistics: ACL 2022, 2022
2021
Calibrate Before Use: Improving Few-Shot Performance of Language Models.
CoRR, 2021
Extracting Training Data from Large Language Models.
Proceedings of the 30th USENIX Security Symposium, 2021
Detoxifying Language Models Risks Marginalizing Minority Voices.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
Concealed Data Poisoning Attacks on NLP Models.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
Calibrate Before Use: Improving Few-shot Performance of Language Models.
Proceedings of the 38th International Conference on Machine Learning, 2021
2020
Customizing Triggers with Concealed Data Poisoning.
CoRR, 2020
Trustworthy AI Inference Systems: An Industry Research View.
CoRR, 2020
Evaluating NLP Models via Contrast Sets.
CoRR, 2020
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.
CoRR, 2020
Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.
Proceedings of the 37th International Conference on Machine Learning, 2020
Gradient-based Analysis of NLP Models is Manipulable.
Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Imitation Attacks and Defenses for Black-box Machine Translation Systems.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020
Interpreting Predictions of NLP Models.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, 2020
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020
Evaluating Models' Local Decision Boundaries via Contrast Sets.
Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Pretrained Transformers Improve Out-of-Distribution Robustness.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
2019
Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples.
Transactions of the Association for Computational Linguistics, 2019
Universal Adversarial Triggers for NLP.
CoRR, 2019
Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation.
Proceedings of the 36th International Conference on Machine Learning, 2019
Do NLP Models Know Numbers? Probing Numeracy in Embeddings.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019
AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019
Universal Adversarial Triggers for Attacking and Analyzing NLP.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019
Compositional Questions Do Not Necessitate Multi-hop Reasoning.
Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019
Misleading Failures of Partial-input Baselines.
Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019
2018
Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions.
CoRR, 2018
Right Answer for the Wrong Reason: Discovery and Mitigation.
CoRR, 2018
Interpreting Neural Networks with Nearest Neighbors.
Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, 2018
Pathologies of Neural Models Make Interpretation Difficult.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions.
Proceedings of ACL 2018, Student Research Workshop, 2018