2025
Weight Ensembling Improves Reasoning in Language Models.
CoRR, April 2025

Overtrained Language Models Are Harder to Fine-Tune.
CoRR, March 2025

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images.
CoRR, February 2025

Task Generalization With AutoRegressive Compositional Structure: Can Learning From D Tasks Generalize to D^T Tasks?
CoRR, February 2025

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models.
CoRR, January 2025

From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

RNNs are not Transformers (Yet): The Key Bottleneck on In-Context Retrieval.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective.
CoRR, 2024

2023
Practically Solving LPN in High Noise Regimes Faster Using Neural Networks.
IACR Cryptol. ePrint Arch., 2023

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars.
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization.
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

Benign Overfitting in Classification: Provably Counter Label Noise with Larger Models.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

How Does Sharpness-Aware Minimization Minimize Sharpness?
Proceedings of the Eleventh International Conference on Learning Representations, 2023

2022
How Does Sharpness-Aware Minimization Minimize Sharpness?
CoRR, 2022

Realistic Deep Learning May Not Fit Benignly.
CoRR, 2022

On Transferability of Prompt Tuning for Natural Language Processing.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Finding Skill Neurons in Pre-trained Transformer-based Language Models.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022