2025
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents.
CoRR, June, 2025

SITE: towards Spatial Intelligence Thorough Evaluation.
CoRR, May, 2025

Magma: A Foundation Model for Multimodal AI Agents.
CoRR, February, 2025

Latent Action Pretraining from Videos.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
SAT: Spatial Aptitude Training for Multimodal Language Models.
CoRR, 2024

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models.
CoRR, 2024

Koala: Key Frame-Conditioned Long Video-LLM.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
Socratis: Are large multimodal models emotionally aware?
CoRR, 2023

Multiscale Video Pretraining for Long-Term Activity Forecasting.
CoRR, 2023

EgoAdapt: A multi-stream evaluation study of adaptation to real-world egocentric user video.
CoRR, 2023

Language-Guided Audio-Visual Source Separation via Trimodal Consistency.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022
NewsStories: Illustrating Articles with Visual Summaries.
Proceedings of Computer Vision - ECCV 2022, 2022

2021
LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos.
Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

2020
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

2019
wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval.
CoRR, 2019

Learning Similarity Conditions Without Explicit Supervision.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Language Features Matter: Effective Language Representations for Vision-Language Tasks.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019