2025

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents.

Qianhui Wu

Kanzhi Cheng

CoRR, June, 2025

SITE: towards Spatial Intelligence Thorough Evaluation.

CoRR, May, 2025

Magma: A Foundation Model for Multimodal AI Agents.

CoRR, February, 2025

Latent Action Pretraining from Videos.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024

SAT: Spatial Aptitude Training for Multimodal Language Models.

CoRR, 2024

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models.

CoRR, 2024

Koala: Key Frame-Conditioned Long Video-LLM.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Socratis: Are large multimodal models emotionally aware?

CoRR, 2023

Multiscale Video Pretraining for Long-Term Activity Forecasting.

CoRR, 2023

EgoAdapt: A multi-stream evaluation study of adaptation to real-world egocentric user video.

CoRR, 2023

Language-Guided Audio-Visual Source Separation via Trimodal Consistency.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022

NewsStories: Illustrating Articles with Visual Summaries.

Proceedings of Computer Vision - ECCV 2022, 2022

2021

LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval.

Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos.

Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

2020

Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News.

Reuben Tan

Bryan A. Plummer

Kate Saenko

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

2019

wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval.

CoRR, 2019

Learning Similarity Conditions Without Explicit Supervision.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Language Features Matter: Effective Language Representations for Vision-Language Tasks.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019