GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents.
CoRR, June, 2025
SITE: towards Spatial Intelligence Thorough Evaluation.
CoRR, May, 2025
Magma: A Foundation Model for Multimodal AI Agents.
CoRR, February, 2025
Latent Action Pretraining from Videos.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
SAT: Spatial Aptitude Training for Multimodal Language Models.
CoRR, 2024
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models.
CoRR, 2024
Koala: Key Frame-Conditioned Long Video-LLM.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
Socratis: Are large multimodal models emotionally aware?
CoRR, 2023
Multiscale Video Pretraining for Long-Term Activity Forecasting.
CoRR, 2023
EgoAdapt: A multi-stream evaluation study of adaptation to real-world egocentric user video.
CoRR, 2023
Language-Guided Audio-Visual Source Separation via Trimodal Consistency.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
NewsStories: Illustrating Articles with Visual Summaries.
Proceedings of the Computer Vision - ECCV 2022, 2022
LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020
wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval.
CoRR, 2019
Learning Similarity Conditions Without Explicit Supervision.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019
Language Features Matter: Effective Language Representations for Vision-Language Tasks.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019