Unified Multimodal Understanding via Byte-Pair Visual Encoding.
CoRR, June, 2025
RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control.
CoRR, June, 2025
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining.
CoRR, March, 2025
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning.
CoRR, March, 2025
Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
VideoOrion: Tokenizing Object Dynamics in Videos.
CoRR, 2024
Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models.
CoRR, 2024
QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds.
CoRR, 2024
EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?
CoRR, 2024
SPAFormer: Sequential 3D Part Assembly with Transformers.
CoRR, 2024
LLaMA-Rider: Spurring Large Language Models to Explore the Open World.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024
Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
UniCode: Learning a Unified Codebook for Multimodal Large Language Models.
Proceedings of the Computer Vision - ECCV 2024, 2024
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection.
CoRR, 2023
POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World.
Proceedings of the 31st ACM International Conference on Multimedia, 2023
Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos.
Proceedings of the IEEE International Conference on Consumer Electronics, 2023
Open-Category Human-Object Interaction Pre-training via Language Modeling Framework.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
Accommodating Audio Modality in CLIP for Multimodal Processing.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023
Exploring Anchor-based Detection for Ego4D Natural Language Query.
CoRR, 2022
Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning.
Proceedings of the Computer Vision - ECCV 2022, 2022
VRDFormer: End-to-End Video Visual Relation Detection with Transformers.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
MR imaging for the quantitative assessment of brain iron in aceruloplasminemia: A postmortem validation study.
NeuroImage, 2021
Skeleton-Based Interactive Graph Network For Human Object Interaction Detection.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2020
Visual Relation Detection with Multi-Level Attention.
Proceedings of the 27th ACM International Conference on Multimedia, 2019
Relation Understanding in Videos.
Proceedings of the 27th ACM International Conference on Multimedia, 2019