RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models.
CoRR, May, 2025
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning.
CoRR, May, 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
CoRR, March, 2025
FILP-3D: Enhancing 3D few-shot class-incremental learning with pre-trained vision-language models.
Pattern Recognit., 2025
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition.
CoRR, 2024
An Improved Baseline for Reasoning Segmentation with Large Language Model.
CoRR, 2023