2025
RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models.
CoRR, May 2025

VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning.
CoRR, May 2025

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
CoRR, March 2025

FILP-3D: Enhancing 3D Few-Shot Class-Incremental Learning with Pre-Trained Vision-Language Models.
Pattern Recognit., 2025

2024
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition.
CoRR, 2024

2023
An Improved Baseline for Reasoning Segmentation with Large Language Model.
CoRR, 2023