2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding.
CoRR, April, 2025

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer.
CoRR, April, 2025

Grounding Multimodal Large Language Model in GUI World.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
ShowUI: One Vision-Language-Action Model for GUI Visual Agent.
CoRR, 2024

ViT-Lens: Towards Omni-modal Representations.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
PCCT: Progressive Class-Center Triplet Loss for Imbalanced Medical Image Classification.
IEEE J. Biomed. Health Informatics, April, 2023

ViT-Lens-2: Gateway to Omni-modal Intelligence.
CoRR, 2023

ViT-Lens: Towards Omni-modal Representations.
CoRR, 2023

Learning to Learn: How to Continuously Teach Humans and Machines.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2022
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.
CoRR, 2022

Learning to Learn: How to Continuously Teach Humans and Machines.
CoRR, 2022

PCCT: Progressive Class-Center Triplet Loss for Imbalanced Medical Image Classification.
CoRR, 2022

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant.
Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval.
Computer Vision - ECCV 2022, 2022

2020
Class-Center Involved Triplet Loss for Skin Disease Classification on Imbalanced Data.
Proceedings of the 17th IEEE International Symposium on Biomedical Imaging, 2020