Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding.
CoRR, April, 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer.
CoRR, April, 2025
Grounding Multimodal Large Language Model in GUI World.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent.
CoRR, 2024
ViT-Lens: Towards Omni-modal Representations.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
PCCT: Progressive Class-Center Triplet Loss for Imbalanced Medical Image Classification.
IEEE J. Biomed. Health Informatics, April, 2023
ViT-Lens-2: Gateway to Omni-modal Intelligence.
CoRR, 2023
ViT-Lens: Towards Omni-modal Representations.
CoRR, 2023
Learning to Learn: How to Continuously Teach Humans and Machines.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.
CoRR, 2022
Learning to Learn: How to Continuously Teach Humans and Machines.
CoRR, 2022
PCCT: Progressive Class-Center Triplet Loss for Imbalanced Medical Image Classification.
CoRR, 2022
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant.
Findings of the Association for Computational Linguistics: EMNLP 2022, 2022
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval.
Computer Vision - ECCV 2022, 2022
Class-Center Involved Triplet Loss for Skin Disease Classification on Imbalanced Data.
Proceedings of the 17th IEEE International Symposium on Biomedical Imaging, 2020