2025
Visual Agentic Reinforcement Fine-Tuning.
CoRR, May, 2025
MM-IFEngine: Towards Multimodal Instruction Following.
CoRR, April, 2025
HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance.
CoRR, April, 2025
Visual-RFT: Visual Reinforcement Fine-Tuning.
CoRR, March, 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation.
CoRR, February, 2025
Maximum Entropy Reinforcement Learning with Diffusion Policy.
CoRR, February, 2025
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion.
CoRR, February, 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
CoRR, February, 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model.
CoRR, January, 2025
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
CoRR, January, 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning.
CoRR, January, 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction.
CoRR, January, 2025
2024
PersonMAE: Person Re-Identification Pre-Training With Masked AutoEncoders.
IEEE Trans. Multim., 2024
PointCAT: Contrastive Adversarial Training for Robust Point Cloud Recognition.
IEEE Trans. Image Process., 2024
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.
CoRR, 2024
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.
CoRR, 2024
Open-Sora Plan: Open-Source Large Video Generation Model.
CoRR, 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models.
CoRR, 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction.
CoRR, 2024
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.
CoRR, 2024
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.
CoRR, 2024
BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way.
CoRR, 2024
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.
CoRR, 2024
V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.
CoRR, 2024
MotionClone: Training-Free Motion Cloning for Controllable Video Generation.
CoRR, 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
CoRR, 2024
Bootstrap3D: Improving 3D Content Creation with Synthetic Data.
CoRR, 2024
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing.
CoRR, 2024
Unified Scene Representation and Reconstruction for 3D Large Language Models.
CoRR, 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
CoRR, 2024
InternLM2 Technical Report.
CoRR, 2024
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition.
CoRR, 2024
SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation.
CoRR, 2024
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models.
CoRR, 2024
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.
CoRR, 2024
How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites.
Sci. China Inf. Sci., 2024
Streaming Long Video Understanding with Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
Long-CLIP: Unlocking the Long-Text Capability of CLIP.
Proceedings of the Computer Vision - ECCV 2024, 2024
ShareGPT4V: Improving Large Multi-modal Models with Better Captions.
Proceedings of the Computer Vision - ECCV 2024, 2024
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
VIGC: Visual Instruction Generation and Correction.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024
2023
Feature Fusion Based Adversarial Example Detection Against Second-Round Adversarial Attacks.
IEEE Trans. Artif. Intell., October, 2023
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization.
CoRR, 2023
Emotional Listener Portrait: Neural Listener Head Generation with Emotion.
CoRR, 2023
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition.
CoRR, 2023
MLLM-DataEngine: An Iterative Refinement Approach for MLLM.
CoRR, 2023
Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
NTIRE 2023 Image Shadow Removal Challenge Report.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
Diversity-Aware Meta Visual Prompting.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023
2022
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet.
CoRR, 2022
PointCAT: Contrastive Adversarial Training for Robust Point Cloud Recognition.
CoRR, 2022
Protecting Celebrities with Identity Consistency Transformer.
CoRR, 2022
Adaptive Face Forgery Detection in Cross Domain.
Proceedings of the Computer Vision - ECCV 2022, 2022
RISPNet: A Network for Reversed Image Signal Processing.
Proceedings of the Computer Vision - ECCV 2022 Workshops, 2022
Bootstrapped Masked Autoencoders for Vision BERT Pretraining.
Proceedings of the Computer Vision - ECCV 2022, 2022
Shape-invariant 3D Adversarial Point Clouds.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Protecting Celebrities from DeepFake with Identity Consistency Transformer.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Mobile-Former: Bridging MobileNet and Transformer.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2021
Local Geometric Distortions Resilient Watermarking Scheme Based on Symmetry.
IEEE Trans. Circuits Syst. Video Technol., 2021
Adversarial steganography based on sparse cover enhancement.
J. Vis. Commun. Image Represent., 2021
Adversarial defense via self-orthogonal randomization super-network.
Neurocomputing, 2021
TACR-Net: Editing on Deep Video and Voice Portraits.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021
2020
Identity-Driven DeepFake Detection.
CoRR, 2020
GreedyFool: Distortion-Aware Sparse Adversarial Attack.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020
LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud Based Deep Networks.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020
Robust Superpixel-Guided Attentional Adversarial Attack.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020
Self-Robust 3D Point Recognition via Gather-Vector Guidance.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020
2019
Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019
2018
CAAD 2018: Powerful None-Access Black-Box Attack Based on Adversarial Transformation Network.
CoRR, 2018
2008
Microstructured optical fiber Bragg gratings and their applications.
Proceedings of the 2008 International Conference on Advanced Infocomm Technology, 2008