2025

Visual Agentic Reinforcement Fine-Tuning.

Ziyu Liu

Yuhang Zang

CoRR, May, 2025

MM-IFEngine: Towards Multimodal Instruction Following.

CoRR, April, 2025

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance.

CoRR, April, 2025

Visual-RFT: Visual Reinforcement Fine-Tuning.

CoRR, March, 2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation.

CoRR, February, 2025

Maximum Entropy Reinforcement Learning with Diffusion Policy.

Xiaoyi Dong

Jian Cheng

Xi Sheryl Zhang

CoRR, February, 2025

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion.

CoRR, February, 2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

CoRR, February, 2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model.

CoRR, January, 2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

CoRR, January, 2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning.

CoRR, January, 2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction.

CoRR, January, 2025

2024

PersonMAE: Person Re-Identification Pre-Training With Masked AutoEncoders.

IEEE Trans. Multim., 2024

PointCAT: Contrastive Adversarial Training for Robust Point Cloud Recognition.

IEEE Trans. Image Process., 2024

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.

CoRR, 2024

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.

CoRR, 2024

Open-Sora Plan: Open-Source Large Video Generation Model.

CoRR, 2024

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models.

CoRR, 2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction.

CoRR, 2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.

CoRR, 2024

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.

CoRR, 2024

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way.

CoRR, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.

CoRR, 2024

V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results.

CoRR, 2024

MotionClone: Training-Free Motion Cloning for Controllable Video Generation.

CoRR, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.

CoRR, 2024

Bootstrap3D: Improving 3D Content Creation with Synthetic Data.

CoRR, 2024

ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing.

CoRR, 2024

Unified Scene Representation and Reconstruction for 3D Large Language Models.

CoRR, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

CoRR, 2024

InternLM2 Technical Report.

et al.

CoRR, 2024

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition.

CoRR, 2024

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation.

CoRR, 2024

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models.

CoRR, 2024

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.

CoRR, 2024

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites.

Sci. China Inf. Sci., 2024

Streaming Long Video Understanding with Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Long-CLIP: Unlocking the Long-Text Capability of CLIP.

Proceedings of the Computer Vision - ECCV 2024, 2024

ShareGPT4V: Improving Large Multi-modal Models with Better Captions.

Proceedings of the Computer Vision - ECCV 2024, 2024

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VIGC: Visual Instruction Generation and Correction.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Feature Fusion Based Adversarial Example Detection Against Second-Round Adversarial Attacks.

IEEE Trans. Artif. Intell., October, 2023

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization.

CoRR, 2023

Emotional Listener Portrait: Neural Listener Head Generation with Emotion.

CoRR, 2023

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition.

CoRR, 2023

MLLM-DataEngine: An Iterative Refinement Approach for MLLM.

CoRR, 2023

Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

NTIRE 2023 Image Shadow Removal Challenge Report.

Florin-Alexandru Vasluianu

Fredrik K. Gustafsson

Santosh Kumar Vipparthi

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Diversity-Aware Meta Visual Prompting.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet.

CoRR, 2022

PointCAT: Contrastive Adversarial Training for Robust Point Cloud Recognition.

CoRR, 2022

Protecting Celebrities with Identity Consistency Transformer.

CoRR, 2022

Adaptive Face Forgery Detection in Cross Domain.

Proceedings of the Computer Vision - ECCV 2022, 2022

RISPNet: A Network for Reversed Image Signal Processing.

Proceedings of the Computer Vision - ECCV 2022 Workshops, 2022

Bootstrapped Masked Autoencoders for Vision BERT Pretraining.

Proceedings of the Computer Vision - ECCV 2022, 2022

Shape-invariant 3D Adversarial Point Clouds.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Protecting Celebrities from DeepFake with Identity Consistency Transformer.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Mobile-Former: Bridging MobileNet and Transformer.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Local Geometric Distortions Resilient Watermarking Scheme Based on Symmetry.

IEEE Trans. Circuits Syst. Video Technol., 2021

Adversarial steganography based on sparse cover enhancement.

J. Vis. Commun. Image Represent., 2021

Adversarial defense via self-orthogonal randomization super-network.

Neurocomputing, 2021

TACR-Net: Editing on Deep Video and Voice Portraits.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

2020

Identity-Driven DeepFake Detection.

CoRR, 2020

GreedyFool: Distortion-Aware Sparse Adversarial Attack.

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud Based Deep Networks.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Robust Superpixel-Guided Attentional Adversarial Attack.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Self-Robust 3D Point Recognition via Gather-Vector Guidance.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019

Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

2018

CAAD 2018: Powerful None-Access Black-Box Attack Based on Adversarial Transformation Network.

Xiaoyi Dong

Weiming Zhang

Nenghai Yu

CoRR, 2018

2008

Microstructured optical fiber Bragg gratings and their applications.

Proceedings of the 2008 International Conference on Advanced Infocomm Technology, 2008