2025
Decomposed Prototype Learning for Few-Shot Scene Graph Generation.
ACM Trans. Multim. Comput. Commun. Appl., January, 2025
ENCODE: Breaking the Trade-Off Between Performance and Efficiency in Long-Term User Behavior Modeling.
IEEE Trans. Knowl. Data Eng., January, 2025
Learning Combinatorial Prompts for Universal Controllable Image Captioning.
Int. J. Comput. Vis., January, 2025
From Easy to Hard: Learning Curricular Shape-Aware Features for Robust Panoptic Scene Graph Generation.
Int. J. Comput. Vis., January, 2025
Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation.
CoRR, January, 2025
2024
Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards.
ACM Trans. Multim. Comput. Commun. Appl., December, 2024
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future.
IEEE Trans. Pattern Anal. Mach. Intell., December, 2024
NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation.
IEEE Trans. Pattern Anal. Mach. Intell., October, 2024
CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention.
IEEE Trans. Pattern Anal. Mach. Intell., May, 2024
Label Semantic Knowledge Distillation for Unbiased Scene Graph Generation.
IEEE Trans. Circuits Syst. Video Technol., January, 2024
In Defense of Clip-Based Video Relation Detection.
IEEE Trans. Image Process., 2024
GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning.
IEEE Trans. Image Process., 2024
Learning Causal Transition Matrix for Instance-dependent Label Noise.
CoRR, 2024
Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing.
CoRR, 2024
IterIS: Iterative Inference-Solving Alignment for LoRA Merging.
CoRR, 2024
R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback.
CoRR, 2024
A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization.
CoRR, 2024
Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing.
CoRR, 2024
Event-Customized Image Generation.
CoRR, 2024
A Survey on Multimodal Benchmarks: In the Era of Large AI Models.
CoRR, 2024
Improving Diffusion-based Data Augmentation with Inversion Spherical Interpolation.
CoRR, 2024
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding.
CoRR, 2024
MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding.
CoRR, 2024
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation.
CoRR, 2024
Di²Pose: Discrete Diffusion Model for Occluded 3D Human Pose Estimation.
CoRR, 2024
FreeTuner: Any Subject in Any Style with Training-free Diffusion.
CoRR, 2024
Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation.
CoRR, 2024
Boundary and Relation Distillation for Semantic Segmentation.
CoRR, 2024
LLMs Can Evolve Continually on Modality for X-Modal Reasoning.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
Di²Pose: Discrete Diffusion Model for Occluded 3D Human Pose Estimation.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
The 2nd International Workshop on Deep Multi-modal Generation and Retrieval.
Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval, 2024
PROMOTE: Prior-Guided Diffusion Model with Global-Local Contrastive Learning for Exemplar-Based Image Translation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning.
Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024
ClothPPO: A Proximal Policy Optimization Enhancing Framework for Robotic Cloth Manipulation with Observation-Aligned Action Spaces.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024
MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
View-Consistent 3D Editing with Gaussian Splatting.
Proceedings of the Computer Vision - ECCV 2024, 2024
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism.
Proceedings of the Computer Vision - ECCV 2024, 2024
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning.
Proceedings of the Computer Vision - ECCV 2024, 2024
An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding.
Proceedings of the Computer Vision - ECCV 2024, 2024
Distributionally Generative Augmentation for Fair Facial Attribute Classification.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024
2023
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach.
ACM Trans. Multim. Comput. Commun. Appl., November, 2023
Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering.
IEEE Trans. Pattern Anal. Mach. Intell., November, 2023
Federated unsupervised representation learning.
Frontiers Inf. Technol. Electron. Eng., August, 2023
VL-NMS: Breaking Proposal Bottlenecks in Two-stage Visual-language Matching.
ACM Trans. Multim. Comput. Commun. Appl., 2023
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism.
CoRR, 2023
Compositional Zero-shot Learning via Progressive Language-based Observations.
CoRR, 2023
Video Referring Expression Comprehension via Transformer with Content-conditioned Query.
CoRR, 2023
MEDOE: A Multi-Expert Decoder and Output Ensemble Framework for Long-tailed Semantic Segmentation.
CoRR, 2023
Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs.
CoRR, 2023
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding.
CoRR, 2023
Decomposed Prototype Learning for Few-Shot Scene Graph Generation.
CoRR, 2023
Learning Combinatorial Prompts for Universal Controllable Image Captioning.
CoRR, 2023
Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Two Heads are Better Than One: A Simple Exploration Framework for Efficient Multi-Agent Reinforcement Learning.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Video Referring Expression Comprehension via Transformer with Content-conditioned Query.
Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, 2023
Discrepancy-Guided Reconstruction Learning for Image Forgery Detection.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023
Fairness-aware Contrastive Learning with Partially Annotated Sensitive Attributes.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
TempCLR: Temporal Alignment Representation with Contrastive Learning.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
Video Scene Graph Generation from Single-Frame Weak Supervision.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
Compositional Feature Augmentation for Unbiased Scene Graph Generation.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023
Iterative Proposal Refinement for Weakly-Supervised Video Grounding.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
Enhanced Chart Understanding via Visual Language Pre-training on Plot Table Pairs.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
2022
Deep Motion Prior for Weakly-Supervised Temporal Action Localization.
IEEE Trans. Image Process., 2022
Deep Learning for Weakly-Supervised Object Detection and Localization: A Survey.
Neurocomputing, 2022
MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding.
CoRR, 2022
Label Semantic Knowledge Distillation for Unbiased Scene Graph Generation.
CoRR, 2022
Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World.
CoRR, 2022
Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives.
CoRR, 2022
Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting.
CoRR, 2022
Respecting Transfer Gap in Knowledge Distillation.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Rethinking the Reference-based Distinctive Image Captioning.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
Correspondence Matters for Video Referring Expression Comprehension.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
Deconfounded Value Decomposition for Multi-Agent Reinforcement Learning.
Proceedings of the International Conference on Machine Learning, 2022
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention.
Proceedings of the Tenth International Conference on Learning Representations, 2022
Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
Weakly-Supervised Temporal Article Grounding.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
Explicit Image Caption Editing.
Proceedings of the Computer Vision - ECCV 2022, 2022
Rethinking Data Augmentation for Robust Visual Question Answering.
Proceedings of the Computer Vision - ECCV 2022, 2022
The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Few-Shot Object Detection with Fully Cross-Transformer.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Rethinking the Evaluation of Unbiased Scene Graph Generation.
Proceedings of the 33rd British Machine Vision Conference 2022, 2022
Rethinking the Two-Stage Framework for Grounded Situation Recognition.
Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022
2021
CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.
CoRR, 2021
Deep Learning for Weakly-Supervised Object Detection and Object Localization: A Survey.
CoRR, 2021
VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching.
CoRR, 2021
A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics.
CoRR, 2021
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric.
HUMA'21: Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis, 2021
Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based Explanation.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021
Video Relation Detection via Tracklet based Visual Transformer.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021
Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning.
Proceedings of the KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021
Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework.
Proceedings of the 38th International Conference on Machine Learning, 2021
Natural Language Video Localization with Learnable Moment Proposals.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021
On Pursuit of Designing Multi-modal Transformer for Video Grounding.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021
Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021
Optimizing Federated Learning on Non-IID Data Using Local Shapley Value.
Proceedings of the Artificial Intelligence - First CAAI International Conference, 2021
Boundary Proposal Network for Two-stage Natural Language Video Localization.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
2020
Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering.
Neural Process. Lett., 2020
Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding.
CoRR, 2020
Hierarchical Fashion Graph Network for Personalized Outfit Recommendation.
Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020
Counterfactual Samples Synthesizing for Robust Visual Question Answering.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020
Rethinking the Bottom-Up Framework for Query-Based Video Localization.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020
2019
Learning Using Privileged Information for Food Recognition.
Proceedings of the 27th ACM International Conference on Multimedia, 2019
Counterfactual Critic Multi-Agent Training for Scene Graph Generation.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019
DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019
2018
Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation.
CoRR, 2018
Zero-Shot Visual Recognition Using Semantics-Preserving Adversarial Embedding Networks.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018
2017
Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Network.
CoRR, 2017
Video Question Answering via Attribute-Augmented Attention Network Learning.
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017
SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017
2016
SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning.
CoRR, 2016