2025

Decomposed Prototype Learning for Few-Shot Scene Graph Generation.

ACM Trans. Multim. Comput. Commun. Appl., January, 2025

ENCODE: Breaking the Trade-Off Between Performance and Efficiency in Long-Term User Behavior Modeling.

IEEE Trans. Knowl. Data Eng., January, 2025

Learning Combinatorial Prompts for Universal Controllable Image Captioning.

Int. J. Comput. Vis., January, 2025

From Easy to Hard: Learning Curricular Shape-Aware Features for Robust Panoptic Scene Graph Generation.

Int. J. Comput. Vis., January, 2025

Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation.

CoRR, January, 2025

2024

Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards.

ACM Trans. Multim. Comput. Commun. Appl., December, 2024

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future.

IEEE Trans. Pattern Anal. Mach. Intell., December, 2024

NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation.

IEEE Trans. Pattern Anal. Mach. Intell., October, 2024

CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention.

IEEE Trans. Pattern Anal. Mach. Intell., May, 2024

Label Semantic Knowledge Distillation for Unbiased Scene Graph Generation.

IEEE Trans. Circuits Syst. Video Technol., January, 2024

In Defense of Clip-Based Video Relation Detection.

Roger Zimmermann

IEEE Trans. Image Process., 2024

GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning.

IEEE Trans. Image Process., 2024

Learning Causal Transition Matrix for Instance-dependent Label Noise.

CoRR, 2024

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing.

CoRR, 2024

IterIS: Iterative Inference-Solving Alignment for LoRA Merging.

CoRR, 2024

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback.

CoRR, 2024

A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization.

CoRR, 2024

Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing.

CoRR, 2024

Event-Customized Image Generation.

CoRR, 2024

A Survey on Multimodal Benchmarks: In the Era of Large AI Models.

CoRR, 2024

Improving Diffusion-based Data Augmentation with Inversion Spherical Interpolation.

CoRR, 2024

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding.

CoRR, 2024

MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding.

CoRR, 2024

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation.

CoRR, 2024

Di²Pose: Discrete Diffusion Model for Occluded 3D Human Pose Estimation.

CoRR, 2024

FreeTuner: Any Subject in Any Style with Training-free Diffusion.

CoRR, 2024

Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation.

Xiaoshuang Huang

CoRR, 2024

Boundary and Relation Distillation for Semantic Segmentation.

Kwang-Ting Cheng

CoRR, 2024

LLMs Can Evolve Continually on Modality for X-Modal Reasoning.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

Di²Pose: Discrete Diffusion Model for Occluded 3D Human Pose Estimation.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

The 2nd International Workshop on Deep Multi-modal Generation and Retrieval.

Roger Zimmermann

Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval, 2024

PROMOTE: Prior-Guided Diffusion Model with Global-Local Contrastive Learning for Exemplar-Based Image Translation.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning.

Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024

ClothPPO: A Proximal Policy Optimization Enhancing Framework for Robotic Cloth Manipulation with Observation-Aligned Action Spaces.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Mrtnet: Multi-Resolution Temporal Network for Video Sentence Grounding.

Roger Zimmermann

Proceedings of the IEEE International Conference on Acoustics, 2024

MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

View-Consistent 3D Editing with Gaussian Splatting.

Proceedings of the Computer Vision - ECCV 2024, 2024

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism.

Proceedings of the Computer Vision - ECCV 2024, 2024

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning.

Proceedings of the Computer Vision - ECCV 2024, 2024

An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding.

Proceedings of the Computer Vision - ECCV 2024, 2024

Distributionally Generative Augmentation for Fair Facial Attribute Classification.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities.

Hammad A. Ayyubi, Christopher Thomas

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach.

ACM Trans. Multim. Comput. Commun. Appl., November, 2023

Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering.

IEEE Trans. Pattern Anal. Mach. Intell., November, 2023

Federated unsupervised representation learning.

Frontiers Inf. Technol. Electron. Eng., August, 2023

VL-NMS: Breaking Proposal Bottlenecks in Two-stage Visual-language Matching.

ACM Trans. Multim. Comput. Commun. Appl., 2023

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism.

CoRR, 2023

Compositional Zero-shot Learning via Progressive Language-based Observations.

CoRR, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.

CoRR, 2023

MEDOE: A Multi-Expert Decoder and Output Ensemble Framework for Long-tailed Semantic Segmentation.

CoRR, 2023

Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs.

Christopher Thomas

CoRR, 2023

TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding.

CoRR, 2023

Decomposed Prototype Learning for Few-Shot Scene Graph Generation.

CoRR, 2023

Learning Combinatorial Prompts for Universal Controllable Image Captioning.

CoRR, 2023

Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Two Heads are Better Than One: A Simple Exploration Framework for Efficient Multi-Agent Reinforcement Learning.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query.

Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, 2023

Discrepancy-Guided Reconstruction Learning for Image Forgery Detection.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Fairness-aware Contrastive Learning with Partially Annotated Sensitive Attributes.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

TempCLR: Temporal Alignment Representation with Contrastive Learning.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Video Scene Graph Generation from Single-Frame Weak Supervision.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

Compositional Feature Augmentation for Unbiased Scene Graph Generation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models.

Hammad A. Ayyubi

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Iterative Proposal Refinement for Weakly-Supervised Video Grounding.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Enhanced Chart Understanding via Visual Language Pre-training on Plot Table Pairs.

Christopher Thomas

Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2022

Deep Motion Prior for Weakly-Supervised Temporal Action Localization.

Mike Zheng Shou

IEEE Trans. Image Process., 2022

Deep Learning for Weakly-Supervised Object Detection and Localization: A Survey.

Neurocomputing, 2022

MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding.

CoRR, 2022

Label Semantic Knowledge Distillation for Unbiased Scene Graph Generation.

CoRR, 2022

Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World.

Hammad A. Ayyubi, Christopher Thomas

CoRR, 2022

Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives.

CoRR, 2022

Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting.

CoRR, 2022

Respecting Transfer Gap in Knowledge Distillation.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Rethinking the Reference-based Distinctive Image Captioning.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Correspondence Matters for Video Referring Expression Comprehension.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Deconfounded Value Decomposition for Multi-Agent Reinforcement Learning.

Proceedings of the International Conference on Machine Learning, 2022

CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention.

Proceedings of the Tenth International Conference on Learning Representations, 2022

Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives.

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

Weakly-Supervised Temporal Article Grounding.

Christopher Thomas, Hammad A. Ayyubi

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

Explicit Image Caption Editing.

Proceedings of the Computer Vision - ECCV 2022, 2022

Rethinking Data Augmentation for Robust Visual Question Answering.

Proceedings of the Computer Vision - ECCV 2022, 2022

The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Few-Shot Object Detection with Fully Cross-Transformer.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Rethinking the Evaluation of Unbiased Scene Graph Generation.

Proceedings of the 33rd British Machine Vision Conference 2022, 2022

Rethinking the Two-Stage Framework for Grounded Situation Recognition.

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

CoRR, 2021

Deep Learning for Weakly-Supervised Object Detection and Object Localization: A Survey.

CoRR, 2021

VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching.

CoRR, 2021

A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics.

CoRR, 2021

A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric.

HUMA'21: Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis, 2021

Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based Explanation.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Video Relation Detection via Tracklet based Visual Transformer.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning.

Proceedings of the KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021

Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework.

Proceedings of the 38th International Conference on Machine Learning, 2021

Natural Language Video Localization with Learnable Moment Proposals.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding.

Mike Zheng Shou

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Optimizing Federated Learning on Non-IID Data Using Local Shapley Value.

Proceedings of the Artificial Intelligence - First CAAI International Conference, 2021

Boundary Proposal Network for Two-stage Natural Language Video Localization.

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding.

Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020

Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering.

Neural Process. Lett., 2020

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding.

CoRR, 2020

Hierarchical Fashion Graph Network for Personalized Outfit Recommendation.

Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020

Counterfactual Samples Synthesizing for Robust Visual Question Answering.

Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Rethinking the Bottom-Up Framework for Query-Based Video Localization.

Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019

Learning Using Privileged Information for Food Recognition.

Proceedings of the 27th ACM International Conference on Multimedia, 2019

Counterfactual Critic Multi-Agent Training for Scene Graph Generation.

Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization.

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

2018

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation.

CoRR, 2018

Zero-Shot Visual Recognition Using Semantics-Preserving Adversarial Embedding Networks.

Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017

Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Network.

CoRR, 2017

Video Question Answering via Attribute-Augmented Attention Network Learning.

Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017

SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning.

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

2016

SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning.

CoRR, 2016