Zheng Shou

Orcid: 0000-0002-7681-2166

Affiliations:
  • National University of Singapore
  • Columbia University, New York, NY, USA (former)


According to our database, Zheng Shou authored at least 183 papers between 2016 and 2025.

Bibliography

2025
A large cross-modal video retrieval dataset with reading comprehension.
Pattern Recognit., 2025

2024
Managing Metaverse Data Tsunami: Actionable Insights.
IEEE Trans. Knowl. Data Eng., December, 2024

Continual Learning for Image Segmentation With Dynamic Query.
IEEE Trans. Circuits Syst. Video Technol., June, 2024

Enhancing Visual Grounding in Vision-Language Pre-Training With Position-Guided Text Prompts.
IEEE Trans. Pattern Anal. Mach. Intell., May, 2024

DR-FER: Discriminative and Robust Representation Learning for Facial Expression Recognition.
IEEE Trans. Multim., 2024

SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels.
Int. J. Comput. Vis., 2024

Skinned Motion Retargeting with Dense Geometric Interaction Perception.
CoRR, 2024

ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model.
CoRR, 2024

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.
CoRR, 2024

Image Watermarks are Removable Using Controllable Regeneration from Clean Noise.
CoRR, 2024

Unsupervised Prior Learning: Discovering Categorical Pose Priors from Videos.
CoRR, 2024

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos.
CoRR, 2024

High Quality Human Image Animation using Regional Supervision and Motion Blur Condition.
CoRR, 2024

DOTA: Distributional Test-Time Adaptation of Vision-Language Models.
CoRR, 2024

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation.
CoRR, 2024

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation.
CoRR, 2024

GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval.
CoRR, 2024

GUI Action Narrator: Where and When Did That Action Take Place?
CoRR, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.
CoRR, 2024

Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?
CoRR, 2024

WMAdapter: Adding WaterMark Control to Latent Diffusion Models.
CoRR, 2024

ProcessPainter: Learn Painting Process from Sequence Data.
CoRR, 2024

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning.
CoRR, 2024

Visual Perception by Large Language Model's Weights.
CoRR, 2024

Multi-Modal Generative Embedding Model.
CoRR, 2024

LOVA3: Learning to Visual Question Answering, Asking and Assessment.
CoRR, 2024

Hallucination of Multimodal Large Language Models: A Survey.
CoRR, 2024

Learning Long-form Video Prior via Generative Pre-Training.
CoRR, 2024

Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models.
CoRR, 2024

Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation.
CoRR, 2024

Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters.
CoRR, 2024

Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models.
CoRR, 2024

Towards A Better Metric for Text-to-Video Generation.
CoRR, 2024

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions.
CoRR, 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training.
CoRR, 2024

ProcessPainter: Learning to draw from sequence data.
Proceedings of the SIGGRAPH Asia 2024 Conference Papers, 2024

MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Spiking-Leaf: A Learnable Auditory Front-End for Spiking Neural Networks.
Proceedings of the IEEE International Conference on Acoustics, 2024

AssistGPT: Towards Multi-modal Agent for Human-Centric AI Assistant.
Proceedings of the 5th International Workshop on Human-centric Multimedia Analysis, 2024

GENIXER: Empowering Multimodal Large Language Model as a Powerful Data Generator.
Proceedings of the Computer Vision - ECCV 2024, 2024

MotionDirector: Motion Customization of Text-to-Video Diffusion Models.
Proceedings of the Computer Vision - ECCV 2024, 2024

Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images.
Proceedings of the Computer Vision - ECCV 2024, 2024

DragAnything: Motion Control for Anything Using Entity Representation.
Proceedings of the Computer Vision - ECCV 2024, 2024

Learning Video Context as Interleaved Multimodal Sequences.
Proceedings of the Computer Vision - ECCV 2024, 2024

Parrot Captions Teach CLIP to Spot Text.
Proceedings of the Computer Vision - ECCV 2024, 2024

RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-key Identification.
Proceedings of the Computer Vision - ECCV 2024, 2024

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Tune-an-Ellipse: CLIP Has Potential to Find what you Want.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

X-Adapter: Universal Compatibility of Plugins for Upgraded Diffusion Model.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VIT-LENS: Towards Omni-modal Representations.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Bootstrapping SparseFormers from Vision Foundation Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

AssistGUI: Task-Oriented PC Graphical User Interface Automation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VideoLLM-online: Online Video Large Language Model for Streaming Video.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
Magi-Net: Meta Negative Network for Early Activity Prediction.
IEEE Trans. Image Process., 2023

Parrot Captions Teach CLIP to Spot Text.
CoRR, 2023

ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors.
CoRR, 2023

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation.
CoRR, 2023

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator.
CoRR, 2023

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model.
CoRR, 2023

ColonNeRF: Neural Radiance Fields for High-Fidelity Long-Sequence Colonoscopy Reconstruction.
CoRR, 2023

MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes.
CoRR, 2023

MLLMs-Augmented Visual-Language Representation Learning.
CoRR, 2023

ViT-Lens-2: Gateway to Omni-modal Intelligence.
CoRR, 2023

Paragraph-to-Image Generation with Information-Enriched Diffusion Model.
CoRR, 2023

CVPR 2023 Text Guided Video Editing Competition.
CoRR, 2023

Integrating View Conditions for Image Synthesis.
CoRR, 2023

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing.
CoRR, 2023

MotionDirector: Motion Customization of Text-to-Video Diffusion Models.
CoRR, 2023

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation.
CoRR, 2023

Bridging Sensor Gaps via Single-Direction Tuning for Hyperspectral Image Classification.
CoRR, 2023

Dataset Condensation via Generative Model.
CoRR, 2023

ViT-Lens: Towards Omni-modal Representations.
CoRR, 2023

Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces.
CoRR, 2023

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks.
CoRR, 2023

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023.
CoRR, 2023

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter.
CoRR, 2023

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn.
CoRR, 2023

VisorGPT: Learning Visual Prior via Generative Pre-Training.
CoRR, 2023

Mover: Mask and Recovery based Facial Part Consistency Aware Method for Deepfake Video Detection.
CoRR, 2023

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video.
CoRR, 2023

Open-World Weakly-Supervised Object Localization.
CoRR, 2023

ICDAR 2023 Video Text Reading Competition for Dense and Small Text.
CoRR, 2023

Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning.
CoRR, 2023

Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm.
CoRR, 2023

DeepfakeMAE: Facial Part Consistency Aware Masked Autoencoder for Deepfake Video Detection.
CoRR, 2023

STPrivacy: Spatio-Temporal Tubelet Sparsification and Anonymization for Privacy-preserving Action Recognition.
CoRR, 2023

XAGen: 3D Expressive Human Avatars Generation.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Learning Visual Prior via Generative Pre-Training.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Object-centric Learning with Cyclic Walks between Parts and Whole.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Large Generative Models Meet Multimodal Video Intelligence.
Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications, 2023

PV3D: A 3D Generative Model for Portrait Video Generation.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

The Metaverse Data Deluge: What Can We Do About It?
Proceedings of the 39th IEEE International Conference on Data Engineering, 2023

ICDAR 2023 Competition on Video Text Reading for Dense and Small Text.
Proceedings of the Document Analysis and Recognition - ICDAR 2023, 2023

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Label-Efficient Online Continual Object Detection in Streaming Video.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Too Large; Data Reduction for Vision-Language Pre-Training.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Learning to Learn: How to Continuously Teach Humans and Machines.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

UniVTG: Towards Unified Video-Language Temporal Grounding.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Unsupervised Open-Vocabulary Object Localization in Videos.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Revisiting Vision Transformer from the View of Path Ensemble.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Position-Guided Text Prompt for Vision-Language Pre-Training.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

All in One: Exploring Unified Video-Language Pre-Training.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Affordance Grounding from Demonstration Video to Target Image.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

DOAD: Decoupled One Stage Action Detection Network.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Making Vision Transformers Efficient from A Token Sparsification View.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Darwinian Model Upgrades: Model Evolving with Selective Compatibility.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Video-Text Pre-training with Learned Regions for Retrieval.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Deep Motion Prior for Weakly-Supervised Temporal Action Localization.
IEEE Trans. Image Process., 2022

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.
CoRR, 2022

Position-guided Text Prompt for Vision-Language Pre-training.
CoRR, 2022

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis.
CoRR, 2022

Learning to Learn: How to Continuously Teach Humans and Machines.
CoRR, 2022

An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022.
CoRR, 2022

Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization.
CoRR, 2022

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022.
CoRR, 2022

Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.
CoRR, 2022

Sense The Physical, Walkthrough The Virtual, Manage The Metaverse: A Data-centric Perspective.
CoRR, 2022

Egocentric Video-Language Pretraining.
CoRR, 2022

Novel View Synthesis for High-fidelity Headshot Scenes.
CoRR, 2022

GEB+: A benchmark for generic event boundary captioning, grounding and text-based retrieval.
CoRR, 2022

Revitalize Region Feature for Democratizing Video-Language Pre-training.
CoRR, 2022

All in One: Exploring Unified Video-Language Pre-training.
CoRR, 2022

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Egocentric Video-Language Pretraining.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

AVA-AVD: Audio-visual Speaker Diarization in the Wild.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation.
Proceedings of the HCMA@MM 2022: Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, 2022

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning.
Proceedings of the Computer Vision - ECCV 2022, 2022

AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant.
Proceedings of the Computer Vision - ECCV 2022, 2022

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval.
Proceedings of the Computer Vision - ECCV 2022, 2022

Object-aware Video-language Pre-training for Retrieval.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Unified Transformer Tracker for Object Tracking.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022


2021
Video-Text Pre-training with Learned Regions.
CoRR, 2021

AssistSR: Affordance-centric Question-driven Video Segment Retrieval.
CoRR, 2021

AVA-AVD: Audio-visual Speaker Diarization in the Wild.
CoRR, 2021

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video.
CoRR, 2021

Ego4D: Around the World in 3,000 Hours of Egocentric Video.
CoRR, 2021

Generic Event Boundary Detection: A Benchmark for Event Segmentation.
CoRR, 2021

Is Someone Speaking?: Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Channel Augmented Joint Learning for Visible-Infrared Recognition.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Generic Event Boundary Detection: A Benchmark for Event Segmentation.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Searching for Two-Stream Models in Multivariate Space for Video Recognition.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization.
CoRR, 2020

SF-Net: Single-Frame Supervision for Temporal Action Localization.
Proceedings of the Computer Vision - ECCV 2020, 2020

2019
Deep Learning for Action Understanding in Video.
PhD thesis, 2019

LPAT: Learning to Predict Adaptive Threshold for Weakly-supervised Temporal Action Localization.
CoRR, 2019

CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation.
CoRR, 2019

DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018
AutoLoc: Weakly-supervised Temporal Action Localization.
CoRR, 2018

Online Action Detection in Untrimmed, Streaming Videos - Modeling and Evaluation.
CoRR, 2018

Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Online Detection of Action Start in Untrimmed, Streaming Videos.
Proceedings of the Computer Vision - ECCV 2018, 2018

AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos.
Proceedings of the Computer Vision - ECCV 2018, 2018

2017
ConvNet Architecture Search for Spatiotemporal Feature Learning.
CoRR, 2017

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

2016
EventNet Version 1.1 Technical Report.
CoRR, 2016

Action Temporal Localization in Untrimmed Videos via Multi-stage CNNs.
CoRR, 2016

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

