Zheng Shou

Orcid: 0000-0002-7681-2166

Affiliations:

National University of Singapore
Columbia University, New York, NY, USA (former)

According to our database, Zheng Shou authored at least 193 papers between 2016 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2025

A large cross-modal video retrieval dataset with reading comprehension.

Pattern Recognit., 2025

2024

Managing Metaverse Data Tsunami: Actionable Insights.

IEEE Trans. Knowl. Data Eng., December, 2024

Continual Learning for Image Segmentation With Dynamic Query.

IEEE Trans. Circuits Syst. Video Technol., June, 2024

Enhancing Visual Grounding in Vision-Language Pre-Training With Position-Guided Text Prompts.

IEEE Trans. Pattern Anal. Mach. Intell., May, 2024

DR-FER: Discriminative and Robust Representation Learning for Facial Expression Recognition.

IEEE Trans. Multim., 2024

SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels.

Int. J. Comput. Vis., 2024

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity.

Yiren Song

Xiaokang Liu

Mike Zheng Shou

CoRR, 2024

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation.

CoRR, 2024

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting.

CoRR, 2024

Anti-Reference: Universal and Immediate Defense Against Reference-Based Generation.

CoRR, 2024

ROICtrl: Boosting Instance Control for Visual Generation.

CoRR, 2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

CoRR, 2024

Factorized Visual Tokenization and Generation.

CoRR, 2024

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation.

CoRR, 2024

FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data.

CoRR, 2024

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.

CoRR, 2024

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning.

CoRR, 2024

Skinned Motion Retargeting with Dense Geometric Interaction Perception.

CoRR, 2024

ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model.

CoRR, 2024

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.

CoRR, 2024

Image Watermarks are Removable Using Controllable Regeneration from Clean Noise.

CoRR, 2024

Unsupervised Prior Learning: Discovering Categorical Pose Priors from Videos.

CoRR, 2024

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos.

CoRR, 2024

High Quality Human Image Animation using Regional Supervision and Motion Blur Condition.

CoRR, 2024

DOTA: Distributional Test-Time Adaptation of Vision-Language Models.

CoRR, 2024

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation.

CoRR, 2024

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation.

CoRR, 2024

GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval.

CoRR, 2024

GUI Action Narrator: Where and When Did That Action Take Place?

CoRR, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.

CoRR, 2024

Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

CoRR, 2024

WMAdapter: Adding WaterMark Control to Latent Diffusion Models.

CoRR, 2024

ProcessPainter: Learn Painting Process from Sequence Data.

CoRR, 2024

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning.

CoRR, 2024

Visual Perception by Large Language Model's Weights.

CoRR, 2024

Multi-Modal Generative Embedding Model.

CoRR, 2024

LOVA3: Learning to Visual Question Answering, Asking and Assessment.

CoRR, 2024

Hallucination of Multimodal Large Language Models: A Survey.

CoRR, 2024

Learning Long-form Video Prior via Generative Pre-Training.

CoRR, 2024

Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models.

CoRR, 2024

Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation.

CoRR, 2024

Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters.

CoRR, 2024

Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models.

CoRR, 2024

Towards A Better Metric for Text-to-Video Generation.

CoRR, 2024

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions.

CoRR, 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training.

CoRR, 2024

ProcessPainter: Learning to draw from sequence data.

Proceedings of the SIGGRAPH Asia 2024 Conference Papers, 2024

MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Spiking-Leaf: A Learnable Auditory Front-End for Spiking Neural Networks.

Proceedings of the IEEE International Conference on Acoustics, 2024

AssistGPT: Towards Multi-modal Agent for Human-Centric AI Assistant.

Proceedings of the 5th International Workshop on Human-centric Multimedia Analysis, 2024

GENIXER: Empowering Multimodal Large Language Model as a Powerful Data Generator.

Henry Hengyuan Zhao

Pan Zhou

Mike Zheng Shou

Proceedings of the Computer Vision - ECCV 2024, 2024

MotionDirector: Motion Customization of Text-to-Video Diffusion Models.

Proceedings of the Computer Vision - ECCV 2024, 2024

Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images.

Proceedings of the Computer Vision - ECCV 2024, 2024

DragAnything: Motion Control for Anything Using Entity Representation.

Proceedings of the Computer Vision - ECCV 2024, 2024

Learning Video Context as Interleaved Multimodal Sequences.

Proceedings of the Computer Vision - ECCV 2024, 2024

Parrot Captions Teach CLIP to Spot Text.

Proceedings of the Computer Vision - ECCV 2024, 2024

RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-key Identification.

Proceedings of the Computer Vision - ECCV 2024, 2024

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Tune-an-Ellipse: CLIP Has Potential to Find what you Want.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

X-Adapter: Universal Compatibility of Plugins for Upgraded Diffusion Model.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VIT-LENS: Towards Omni-modal Representations.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Bootstrapping SparseFormers from Vision Foundation Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

AssistGUI: Task-Oriented PC Graphical User Interface Automation.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

VideoLLM-online: Online Video Large Language Model for Streaming Video.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023

Magi-Net: Meta Negative Network for Early Activity Prediction.

IEEE Trans. Image Process., 2023

ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors.

CoRR, 2023

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation.

CoRR, 2023

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator.

Henry Hengyuan Zhao

Pan Zhou

Mike Zheng Shou

CoRR, 2023

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model.

CoRR, 2023

ColonNeRF: Neural Radiance Fields for High-Fidelity Long-Sequence Colonoscopy Reconstruction.

CoRR, 2023

MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes.

Bardienus Pieter Duisterhof

CoRR, 2023

MLLMs-Augmented Visual-Language Representation Learning.

CoRR, 2023

ViT-Lens-2: Gateway to Omni-modal Intelligence.

CoRR, 2023

Paragraph-to-Image Generation with Information-Enriched Diffusion Model.

CoRR, 2023

CVPR 2023 Text Guided Video Editing Competition.

CoRR, 2023

Integrating View Conditions for Image Synthesis.

CoRR, 2023

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing.

CoRR, 2023

MotionDirector: Motion Customization of Text-to-Video Diffusion Models.

CoRR, 2023

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation.

CoRR, 2023

Bridging Sensor Gaps via Single-Direction Tuning for Hyperspectral Image Classification.

CoRR, 2023

Dataset Condensation via Generative Model.

CoRR, 2023

ViT-Lens: Towards Omni-modal Representations.

CoRR, 2023

Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces.

CoRR, 2023

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks.

CoRR, 2023

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023.

CoRR, 2023

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter.

CoRR, 2023

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn.

CoRR, 2023

VisorGPT: Learning Visual Prior via Generative Pre-Training.

CoRR, 2023

Mover: Mask and Recovery based Facial Part Consistency Aware Method for Deepfake Video Detection.

CoRR, 2023

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video.

CoRR, 2023

Open-World Weakly-Supervised Object Localization.

CoRR, 2023

ICDAR 2023 Video Text Reading Competition for Dense and Small Text.

CoRR, 2023

Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning.

CoRR, 2023

Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm.

CoRR, 2023

DeepfakeMAE: Facial Part Consistency Aware Masked Autoencoder for Deepfake Video Detection.

CoRR, 2023

STPrivacy: Spatio-Temporal Tubelet Sparsification and Anonymization for Privacy-preserving Action Recognition.

CoRR, 2023

XAGen: 3D Expressive Human Avatars Generation.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Learning Visual Prior via Generative Pre-Training.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Object-centric Learning with Cyclic Walks between Parts and Whole.

Ziyu Wang

Mike Zheng Shou

Mengmi Zhang

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Large Generative Models Meet Multimodal Video Intelligence.

Mike Zheng Shou

Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications, 2023

PV3D: A 3D Generative Model for Portrait Video Generation.

Proceedings of the Eleventh International Conference on Learning Representations, 2023

The Metaverse Data Deluge: What Can We Do About It?

Proceedings of the 39th IEEE International Conference on Data Engineering, 2023

ICDAR 2023 Competition on Video Text Reading for Dense and Small Text.

Proceedings of the Document Analysis and Recognition - ICDAR 2023, 2023

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Label-Efficient Online Continual Object Detection in Streaming Video.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Too Large; Data Reduction for Vision-Language Pre-Training.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Learning to Learn: How to Continuously Teach Humans and Machines.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

UniVTG: Towards Unified Video-Language Temporal Grounding.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Unsupervised Open-Vocabulary Object Localization in Videos.

Carl-Johann Simon-Gabriel

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Revisiting Vision Transformer from the View of Path Ensemble.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Position-Guided Text Prompt for Vision-Language Pre-Training.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

All in One: Exploring Unified Video-Language Pre-Training.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Affordance Grounding from Demonstration Video to Target Image.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

DOAD: Decoupled One Stage Action Detection Network.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Making Vision Transformers Efficient from A Token Sparsification View.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Darwinian Model Upgrades: Model Evolving with Selective Compatibility.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Video-Text Pre-training with Learned Regions for Retrieval.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

Deep Motion Prior for Weakly-Supervised Temporal Action Localization.

IEEE Trans. Image Process., 2022

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.

CoRR, 2022

Position-guided Text Prompt for Vision-Language Pre-training.

CoRR, 2022

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis.

CoRR, 2022

Learning to Learn: How to Continuously Teach Humans and Machines.

CoRR, 2022

An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022.

CoRR, 2022

Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization.

CoRR, 2022

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022.

CoRR, 2022

Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.

CoRR, 2022

Sense The Physical, Walkthrough The Virtual, Manage The Metaverse: A Data-centric Perspective.

CoRR, 2022

Egocentric Video-Language Pretraining.

CoRR, 2022

Novel View Synthesis for High-fidelity Headshot Scenes.

CoRR, 2022

GEB+: A benchmark for generic event boundary captioning, grounding and text-based retrieval.

CoRR, 2022

Revitalize Region Feature for Democratizing Video-Language Pre-training.

CoRR, 2022

All in One: Exploring Unified Video-Language Pre-training.

CoRR, 2022

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Egocentric Video-Language Pretraining.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

AVA-AVD: Audio-visual Speaker Diarization in the Wild.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation.

Proceedings of the HCMA@MM 2022: Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis, 2022

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning.

Proceedings of the Computer Vision - ECCV 2022, 2022

AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant.

Proceedings of the Computer Vision - ECCV 2022, 2022

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval.

Proceedings of the Computer Vision - ECCV 2022, 2022

Object-aware Video-language Pre-training for Retrieval.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Unified Transformer Tracker for Object Tracking.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Ego4D: Around the World in 3,000 Hours of Egocentric Video.

Santhosh Kumar Ramakrishnan

Christoph Feichtenhofer

Giovanni Maria Farinella

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021

Video-Text Pre-training with Learned Regions.

CoRR, 2021

AssistSR: Affordance-centric Question-driven Video Segment Retrieval.

CoRR, 2021

AVA-AVD: Audio-visual Speaker Diarization in the Wild.

CoRR, 2021

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video.

CoRR, 2021

Ego4D: Around the World in 3,000 Hours of Egocentric Video.

Santhosh Kumar Ramakrishnan

Christoph Feichtenhofer

Giovanni Maria Farinella

CoRR, 2021

Generic Event Boundary Detection: A Benchmark for Event Segmentation.

CoRR, 2021

Is Someone Speaking?: Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

Channel Augmented Joint Learning for Visible-Infrared Recognition.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Generic Event Boundary Detection: A Benchmark for Event Segmentation.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Searching for Two-Stream Models in Multivariate Space for Video Recognition.

Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization.

CoRR, 2020

SF-Net: Single-Frame Supervision for Temporal Action Localization.

Proceedings of the Computer Vision - ECCV 2020, 2020

2019

Deep Learning for Action Understanding in Video.

Zheng Shou

PhD thesis, 2019

LPAT: Learning to Predict Adaptive Threshold for Weakly-supervised Temporal Action Localization.

Xudong Lin

Zheng Shou

Shih-Fu Chang

CoRR, 2019

CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation.

CoRR, 2019

DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition.

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018

AutoLoc: Weakly-supervised Temporal Action Localization.

CoRR, 2018

Online Action Detection in Untrimmed, Streaming Videos - Modeling and Evaluation.

CoRR, 2018

Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks.

Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Online Detection of Action Start in Untrimmed, Streaming Videos.

Proceedings of the Computer Vision - ECCV 2018, 2018

AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos.

Proceedings of the Computer Vision - ECCV 2018, 2018

2017

ConvNet Architecture Search for Spatiotemporal Feature Learning.

CoRR, 2017

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.

Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

2016

EventNet Version 1.1 Technical Report.

CoRR, 2016

Action Temporal Localization in Untrimmed Videos via Multi-stage CNNs.

Zheng Shou

Dongang Wang

Shih-Fu Chang

CoRR, 2016

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs.

Zheng Shou

Dongang Wang

Shih-Fu Chang

Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016
