Zhengyuan Yang

Orcid: 0000-0002-5808-0889

According to our database, Zhengyuan Yang authored at least 82 papers between 2015 and 2024.

Bibliography

2024
Introduction to the Special Issue on AI-Generated Content for Multimedia.
IEEE Trans. Circuits Syst. Video Technol., August, 2024

Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
Found. Trends Comput. Graph. Vis., 2024

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation.
CoRR, 2024

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing.
CoRR, 2024

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models.
CoRR, 2024

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization.
CoRR, 2024

AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition.
CoRR, 2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities.
CoRR, 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.
CoRR, 2024

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos.
CoRR, 2024

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation.
CoRR, 2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.
CoRR, 2024

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition.
CoRR, 2024

Design2Code: How Far Are We From Automating Front-End Engineering?
CoRR, 2024

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis.
CoRR, 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training.
CoRR, 2024

OpenLEAF: A Novel Benchmark for Open-Domain Interleaved Image-Text Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Bring Metric Functions into Diffusion Models.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

StrokeNUWA - Tokenizing Strokes for Vector Graphic Synthesis.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation.
Proceedings of the Computer Vision - ECCV 2024, 2024

Idea2Img: Iterative Self-refinement with GPT-4V for Automatic Image Design and Generation.
Proceedings of the Computer Vision - ECCV 2024, 2024

GRiT: A Generative Region-to-Text Transformer for Object Understanding.
Proceedings of the Computer Vision - ECCV 2024, 2024

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

DisCo: Disentangled Control for Realistic Human Dance Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer.
IEEE Trans. Pattern Anal. Mach. Intell., November, 2023

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models.
CoRR, 2023

Interfacing Foundation Models' Embeddings.
CoRR, 2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation.
CoRR, 2023

GPT-4V(ision) as A Social Media Analysis Engine.
CoRR, 2023

MM-VID: Advancing Video Understanding with GPT-4V(ision).
CoRR, 2023

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design.
CoRR, 2023

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation.
CoRR, 2023

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation.
CoRR, 2023

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).
CoRR, 2023

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models.
CoRR, 2023

DisCo: Disentangled Control for Referring Human Dance Generation in Real World.
CoRR, 2023

MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos.
CoRR, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.
CoRR, 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.
CoRR, 2023

Revisiting Transformer for Point Cloud-based 3D Scene Graph Generation.
CoRR, 2023

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Prompting GPT-3 To Be Reliable.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Equivariant Similarity for Vision-Language Foundation Models.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

ReCo: Region-Controlled Text-to-Image Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
GIT: A Generative Image-to-text Transformer for Vision and Language.
Trans. Mach. Learn. Res., 2022

PromptCap: Prompt-Guided Task-Aware Image Captioning.
CoRR, 2022

Cross-modal Contrastive Distillation for Instructional Activity Anticipation.
Proceedings of the 26th International Conference on Pattern Recognition, 2022

Apple Counting Network Before Fruit Thinning Period Based On Dilated Convolution.
Proceedings of the 11th International Conference on Networks, Communication and Computing, 2022

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling.
Proceedings of the Computer Vision - ECCV 2022, 2022

Scaling Up Vision-Language Pretraining for Image Captioning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA.
Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021
Grounding-Tracking-Integration.
IEEE Trans. Circuits Syst. Video Technol., 2021

Scaling Up Vision-Language Pre-training for Image Captioning.
CoRR, 2021

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling.
CoRR, 2021

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning.
CoRR, 2021

SAT: 2D Semantics Assisted Training for 3D Visual Grounding.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

TransVG: End-to-End Visual Grounding with Transformers.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
Dynamic Context-guided Capsule Network for Multimodal Machine Translation.
Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, 2020

Weakly Supervised Body Part Segmentation with Pose based Part Priors.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

Improving One-Stage Visual Grounding by Recursive Sub-query Construction.
Proceedings of the Computer Vision - ECCV 2020, 2020

A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

2019
Action Recognition With Spatio-Temporal Visual Attention on Skeleton Image Sequences.
IEEE Trans. Circuits Syst. Video Technol., 2019

Grounding-Tracking-Integration.
CoRR, 2019

Weakly Supervised Body Part Parsing with Pose based Part Priors.
CoRR, 2019

Human-Centered Emotion Recognition in Animated GIFs.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2019

A Fast and Accurate One-Stage Approach to Visual Grounding.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Attentive Relational Networks for Mapping Images to Scene Graphs.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

2018
End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perception.
CoRR, 2018

End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perceptions.
Proceedings of the 24th International Conference on Pattern Recognition, 2018

Action Recognition with Visual Attention on Skeleton Images.
Proceedings of the 24th International Conference on Pattern Recognition, 2018

2017
Personalized pose estimation for body language understanding.
Proceedings of the 2017 IEEE International Conference on Image Processing, 2017

2015
Curve fitting and optimal interpolation for CNC machining under confined error using quadratic B-splines.
Comput. Aided Des., 2015

