Sheng Zhao

Orcid: 0000-0002-9624-5381

According to our database, Sheng Zhao authored at least 110 papers between 1996 and 2024.

NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality.
IEEE Trans. Pattern Anal. Mach. Intell., June, 2024

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech.
CoRR, 2024

Autoregressive Speech Synthesis without Vector Quantization.
CoRR, 2024

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS.
CoRR, 2024

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment.
CoRR, 2024

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS.
CoRR, 2024

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers.
CoRR, 2024

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation.
CoRR, 2024

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations.
CoRR, 2024

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis.
CoRR, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.
CoRR, 2024

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like.
CoRR, 2024

UniAudio: Towards Universal Audio Generation with Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

PromptTTS 2: Describing and Generating Voices with Text Prompt.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

GAIA: Zero-shot Talking Avatar Generation.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Two-Stage Optimal Trajectory Planning Based on Resilience Adjustment Model for Virtually Coupled Trains.
IEEE Trans. Intell. Transp. Syst., December, 2023

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation.
IEEE J. Sel. Top. Signal Process., November, 2023

Robust adaptive Unscented Kalman Filter with gross error detection and identification for power system forecasting-aided state estimation.
J. Frankl. Inst., September, 2023

The First High-quality Reference Genome of Sika Deer Provides Insights into High-tannin Adaptation.
Genom. Proteom. Bioinform., 2023

UniAudio: An Audio Foundation Model Toward Universal Audio Generation.
CoRR, 2023

PromptTTS 2: Describing and Generating Voices with Text Prompt.
CoRR, 2023

The detection and rectification for identity-switch based on unfalsified control.
CoRR, 2023

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023.
CoRR, 2023

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.
CoRR, 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling.
CoRR, 2023

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model.
CoRR, 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.
CoRR, 2023

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Large-Scale Automatic Audiobook Creation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

LeanSpeech: The Microsoft Lightweight Speech Synthesis System for Limmits Challenge 2023.
Proceedings of the IEEE International Conference on Acoustics, 2023

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation.
Proceedings of the IEEE International Conference on Acoustics, 2023

PromptTTS: Controllable Text-To-Speech With Text Descriptions.
Proceedings of the IEEE International Conference on Acoustics, 2023

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023.
Proceedings of the 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech.
CoRR, 2022

Memories are One-to-Many Mapping Alleviators in Talking Face Generation.
CoRR, 2022

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks.
Proceedings of the 23rd International Society for Music Information Retrieval Conference, 2022

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

A Study on the Efficacy of Model Pre-Training In Developing Neural Text-to-Speech System.
Proceedings of the IEEE International Conference on Acoustics, 2022

Transformer-S2A: Robust and Efficient Speech-to-Animation.
Proceedings of the IEEE International Conference on Acoustics, 2022

Infergrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.
Proceedings of the IEEE International Conference on Acoustics, 2022

Design and Adaptive Control of Matrix Transformer Based Indirect Converter for Large-Capacity Circuit Breaker Testing Application.
IEEE Trans. Ind. Electron., 2021

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style.
CoRR, 2021

Adaptive Text to Speech for Spontaneous Style.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

AdaSpeech: Adaptive Text to Speech for Custom Voice.
Proceedings of the 9th International Conference on Learning Representations, 2021

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
Proceedings of the 9th International Conference on Learning Representations, 2021

Denoispeech: Denoising Text to Speech with Frame-Level Noise Modeling.
Proceedings of the IEEE International Conference on Acoustics, 2021

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data.
Proceedings of the IEEE International Conference on Acoustics, 2021

Lightspeech: Lightweight and Fast Text to Speech with Neural Architecture Search.
Proceedings of the IEEE International Conference on Acoustics, 2021

MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network.
Proceedings of the IEEE International Conference on Acoustics, 2021

DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021.
Proceedings of the Blizzard Challenge 2021, virtual, October 23, 2021

Vital Sign Detection during Large-Scale and Fast Body Movements Based on an Adaptive Noise Cancellation Algorithm Using a Single Doppler Radar Sensor.
Sensors, 2020

Accurate Doppler radar-based heart rate measurement using matched filter.
IEICE Electron. Express, 2020

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition.
Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

Enhancing Monotonicity for Robust Autoregressive Transformer TTS.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Semantic Mask for Transformer Based End-to-End Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

A Study of Non-autoregressive Model for Sequence Generation.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

RobuTrans: A Robust Transformer-Based Text-to-Speech Model.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

Correlation Analysis of Breast Cancer DWI Combined with DCE-MRI Imaging Features with Molecular Subtypes and Prognostic Factors.
J. Medical Syst., 2019

Application of MRI and CT Energy Spectrum Imaging in Hand and Foot Tendon Lesions.
J. Medical Syst., 2019

A Methodology of Timing Co-Evolutionary Path Optimization for Accident Emergency Rescue Considering Future Environmental Uncertainty.
IEEE Access, 2019

FastSpeech: Fast, Robust and Controllable Text to Speech.
Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

A Resilience Adjustment Method for Real-time Cooperative Optimization of High-speed Trains.
Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference, 2019

Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Towards Discriminative Representation Learning for Speech Emotion Recognition.
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

Almost Unsupervised Text to Speech and Automatic Speech Recognition.
Proceedings of the 36th International Conference on Machine Learning, 2019

Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

Knowledge Distillation from Bert in Pre-Training and Fine-Tuning for Polyphone Disambiguation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Neural Speech Synthesis with Transformer Network.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

Close to Human Quality TTS with Transformer.
CoRR, 2018

A Two-stage Method to Optimise Driving Strategy and Timetable for High-speed Trains.
Proceedings of the 21st International Conference on Intelligent Transportation Systems, 2018

Hypervisor based approach for integrated cockpit solutions.
Proceedings of the 8th IEEE International Conference on Consumer Electronics - Berlin, 2018

High-Precision Vehicle Navigation in Urban Environments Using an MEM's IMU and Single-Frequency GPS Receiver.
IEEE Trans. Intell. Transp. Syst., 2016

Computationally Efficient Carrier Integer Ambiguity Resolution in Multiepoch GPS/INS: A Common-Position-Shift Approach.
IEEE Trans. Control. Syst. Technol., 2016

Synthesis and Characterization of Magnetic Polyvinyl Alcohol (PVA) Hydrogel Microspheres for the Embolization of Blood Vessel.
IEEE Trans. Biomed. Eng., 2016

Multi-authority E-voting System Based on Group Blind Signature.
Int. J. Online Eng., 2015

High reliability integer ambiguity resolution of 6DOF RTK GPS/INS.
Proceedings of the 53rd IEEE Conference on Decision and Control, 2014

Quaternion-based trajectory tracking control of VTOL-UAVs using command filtered backstepping.
Proceedings of the American Control Conference, 2013

2D LIDAR Aided INS for vehicle positioning in urban environments.
Proceedings of the IEEE International Conference on Control Applications, 2013

Self-Localization and Tracking of Multiple robots in Experimental setups.
Int. J. Robotics Autom., 2012

Resources Collaborative Scheduling Model Based on Trust Mechanism in Cloud.
Proceedings of the 11th IEEE International Conference on Trust, 2012

Turning a Monolingual Speaker into Multilingual for a Mixed-language TTS.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

MobiMsg: A Resource-Efficient Location-Based Mobile Instant Messaging System.
Proceedings of the 2012 Second International Conference on Cloud and Green Computing, 2012

Optimization-based road curve fitting.
Proceedings of the 50th IEEE Conference on Decision and Control and European Control Conference, 2011

Density-based control of multiple robots.
Proceedings of the American Control Conference, 2011

A novel way to implement self-localization in a multi-robot experimental platform.
Proceedings of the American Control Conference, 2010

Analysis of synonymous codon usage in 11 Human Bocavirus isolates.
Biosyst., 2008

Comprehensive Algorithm for Quantitative Real-Time Polymerase Chain Reaction.
J. Comput. Biol., 2005

Chinese prosodic phrasing with extended features.
Proceedings of the 2003 IEEE International Conference on Acoustics, 2003

Automatic stress prediction of Chinese speech synthesis.
Proceedings of the 2002 International Symposium on Chinese Spoken Language Processing, 2002

Prosodic phrasing with inductive learning.
Proceedings of the 7th International Conference on Spoken Language Processing, ICSLP2002, 2002

Learning Rules for Chinese Prosodic Phrase Prediction.
Proceedings of the First Workshop on Chinese Language Processing, 2002

Motif neural network design for large-scale protein family identification.
Proceedings of International Conference on Neural Networks (ICNN'97), 1997

A Protein Class Database Organized with ProSite Protein Groups and PIR Superfamilies.
J. Comput. Biol., 1996

Motif identification neural design for rapid and sensitive protein family search.
Comput. Appl. Biosci., 1996
