2025

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment.

Ke Wang

Wenning Wei

Yan Deng

Lei He

Sheng Zhao

CoRR, September, 2025

Next Tokens Denoising for Speech Synthesis.

CoRR, July, 2025

Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment.

CoRR, March, 2025

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training.

Proceedings of the 2025 IEEE International Conference on Acoustics, 2025

PodAgent: A Comprehensive Framework for Podcast Generation.

Proceedings of the Findings of the Association for Computational Linguistics, 2025

Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024

NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality.

IEEE Trans. Pattern Anal. Mach. Intell., June, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

CoRR, 2024

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations.

Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024

UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

PromptTTS 2: Describing and Generating Voices with Text Prompt.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

StyleSpeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.

Proceedings of the IEEE International Conference on Acoustics, 2024

2023

PromptTTS 2: Describing and Generating Voices with Text Prompt.

CoRR, 2023

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

CoRR, 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling.

CoRR, 2023

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model.

CoRR, 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.

CoRR, 2023

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Large-Scale Automatic Audiobook Creation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

LeanSpeech: The Microsoft Lightweight Speech Synthesis System for LIMMITS Challenge 2023.

Proceedings of the IEEE International Conference on Acoustics, 2023

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation.

Proceedings of the IEEE International Conference on Acoustics, 2023

Prosody-Aware SpeechT5 for Expressive Neural TTS.

Proceedings of the IEEE International Conference on Acoustics, 2023

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023.

Proceedings of the 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023, 2023

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech.

CoRR, 2022

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.

Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

SoftSpeech: Unsupervised Duration Model in FastSpeech 2.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Exploring Machine Speech Chain For Domain Adaptation.

Proceedings of the IEEE International Conference on Acoustics, 2022

ProsodySpeech: Towards Advanced Prosody Model for Neural Text-to-Speech.

Proceedings of the IEEE International Conference on Acoustics, 2022

Improving FastSpeech TTS with Efficient Self-Attention and Compact Feed-Forward Network.

Proceedings of the IEEE International Conference on Acoustics, 2022

InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.

Proceedings of the IEEE International Conference on Acoustics, 2022

2021

Cycle consistent network for end-to-end style transfer TTS training.

Neural Networks, 2021

Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation.

CoRR, 2021

Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis.

CoRR, 2021

Conversational End-to-End TTS for Voice Agents.

Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Cross-Speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis.

Shifeng Pan

Lei He

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Speech BERT Embedding for Improving Prosody in Neural TTS.

Proceedings of the IEEE International Conference on Acoustics, 2021

DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021.

Proceedings of the Blizzard Challenge 2021, virtual, October 23, 2021, 2021

On Addressing Practical Challenges for RNN-Transducer.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

2020

s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis.

CoRR, 2020

Conversational End-to-End TTS for Voice Agent.

CoRR, 2020

On Early-stop Clustering for Speaker Diarization.

Proceedings of the Odyssey 2020: The Speaker and Language Recognition Workshop, 2020

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.

Sarangarajan Parthasarathy

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Improving Prosody with Linguistic and BERT Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Adaptation of RNN Transducer with Text-To-Speech Technology for Keyword Spotting.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019

Feature reinforcement with word embedding and parsing information in neural TTS.

CoRR, 2019

Forward-Backward Decoding for Regularizing End-to-End TTS.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS.

Mutian He

Yan Deng

Lei He

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

A New GAN-Based End-to-End TTS Training Algorithm.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis.

Proceedings of the IEEE International Conference on Acoustics, 2019

2018

Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice.

Yan Deng

Lei He

Frank K. Soong

CoRR, 2018

Frame Selection in SI-DNN Phonetic Space with WaveNet Vocoder for Voice Conversion without Parallel Training Data.

Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

A New Glottal Neural Vocoder for Speech Synthesis.

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

2016

Modeling F0 trajectories in hierarchically structured deep neural networks.

Speech Commun., 2016

Learning Distributed Word Representations For Bidirectional LSTM Recurrent Neural Network.

Proceedings of the NAACL HLT 2016, 2016

Speaker and language factorization in DNN-based TTS synthesis.

Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Unsupervised speaker adaptation for DNN-based TTS synthesis.

Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

2015

A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding.

CoRR, 2015

Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network.

CoRR, 2015

Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis.

Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Word embedding for recurrent neural network based TTS synthesis.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

2014

Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree.

Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014