2025
Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment.
CoRR, March, 2025
PodAgent: A Comprehensive Framework for Podcast Generation.
CoRR, March, 2025
ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training.
Proceedings of the 2025 IEEE International Conference on Acoustics, 2025
Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
2024
NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality.
IEEE Trans. Pattern Anal. Mach. Intell., June, 2024
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.
CoRR, 2024
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
PromptTTS 2: Describing and Generating Voices with Text Prompt.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
StyleSpeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2024
2023
PromptTTS 2: Describing and Generating Voices with Text Prompt.
CoRR, 2023
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.
CoRR, 2023
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling.
CoRR, 2023
FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model.
CoRR, 2023
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.
CoRR, 2023
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023
Large-Scale Automatic Audiobook Creation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023
LeanSpeech: The Microsoft Lightweight Speech Synthesis System for LIMMITS Challenge 2023.
Proceedings of the IEEE International Conference on Acoustics, 2023
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation.
Proceedings of the IEEE International Conference on Acoustics, 2023
Prosody-Aware SpeechT5 for Expressive Neural TTS.
Proceedings of the IEEE International Conference on Acoustics, 2023
MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023.
Proceedings of the 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023, 2023
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023
2022
ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech.
CoRR, 2022
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
SoftSpeech: Unsupervised Duration Model in FastSpeech 2.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
Self-supervised Context-aware Style Representation for Expressive Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022
Exploring Machine Speech Chain For Domain Adaptation.
Proceedings of the IEEE International Conference on Acoustics, 2022
ProsodySpeech: Towards Advanced Prosody Model for Neural Text-to-Speech.
Proceedings of the IEEE International Conference on Acoustics, 2022
Improving FastSpeech TTS with Efficient Self-Attention and Compact Feed-Forward Network.
Proceedings of the IEEE International Conference on Acoustics, 2022
InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.
Proceedings of the IEEE International Conference on Acoustics, 2022
2021
Cycle consistent network for end-to-end style transfer TTS training.
Neural Networks, 2021
Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation.
CoRR, 2021
Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis.
CoRR, 2021
Conversational End-to-End TTS for Voice Agents.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021
Cross-Speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021
Speech BERT Embedding for Improving Prosody in Neural TTS.
Proceedings of the IEEE International Conference on Acoustics, 2021
DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021.
Proceedings of the Blizzard Challenge 2021, virtual, October 23, 2021, 2021
On Addressing Practical Challenges for RNN-Transducer.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021
2020
s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis.
CoRR, 2020
Conversational End-to-End TTS for Voice Agent.
CoRR, 2020
On Early-stop Clustering for Speaker Diarization.
Proceedings of the Odyssey 2020: The Speaker and Language Recognition Workshop, 2020
Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020
An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020
Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020
Improving Prosody with Linguistic and BERT Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020
Adaptation of RNN Transducer with Text-To-Speech Technology for Keyword Spotting.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020
Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020
2019
Feature reinforcement with word embedding and parsing information in neural TTS.
CoRR, 2019
Forward-Backward Decoding for Regularizing End-to-End TTS.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019
Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019
Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019
A New GAN-Based End-to-End TTS Training Algorithm.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019
Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2019
2018
Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice.
CoRR, 2018
Frame Selection in SI-DNN Phonetic Space with WaveNet Vocoder for Voice Conversion without Parallel Training Data.
Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018
A New Glottal Neural Vocoder for Speech Synthesis.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018
2016
Modeling F0 trajectories in hierarchically structured deep neural networks.
Speech Commun., 2016
Learning Distributed Word Representations For Bidirectional LSTM Recurrent Neural Network.
Proceedings of the NAACL HLT 2016, 2016
Speaker and language factorization in DNN-based TTS synthesis.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016
Unsupervised speaker adaptation for DNN-based TTS synthesis.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016
2015
A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding.
CoRR, 2015
Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network.
CoRR, 2015
Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015
Word embedding for recurrent neural network based TTS synthesis.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015
2014
Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014