2025

LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement.

CoRR, March, 2025

2024

InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Gull: A Generative Multifunctional Audio Codec.

CoRR, 2024

Consistent and Relevant: Rethink the Query Embedding in General Sound Separation.

Proceedings of the IEEE International Conference on Acoustics, 2024

Opine: Leveraging a Optimization-Inspired Deep Unfolding Method for Multi-Channel Speech Enhancement.

Proceedings of the IEEE International Conference on Acoustics, 2024

DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis.

Proceedings of the IEEE International Conference on Acoustics, 2024

SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer Based on Source-Filter Model.

Proceedings of the IEEE International Conference on Acoustics, 2024

Complexity Scaling for Speech Denoising.

Proceedings of the IEEE International Conference on Acoustics, 2024

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023

Diffsound: Discrete Diffusion Model for Text-to-Sound Generation.

IEEE ACM Trans. Audio Speech Lang. Process., 2023

Integrating Lattice-Free MMI Into End-to-End Speech Recognition.

IEEE ACM Trans. Audio Speech Lang. Process., 2023

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation.

CoRR, 2023

DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis.

CoRR, 2023

Rep2wav: Noise Robust Text-to-Speech Using Self-Supervised Representations.

CoRR, 2023

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation.

CoRR, 2023

Make-A-Voice: Unified Voice Synthesis With Discrete Representation.

CoRR, 2023

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec.

CoRR, 2023

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt.

CoRR, 2023

High Fidelity Speech Enhancement with Band-split RNN.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction.

,

,

,

,

,

,

Shinji Watanabe

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2023

Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks.

,

,

,

,

,

Shinji Watanabe

Proceedings of the Eleventh International Conference on Learning Representations, 2023

TSpeech-AI System Description to the 5th Deep Noise Suppression (DNS) Challenge.

Proceedings of the IEEE International Conference on Acoustics, 2023

2022

Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model.

IEEE Signal Process. Lett., 2022

Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition.

Aswin Shanmugam Subramanian

,

,

Shinji Watanabe

,

,

Comput. Speech Lang., 2022

An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer.

,

,

,

Shinji Watanabe

,

,

Comput. Speech Lang., 2022

The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022.

CoRR, 2022

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR.

CoRR, 2022

Integrate Lattice-Free MMI into End-to-End Speech Recognition.

CoRR, 2022

Improving Target Sound Extraction with Timestamp Information.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Multi-Channel Speaker Diarization Using Spatial Features for Meetings.

Proceedings of the IEEE International Conference on Acoustics, 2022

The CUHK-Tencent Speaker Diarization System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.

Proceedings of the IEEE International Conference on Acoustics, 2022

Towards end-to-end Speaker Diarization with Generalized Neural Speaker Clustering.

Proceedings of the IEEE International Conference on Acoustics, 2022

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization.

,

,

,

Shi-Xiong Zhang

,

Siddharth Dalmia

,

,

,

Shinji Watanabe

,

Proceedings of the IEEE International Conference on Acoustics, 2022

Consistent Training and Decoding for End-to-End Speech Recognition Using Lattice-Free MMI.

,

,

,

Shi-Xiong Zhang

,

,

,

Proceedings of the IEEE International Conference on Acoustics, 2022

Simple Attention Module Based Speaker Verification with Iterative Noisy Label Detection.

Proceedings of the IEEE International Conference on Acoustics, 2022

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-Based Multi-Modal Context Modeling.

Proceedings of the IEEE International Conference on Acoustics, 2022

2021

Detect what you want: Target Sound Detection.

CoRR, 2021

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio.

,

,

,

,

Wei-Qiang Zhang

,

,

,

,

,

,

,

Sanjeev Khudanpur

,

Shinji Watanabe

,

Shuaijiang Zhao

,

,

,

,

,

,

,

CoRR, 2021

Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis.

CoRR, 2021

VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention.

CoRR, 2021

Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising.

,

,

,

,

Shi-Xiong Zhang

,

,

Proceedings of the IEEE Spoken Language Technology Workshop, 2021

TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation.

,

,

,

,

,

,

Shi-Xiong Zhang

,

,

,

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio.

,

,

,

,

Wei-Qiang Zhang

,

,

,

,

,

,

,

Sanjeev Khudanpur

,

Shinji Watanabe

,

Shuaijiang Zhao

,

,

,

,

,

,

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

A Joint Training Framework of Multi-Look Separator and Speaker Embedding Extractor for Overlapped Speech.

Proceedings of the IEEE International Conference on Acoustics, 2021

Towards Robust Speaker Verification with Target Speaker Enhancement.

Proceedings of the IEEE International Conference on Acoustics, 2021

Self-Supervised Text-Independent Speaker Verification Using Prototypical Momentum Contrastive Learning.

Proceedings of the IEEE International Conference on Acoustics, 2021

Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization.

Aswin Shanmugam Subramanian

,

,

Shinji Watanabe

,

,

,

Shi-Xiong Zhang

,

Proceedings of the IEEE International Conference on Acoustics, 2021

Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input.

Proceedings of the IEEE International Conference on Acoustics, 2021

Improving RNN Transducer with Target Speaker Extraction and Neural Uncertainty Estimation.

,

,

,

Shinji Watanabe

,

,

Proceedings of the IEEE International Conference on Acoustics, 2021

Replay and Synthetic Speech Detection with Res2Net Architecture.

Proceedings of the IEEE International Conference on Acoustics, 2021

2020

DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

DurIAN: Duration Informed Attention Network for Speech Synthesis.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Neural Spatio-Temporal Beamformer for Target Speech Separation.

,

,

Shi-Xiong Zhang

,

,

,

,

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Peking Opera Synthesis via Duration Informed Attention Network.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition.

Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives.

Aswin Shanmugam Subramanian

,

,

,

Shi-Xiong Zhang

,

,

Shinji Watanabe

,

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

The Tencent speech synthesis system for Blizzard Challenge 2020.

Proceedings of the Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020

2019

Erratum to: Past review, current progress, and challenges ahead on the cocktail party problem.

Frontiers Inf. Technol. Electron. Eng., 2019

Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network.

CoRR, 2019

Learning Singing From Speech.

CoRR, 2019

DurIAN: Duration Informed Attention Network For Multimodal Synthesis.

CoRR, 2019

Large Margin Training for Attention Based End-to-End Speech Recognition.

Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Joint Training of Complex Ratio Mask Based Beamformer and Acoustic Model for Noise Robust ASR.

Proceedings of the IEEE International Conference on Acoustics, 2019

A Comparison of Lattice-free Discriminative Training Criteria for Purely Sequence-trained Neural Network Acoustic Models.

Proceedings of the IEEE International Conference on Acoustics, 2019

Token-wise Training for Attention Based End-to-end Speech Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2019

Investigating End-to-end Speech Recognition for Mandarin-English Code-switching.

Proceedings of the IEEE International Conference on Acoustics, 2019

Component Fusion: Learning Replaceable Language Model Component for End-to-end Speech Recognition System.

Proceedings of the IEEE International Conference on Acoustics, 2019

Parametric Cepstral Mean Normalization for Robust Speech Recognition.

,

Gautam Bhattacharya

,

Proceedings of the IEEE International Conference on Acoustics, 2019

2018

Past review, current progress, and challenges ahead on the cocktail party problem.

Frontiers Inf. Technol. Electron. Eng., 2018

An Exploration of Directly Using Word as Acoustic Modeling Unit for Speech Recognition.

Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions.

Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

A Multistage Training Framework for Acoustic-to-Word Model.

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition.

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

2016

Primary and Secondary Webpage Information Hiding Algorithm Based on Invisible Characters.

Computer Science (计算机科学), 2016

2015

Towards robust conversational speech recognition and understanding.

PhD thesis, 2015

Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition.

,

,

Michael L. Seltzer

,

IEEE ACM Trans. Audio Speech Lang. Process., 2015

Discriminative Training Using Non-Uniform Criteria for Keyword Spotting on Spontaneous Speech.

,

Biing-Hwang Fred Juang

IEEE ACM Trans. Audio Speech Lang. Process., 2015

2014

Latent semantic rational kernels for topic spotting on conversational speech.

,

David L. Thomson

,

Patrick Haffner

,

Biing-Hwang Juang

IEEE ACM Trans. Audio Speech Lang. Process., 2014

Beyond cross-entropy: towards better frame-level objective functions for deep neural network training in automatic speech recognition.

Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Feature space maximum a posteriori linear regression for adaptation of deep neural networks.

,

,

Sabato Marco Siniscalchi

,

,

,

Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Recurrent deep neural networks for robust speech recognition.

,

,

Shinji Watanabe

,

Biing-Hwang Fred Juang

Proceedings of the IEEE International Conference on Acoustics, 2014

Single-channel mixed speech recognition using deep neural networks.

,

,

Michael L. Seltzer

,

Proceedings of the IEEE International Conference on Acoustics, 2014

Deep learning vector quantization for acoustic information retrieval.

Proceedings of the IEEE International Conference on Acoustics, 2014

2013

Latent semantic rational kernels for topic spotting on spontaneous conversational speech.

,

Biing-Hwang Juang

Proceedings of the IEEE International Conference on Acoustics, 2013

Adaptive boosted non-uniform MCE for keyword spotting on spontaneous speech.

,

Biing-Hwang Juang

Proceedings of the IEEE International Conference on Acoustics, 2013

2012

Discriminative Training Using Non-uniform Criteria for Keyword Spotting on Spontaneous Speech.

,

Biing-Hwang Juang

,

Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

A comparative study of discriminative training using non-uniform criteria for cross-layer acoustic modeling.

,

Biing-Hwang Juang

Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

2011

Recent development of discriminative training using non-uniform criteria for cross-level acoustic modeling.

,

Biing-Hwang Juang

Proceedings of the IEEE International Conference on Acoustics, 2011