Lei Xie

ORCID: 0000-0001-8234-0823

Affiliations:
  • Northwestern Polytechnical University, School of Computer Science, Xi'an, China
  • The Chinese University of Hong Kong, Department of Systems Engineering and Engineering Management, Hong Kong (2006 - 2007)
  • City University of Hong Kong, School of Creative Media, Hong Kong (2004 - 2006)
  • Northwestern Polytechnical University, Xi'an, China (PhD 2004)
  • Vrije Universiteit Brussel, Department of Electronics and Information Processing, Belgium (2001 - 2002)


According to our database, Lei Xie authored at least 404 papers between 2005 and 2024.

Bibliography

2024
METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Conversational Speech Recognition by Learning Audio-Textual Cross-Modal Contextual Representation.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion.
IEEE Signal Process. Lett., 2024

MMGER: Multi-Modal and Multi-Granularity Generative Error Correction With LLM for Joint Accent and Speech Recognition.
IEEE Signal Process. Lett., 2024

Distil-DCCRN: A Small-Footprint DCCRN Leveraging Feature-Based Knowledge Distillation in Speech Enhancement.
IEEE Signal Process. Lett., 2024

Whisper-SV: Adapting Whisper for low-data-resource speaker verification.
Speech Commun., 2024

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM.
CoRR, 2024

The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings.
CoRR, 2024

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge.
CoRR, 2024

NTU-NPU System for Voice Privacy 2024 Challenge.
CoRR, 2024

Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling.
CoRR, 2024

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text.
CoRR, 2024

Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge.
CoRR, 2024

Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge.
CoRR, 2024

Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation.
CoRR, 2024

Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper.
CoRR, 2024

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition.
CoRR, 2024

The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024.
CoRR, 2024

Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification.
CoRR, 2024

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study.
CoRR, 2024

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy.
CoRR, 2024

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter.
CoRR, 2024

RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention.
CoRR, 2024

AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection.
CoRR, 2024

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling.
CoRR, 2024

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition.
CoRR, 2024

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets.
CoRR, 2024

The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023.
CoRR, 2024

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models.
CoRR, 2024

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment.
Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-Supervised Contrastive Learning.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Bs-Plcnet: Band-Split Packet Loss Concealment Network with Multi-Task Learning Framework and Multi-Discriminators.
Proceedings of the IEEE International Conference on Acoustics, 2024

Promptvc: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts.
Proceedings of the IEEE International Conference on Acoustics, 2024

SELM: Speech Enhancement using Discrete Tokens and Language Models.
Proceedings of the IEEE International Conference on Acoustics, 2024

MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2024

ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2024

Dualvc 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion.
Proceedings of the IEEE International Conference on Acoustics, 2024

Automatic Channel Selection and Spatial Feature Integration for Multi-Channel Speech Recognition Across Various Array Topologies.
Proceedings of the IEEE International Conference on Acoustics, 2024

Rad-Net: A Repairing and Denoising Network for Speech Signal Improvement.
Proceedings of the IEEE International Conference on Acoustics, 2024

An Audio-Quality-Based Multi-Strategy Approach For Target Speaker Extraction in the Misp 2023 Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2024

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

2023
A novel evolutionary algorithm inspired from triangle search and its applications on parameters identification of photovoltaic models.
Soft Comput., October, 2023

Neural speech enhancement with unsupervised pre-training and mixture training.
Neural Networks, January, 2023

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement.
IEEE Trans. Multim., 2023

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Persons.
IEEE Trans. Multim., 2023

Timbre-Reserved Adversarial Attack in Speaker Identification.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

MSM-VC: High-Fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-Scale Style Modeling.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech - A Study Between English and Mandarin.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

LM-VC: Zero-Shot Voice Conversion via Speech Generation Based on Language Models.
IEEE Signal Process. Lett., 2023

Accent-VITS: accent transfer for end-to-end TTS.
CoRR, 2023

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization.
CoRR, 2023

SponTTS: modeling and transferring spontaneous style for TTS.
CoRR, 2023

Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning.
CoRR, 2023

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation.
CoRR, 2023

Timbre-reserved Adversarial Attack in Speaker Identification.
CoRR, 2023

The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022.
CoRR, 2023

Multi-level Temporal-channel Speaker Retrieval for Robust Zero-shot Voice Conversion.
CoRR, 2023

Distance-based Weight Transfer from Near-field to Far-field Speaker Verification.
CoRR, 2023

The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task.
Proceedings of the 20th International Conference on Spoken Language Translation, 2023

VISinger2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Two Stage Contextual Word Filtering for Context Bias in Unified Streaming and Non-streaming Transducer.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

DCCRN-KWS: An Audio Bias Based Model for Noise Robust Small-Footprint Keyword Spotting.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling.
Proceedings of the IEEE International Conference on Acoustics, 2023

Two-Step Band-Split Neural Network Approach For Full-Band Residual Echo Suppression.
Proceedings of the IEEE International Conference on Acoustics, 2023

Distance-Based Weight Transfer for Fine-Tuning From Near-Field to Far-Field Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2023

VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting.
Proceedings of the IEEE International Conference on Acoustics, 2023

Distinguishable Speaker Anonymization Based on Formant and Fundamental Frequency Scaling.
Proceedings of the IEEE International Conference on Acoustics, 2023

Preserving Background Sound in Noise-Robust Voice Conversion Via Multi-Task Learning.
Proceedings of the IEEE International Conference on Acoustics, 2023

The NPU-Elevoc Personalized Speech Enhancement System for Icassp2023 DNS Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2023

Wekws: A Production First Small-Footprint End-to-End Keyword Spotting Toolkit.
Proceedings of the IEEE International Conference on Acoustics, 2023

Delivering Speaking Style in Low-Resource Voice Conversion with Multi-Factor Constraints.
Proceedings of the IEEE International Conference on Acoustics, 2023

DSPGAN: A Gan-Based Universal Vocoder for High-Fidelity TTS by Time-Frequency Domain Supervision from DSP.
Proceedings of the IEEE International Conference on Acoustics, 2023

Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features.
Proceedings of the IEEE International Conference on Acoustics, 2023

Two-Stage Neural Network for ICASSP 2023 Speech Signal Improvement Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2023

The NPU-ASLP System for Deepfake Algorithm Recognition in ADD 2023 Challenge.
Proceedings of the Workshop on Deepfake Audio Detection and Analysis co-located with 32th International Joint Conference on Artificial Intelligence (IJCAI 2023), 2023

The Xiaomi-ASLP Text-to-speech System for Blizzard Challenge 2023.
Proceedings of the 18th Blizzard Challenge Workshop, Grenoble, France, August 29, 2023, 2023

U2-KWS: Unified Two-Pass Open-Vocabulary Keyword Spotting with Keyword Bias.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

An Exploration of Task-Decoupling on Two-Stage Neural Post Filter for Real-Time Personalized Acoustic Echo Cancellation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Promptspeaker: Speaker Generation Based on Text Descriptions.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Vits-Based Singing Voice Conversion Leveraging Whisper and Multi-Scale F0 Modeling.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Salt: Distinguishable Speaker Anonymization Through Latent Space Transformation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0): A Benchmark for Speaker-Attributed ASR.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Sa-Paraformer: Non-Autoregressive End-To-End Speaker-Attributed ASR.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Spike-Triggered Contextual Biasing for End-to-End Mandarin Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Multi-granularity Semantic and Acoustic Stress Prediction for Expressive TTS.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2023

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
ParaTTS: Learning Linguistic and Prosodic Cross-Sentence Information in Paragraph-Based TTS.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

Disentangling Style and Speaker Attributes for TTS Style Transfer.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis.
IEEE Signal Process. Lett., 2022

Improving data augmentation for low resource speech-to-text translation with diverse paraphrasing.
Neural Networks, 2022

Two-stage streaming keyword detection and localization with multi-scale depthwise temporal convolution.
Neural Networks, 2022

Noise-robust voice conversion with domain adversarial training.
Neural Networks, 2022

MSV Challenge 2022: NPU-HC Speaker Verification System for Low-resource Indian Languages.
CoRR, 2022

TESSP: Text-Enhanced Self-Supervised Speech Pre-training.
CoRR, 2022

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer.
CoRR, 2022

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge.
CoRR, 2022

MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario.
CoRR, 2022

NWPU-ASLP System for the VoicePrivacy 2022 Challenge.
CoRR, 2022

An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection.
CoRR, 2022

Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild.
CoRR, 2022

Audio-visual speech separation based on joint feature representation with cross-modal attention.
CoRR, 2022

IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion.
CoRR, 2022

MFCCA: Multi-Frame Cross-Channel Attention for Multi-Speaker ASR in Multi-Party Meeting Scenario.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Spatial-DCCRN: DCCRN Equipped with Frame-Level Angle Feature and Hybrid Filtering for Multi-Channel Speech Enhancement.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

The ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC): Dataset, Tracks, Baseline and Results.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

End-to-End Voice Conversion with Information Perturbation.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Personalized Acoustic Echo Cancellation for Full-duplex Communications.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Backend Ensemble for Speaker Verification and Spoofing Countermeasure.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Minimizing Sequential Confusion Error in Speech Command Recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Multi-Task Deep Residual Echo Suppression with Echo-Aware Loss.
Proceedings of the IEEE International Conference on Acoustics, 2022

WENETSPEECH: A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022

VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2022

Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2022

M2Met: The Icassp 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2022

Conversational Speech Recognition by Learning Conversation-Level Characteristics.
Proceedings of the IEEE International Conference on Acoustics, 2022

One-Shot Voice Conversion For Style Transfer Based On Speaker Adaptation.
Proceedings of the IEEE International Conference on Acoustics, 2022

S-DCCRN: Super Wide Band DCCRN with Learnable Complex Feature for Speech Enhancement.
Proceedings of the IEEE International Conference on Acoustics, 2022

TEA-PSE: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System for ICASSP 2022 DNS Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2022

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
LET-Decoder: A WFST-Based Lazy-Evaluation Token-Group Decoder With Exact Lattice Generation.
IEEE Signal Process. Lett., 2021

Factorized WaveNet for voice conversion with limited data.
Speech Commun., 2021

Cycle consistent network for end-to-end style transfer TTS training.
Neural Networks, 2021

Effective and direct control of neural TTS prosody by removing interactions between different attributes.
Neural Networks, 2021

Controllable cross-speaker emotion transfer for end-to-end speech synthesis.
CoRR, 2021

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person.
CoRR, 2021

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder.
CoRR, 2021

Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition.
CoRR, 2021

INTERSPEECH 2021 ConferencingSpeech Challenge: Towards Far-field Multi-Channel Speech Enhancement for Video Conferencing.
CoRR, 2021

The NPU System for the 2020 Personalized Voice Trigger Challenge.
CoRR, 2021

WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit.
CoRR, 2021

The SLT 2021 Children Speech Recognition Challenge: Open Datasets, Rules and Baselines.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Multi-Band Melgan: Faster Waveform Generation For High-Quality Text-To-Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Cascade RNN-Transducer: Syllable Based Streaming On-Device Mandarin Speech Recognition with a Syllable-To-Character Converter.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Simplified Self-Attention for Transformer-Based end-to-end Speech Recognition.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Conversational End-to-End TTS for Voice Agents.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

IEEE SLT 2021 Alpha-Mini Speech Challenge: Open Datasets, Tracks, Rules and Baselines.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

DESNet: A Multi-Channel Network for Simultaneous Speech Dereverberation, Enhancement and Separation.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Optimizing Voice Conversion Network with Cycle Consistency Loss of Speaker Identity.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Context-aware RNNLM Rescoring for Conversational Speech Recognition.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Adversarial Training for Multi-domain Speaker Recognition.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Accent and Speaker Disentanglement in Many-to-many Voice Conversion.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Controllable Emotion Transfer For End-to-End Speech Synthesis.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

F-T-LSTM Based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Enriching Source Style Transfer in Recognition-Synthesis Based Non-Parallel Voice Conversion.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Controllable Context-Aware Conversational Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder.
Proceedings of the ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, October 18, 2021

Efficient Gradient-Based Neural Architecture Search For End-to-End ASR.
Proceedings of the ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, October 18, 2021

ASMMC21: The 6th International Workshop on Affective Social Multimedia Computing.
Proceedings of the ICMI '21: International Conference on Multimodal Interaction, 2021

A Web-Based Longitudinal Mental Health Monitoring System.
Proceedings of the ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, October 18, 2021

TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN.
Proceedings of the ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, October 18, 2021

The Multi-Speaker Multi-Style Voice Cloning Challenge 2021.
Proceedings of the IEEE International Conference on Acoustics, 2021

Wake Word Detection with Streaming Transformers.
Proceedings of the IEEE International Conference on Acoustics, 2021

The Accented English Speech Recognition Challenge 2020: Open Datasets, Tracks, Baselines, Results and Methods.
Proceedings of the IEEE International Conference on Acoustics, 2021

An Asynchronous WFST-Based Decoder for Automatic Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

Duality Temporal-Channel-Frequency Attention Enhanced Speaker Representation Learning.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

ConferencingSpeech Challenge: Towards Far-Field Multi-Channel Speech Enhancement for Video Conferencing.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Incorporating Typological Features into Language Selection for Multilingual Neural Machine Translation.
Proceedings of the Web and Big Data - 5th International Joint Conference, 2021

Target Speaker Extraction for Customizable Query-by-Example Keyword Spotting.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2021

2020
Improving Adversarial Neural Machine Translation for Morphologically Rich Language.
IEEE Trans. Emerg. Top. Comput. Intell., 2020

Fast Query-by-Example Speech Search Using Attention-Based Deep Binary Embeddings.
IEEE ACM Trans. Audio Speech Lang. Process., 2020

Loanword Identification in Low-Resource Languages with Minimal Supervision.
ACM Trans. Asian Low Resour. Lang. Inf. Process., 2020

Adversarial Feature Learning and Unsupervised Clustering Based Speech Synthesis for Found Data With Acoustic and Textual Noise.
IEEE Signal Process. Lett., 2020

On the localness modeling for the self-attention based end-to-end speech synthesis.
Neural Networks, 2020

Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition.
CoRR, 2020

Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training.
CoRR, 2020

Conversational End-to-End TTS for Voice Agent.
CoRR, 2020

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

An End-to-End Architecture of Online Multi-Channel Speech Separation.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Inaudible Adversarial Perturbations for Targeted Attack in Speaker Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Wake Word Detection with Alignment-Free Lattice-Free MMI.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Channel-Wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Mining Effective Negative Training Samples for Keyword Spotting.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Time-Domain Neural Network Approach for Speech Bandwidth Extension.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Effective Wavenet Adaptation for Voice Conversion with Limited Data.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

The NUS & NWPU system for Voice Conversion Challenge 2020.
Proceedings of the Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020

2019
Adversarial Regularization for Attention Based End-to-End Robust Speech Recognition.
IEEE ACM Trans. Audio Speech Lang. Process., 2019

Region Proposal Network Based Small-Footprint Keyword Spotting.
IEEE Signal Process. Lett., 2019

Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis.
IEEE Access, 2019

Query-by-Example Speech Search Using Recurrent Neural Acoustic Word Embeddings With Temporal Context.
IEEE Access, 2019

Towards Language-Universal Mandarin-English Speech Recognition.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Building a Mixed-Lingual Neural TTS System with Only Monolingual Data.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Improved Speaker-Dependent Separation for CHiME-5 Challenge.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Adversarial Regularization for End-to-End Robust Speaker Verification.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

A New GAN-Based End-to-End TTS Training Algorithm.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Unsupervised Adaptation with Adversarial Dropout Regularization for Robust Speech Recognition.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Deep Audio-visual System for Closed-set Word-level Speech Recognition.
Proceedings of the International Conference on Multimodal Interaction, 2019

Robust Audio-visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization.
Proceedings of the IEEE International Conference on Acoustics, 2019

Enhancing Hybrid Self-attention Structure with Relative-position-aware Bias for Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2019

A Pitch-aware Approach to Single-channel Speech Separation.
Proceedings of the IEEE International Conference on Acoustics, 2019

Adversarial Examples for Improving End-to-end Attention-based Small-footprint Keyword Spotting.
Proceedings of the IEEE International Conference on Acoustics, 2019

Investigating End-to-end Speech Recognition for Mandarin-English Code-switching.
Proceedings of the IEEE International Conference on Acoustics, 2019

Component Fusion: Learning Replaceable Language Model Component for End-to-end Speech Recognition System.
Proceedings of the IEEE International Conference on Acoustics, 2019

Domain Adversarial Training for Improving Keyword Spotting Performance of ESL Speech.
Proceedings of the IEEE International Conference on Acoustics, 2019

An Attention-based Neural Network Approach for Single Channel Speech Enhancement.
Proceedings of the IEEE International Conference on Acoustics, 2019

SZ-NPU Team's Entry to Blizzard Challenge 2019.
Proceedings of the Blizzard Challenge 2019, Vienna, Austria, September 23, 2019, 2019

The Mobvoi Text-To-Speech System for Blizzard Challenge 2019.
Proceedings of the Blizzard Challenge 2019, Vienna, Austria, September 23, 2019, 2019

Controlling Emotion Strength with Relative Attribute for End-to-End Speech Synthesis.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Verifying Deep Keyword Spotting Detection with Acoustic Word Embeddings.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Improving Mandarin End-to-End Speech Synthesis by Self-Attention and Learnable Gaussian Bias.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Time Domain Audio Visual Speech Separation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Virtual Adversarial Training for DS-CNN Based Small-Footprint Keyword Spotting.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

WaveNet Factorization with Singular Value Decomposition for Voice Conversion.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Incremental Lattice Determinization for WFST Decoders.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Exploring RNN-Transducer for Chinese speech recognition.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

Multiple fixed beamformers with a spacial Wiener-form postfilter for far-field speech recognition.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

2018
A Bidirectional LSTM Approach with Word Embeddings for Sentence Boundary Detection.
J. Signal Process. Syst., 2018

Guest Editorial: Advances in Deep Learning for Speech Processing.
J. Signal Process. Syst., 2018

Learning distributed sentence representations for story segmentation.
Signal Process., 2018

Unsupervised measure of Chinese lexical semantic similarity using correlated graph model for news story segmentation.
Neurocomputing, 2018

ASMMC-MMAC 2018: The Joint Workshop of 4th the Workshop on Affective Social Multimedia Computing and first Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop.
Proceedings of the 2018 ACM Multimedia Conference, 2018

A Refined Query-by-Example Approach to Spoken-Term-Detection on ESL learners' Speech.
Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

Learning Acoustic Word Embeddings with Temporal Context for Query-by-Example Speech Search.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Empirical Evaluation of Speaker Adaptation on DNN Based Acoustic Model.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Investigating Generative Adversarial Networks Based Speech Dereverberation for Robust Speech Recognition.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Training Augmentation with Adversarial Examples for Robust Speech Recognition.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Attention-based End-to-End Models for Small-Footprint Keyword Spotting.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Study of Semi-supervised Approaches to Improving English-Mandarin Code-Switching Speech Recognition.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Domain Adversarial Training for Accented Speech Recognition.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Attention-Based End-to-End Speech Recognition on Voice Search.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

The I2R-NWPU-NUS Text-to-Speech System for Blizzard Challenge 2018.
Proceedings of the Blizzard Challenge 2018, Hyderabad, India, September 8, 2018, 2018

Self-validated Story Segmentation of Chinese Broadcast News.
Proceedings of the Advances in Brain Inspired Cognitive Systems, 2018

2017
Modeling Latent Topics and Temporal Distance for Story Segmentation of Broadcast News.
IEEE ACM Trans. Audio Speech Lang. Process., 2017

Multitask Feature Learning for Low-Resource Query-by-Example Spoken Term Detection.
IEEE J. Sel. Top. Signal Process., 2017

Online object tracking based on BLSTM-RNN with contextual-sequential labeling.
J. Ambient Intell. Humaniz. Comput., 2017

A hybrid neural network hidden Markov model approach for automatic story segmentation.
J. Ambient Intell. Humaniz. Comput., 2017

Media computing and applications for immersive communications: recent advances.
J. Ambient Intell. Humaniz. Comput., 2017

An unsupervised deep domain adaptation approach for robust speech recognition.
Neurocomputing, 2017

Sound image externalization for headphone based real-time 3D audio.
Frontiers Comput. Sci., 2017

Introduction to special section on advances of orange technologies.
Frontiers Comput. Sci., 2017

Attention-Based End-to-End Speech Recognition in Mandarin.
CoRR, 2017

Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Pairwise learning using multi-lingual bottleneck features for low-resource query-by-example spoken term detection.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

The I2R-NWPU Text-to-Speech System for Blizzard Challenge 2017.
Proceedings of the Blizzard Challenge 2017, Stockholm, Sweden, August 25, 2017, 2017

Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Multilingual bottle-neck feature learning from untranscribed speech.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Topic embedding of sentences for story segmentation.
Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2017

An end-to-end neural network approach to story segmentation.
Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2017

A segmental DNN/i-vector approach for digit-prompted speaker verification.
Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2017

Frequency-invariant differential microphone array design in the STFT domain.
Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2017

2016
Real-time tracking-by-learning with high-order regularization fusion for big video abstraction.
Signal Process., 2016

Guest Editorial: Immersive Audio/Visual Systems.
Multim. Tools Appl., 2016

A deep bidirectional LSTM approach for video-realistic talking head.
Multim. Tools Appl., 2016

Deformable object tracking with spatiotemporal segmentation in big vision surveillance.
Neurocomputing, 2016

On the impact of phoneme alignment in DNN-based speech synthesis.
Proceedings of the 9th ISCA Speech Synthesis Workshop, 2016

An Automatic Voice Conversion Evaluation Strategy Based on Perceptual Background Noise Distortion and Speaker Similarity.
Proceedings of the 9th ISCA Speech Synthesis Workshop, 2016

The NNI Vietnamese Speech Recognition System for MediaEval 2016.
Proceedings of the Working Notes Proceedings of the MediaEval 2016 Workshop, 2016

Investigating neural network based query-by-example keyword spotting approach for personalized wake-up word detection in Mandarin Chinese.
Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

Learning Neural Network Representations Using Cross-Lingual Bottleneck Features with Word-Pair Information.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

A DNN-HMM Approach to Story Segmentation.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Toward High-Performance Language-Independent Query-by-Example Spoken Term Detection for MediaEval 2015: Post-Evaluation Analysis.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Deep neural network derived bottleneck features for accurate audio classification.
Proceedings of the 2016 IEEE International Conference on Multimedia & Expo Workshops, 2016

Approximate search of audio queries by using DTW with phone time boundary and data augmentation.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Exemplar-based sparse representation of timbre and prosody for voice conversion.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

The I2R-NWPU-NTU Text-to-Speech System at Blizzard Challenge 2016.
Proceedings of the Blizzard Challenge 2016, Cupertino, CA, USA, September 16, 2016, 2016

On the training of DNN-based average voice model for speech synthesis.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016

On the use of I-vectors and average voice model for voice conversion without parallel data.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016

Predicting articulatory movement from text using deep architecture with stacked bottleneck features.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016

Study on near-field crosstalk cancellation based on least square algorithm.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016

2015
Tennis Ball Tracking Using a Two-Layered Data Association Approach.
IEEE Trans. Multim., 2015

Multiple pedestrian tracking based on couple-states Markov chain with semantic topic learning for video surveillance.
Soft Comput., 2015

Topic modeling in multimedia: algorithms and applications.
Soft Comput., 2015

NestDE: generic parameters tuning for automatic story segmentation.
Soft Comput., 2015

Topic segmentation on spoken documents using self-validated acoustic cuts.
Soft Comput., 2015

Expressive talking avatar synthesis and animation.
Multim. Tools Appl., 2015

Head motion synthesis from speech using deep neural networks.
Multim. Tools Appl., 2015

Online Object Tracking Based on CNN with Metropolis-Hasting Re-Sampling.
Proceedings of the 23rd Annual ACM Conference on Multimedia, MM '15, Brisbane, Australia, October 26, 2015

The NNI Query-by-Example System for MediaEval 2015.
Proceedings of the Working Notes Proceedings of the MediaEval 2015 Workshop, 2015

Articulatory movement prediction using deep bidirectional long short-term memory based recurrent neural networks and word/phone embeddings.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Regularized non-negative matrix factorization using alternating direction method of multipliers and its application to source separation.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

An alternating optimization approach for phase retrieval.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

BLSTM neural networks for speech driven head motion synthesis.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: a feasibility study.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Language independent query-by-example spoken term detection using N-best phone sequences and partial matching.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Photo-real talking head with deep bidirectional LSTM.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

Non-negative matrix factorization using stable alternating direction method of multipliers for source separation.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2015

A density peak clustering approach to unsupervised acoustic subword units discovery.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2015

A waveform representation framework for high-quality statistical parametric speech synthesis.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2015

Fundamental frequency modeling using wavelets for emotional voice conversion.
Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, 2015

2014
A statistical parametric approach to video-realistic text-driven talking avatar.
Multim. Tools Appl., 2014

Multimodal joint information processing in human machine interaction: recent advances.
Multim. Tools Appl., 2014

The NNI Query-by-Example System for MediaEval 2014.
Proceedings of the Working Notes Proceedings of the MediaEval 2014 Workshop, 2014

A hybrid virtual bass system with improved phase vocoder and high efficiency.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Experimental study on dereverberation and noise reduction for distant speech recognition.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Intrinsic spectral analysis based on temporal context features for query-by-example spoken term detection.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

A deep neural network approach for sentence boundary detection in broadcast news.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Stereo acoustic echo suppression using widely linear filtering in the frequency domain.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Speech-driven head motion synthesis using neural networks.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

An ensemble of deep neural networks for object tracking.
Proceedings of the 2014 IEEE International Conference on Image Processing, 2014

Unsupervised broadcast news story segmentation using distance dependent Chinese restaurant processes.
Proceedings of the IEEE International Conference on Acoustics, 2014

Sentence boundary detection in Chinese broadcast news using conditional random fields and prosodic features.
Proceedings of the IEEE China Summit & International Conference on Signal and Information Processing, 2014

Learning optimal features for music transcription.
Proceedings of the IEEE China Summit & International Conference on Signal and Information Processing, 2014

Multimodal continuous affect recognition based on LSTM and multiple kernel learning.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2014

Multi-view features in a DNN-CRF model for improved sentence unit detection on English broadcast news.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2014

2013
A two layered data association approach for ball tracking.
Proceedings of the IEEE International Conference on Acoustics, 2013

A tighter lower bound estimate for dynamic time warping.
Proceedings of the IEEE International Conference on Acoustics, 2013

Measuring semantic similarity by contextual word connections in Chinese news story segmentation.
Proceedings of the IEEE International Conference on Acoustics, 2013

Broadcast news story segmentation using latent topics on data manifold.
Proceedings of the IEEE International Conference on Acoustics, 2013

Numerical calculation of the head-related transfer functions with Chinese dummy head.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

Context-dependent deep neural networks for commercial Mandarin speech recognition applications.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions.
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013

2012
Laplacian Eigenmaps for Automatic Story Segmentation of Broadcast News.
IEEE Trans. Speech Audio Process., 2012

Broadcast News Story Segmentation Using Conditional Random Fields and Multimodal Features.
IEICE Trans. Inf. Syst., 2012

Mask Estimation and Refinement for MFT-based Robust Speaker Verification.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Speech Pattern Discovery using Audio-Visual Fusion and Canonical Correlation Analysis.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Lexical Story Co-Segmentation of Chinese Broadcast News.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Acoustic TextTiling for story segmentation of spoken documents.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

Detection of ball hits in a tennis game using audio and visual information.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2012

2011
Pitch-density-based features and an SVM binary tree approach for multi-class audio classification in broadcast news.
Multim. Syst., 2011

On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news.
Inf. Sci., 2011

Probabilistic Latent Semantic Analysis for Broadcast News Story Segmentation.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

2010
Cascade Markov random fields for stroke extraction of Chinese characters.
Inf. Sci., 2010

Minimizing the expected complete influence time of a social network.
Inf. Sci., 2010

Speech and Auditory Interfaces for Ubiquitous, Immersive and Personalized Applications.
Proceedings of the Symposia and Workshops on Ubiquitous, 2010

Multi-modal feature integration for story boundary detection in broadcast news.
Proceedings of the 7th International Symposium on Chinese Spoken Language Processing, 2010

Dual-microphone noise reduction based on semi-blind DUET.
Proceedings of the 7th International Symposium on Chinese Spoken Language Processing, 2010

Phoneme lattice based texttiling towards multilingual story segmentation.
Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

Maximum lexical cohesion for fine-grained news story segmentation.
Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

2009
Audio-visual human recognition using semi-supervised spectral learning and hidden Markov models.
J. Vis. Lang. Comput., 2009

Noise robust features for speech/music discrimination in real-time telecommunication.
Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, 2009

A Subword Normalized Cut Approach to Automatic Story Segmentation of Chinese Broadcast News.
Proceedings of the Information Retrieval Technology, 2009

Multicue Graph Mincut for Image Segmentation.
Proceedings of the Computer Vision, 2009

2008
Type-2 fuzzy Gaussian mixture models.
Pattern Recognit., 2008

Subword Lexical Chaining for Automatic Story Segmentation in Chinese Broadcast News.
Proceedings of the Advances in Multimedia Information Processing, 2008

Subword Latent Semantic Analysis for Texttiling-Based Automatic Story Segmentation of Chinese Broadcast News.
Proceedings of the 6th International Symposium on Chinese Spoken Language Processing, 2008

A Heuristic Approach to Caption Enhancement for Effective Video OCR.
Proceedings of the Advanced Intelligent Computing Theories and Applications. With Aspects of Theoretical and Methodological Issues, 2008

Multi-Scale TextTiling for Automatic Story Segmentation in Chinese Broadcast News.
Proceedings of the Information Retrieval Technology, 2008

2007
Realistic Mouth-Synching for Speech-Driven Talking Face Using Articulatory Modelling.
IEEE Trans. Multim., 2007

A coupled HMM approach to video-realistic speech animation.
Pattern Recognit., 2007

Combined Use of Speaker- and Tone-Normalized Pitch Reset with Pause Duration for Automatic Story Segmentation in Mandarin Broadcast News.
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 2007

Modeling the statistical behavior of lexical chains to capture word cohesiveness for automatic story segmentation.
Proceedings of the 8th Annual Conference of the International Speech Communication Association, 2007

2006
2D/3D Web Visualization on Mobile Devices.
Proceedings of the Web Information Systems, 2006

Lip Assistant: Visualize Speech for Hearing Impaired People in Multimedia Services.
Proceedings of the IEEE International Conference on Systems, 2006

The SOMN-HMM Model and Its Application to Automatic Synthesis of 3D Character Animations.
Proceedings of the IEEE International Conference on Systems, 2006

Supervised Learning of Motion Style for Real-time Synthesis of 3D Character Animations.
Proceedings of the IEEE International Conference on Systems, 2006

A Cantonese Speech-Driven Talking Face Using Translingual Audio-to-Visual Conversion.
Proceedings of the Chinese Spoken Language Processing, 5th International Symposium, 2006

Speech Animation Using Coupled Hidden Markov Models.
Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), 2006

An Articulatory Approach to Video-Realistic Mouth Animation.
Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing, 2006

2005
Multi-stream Articulator Model with Adaptive Reliability Measure for Audio Visual Speech Recognition.
Proceedings of the Advances in Machine Learning and Cybernetics, 2005

