Xie Chen

Orcid: 0000-0001-7423-617X

Affiliations:

Shanghai Jiao Tong University, China
Microsoft, Redmond, WA, USA (former)
University of Cambridge, UK (former)

According to our database, Xie Chen authored at least 116 papers between 2011 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2025

Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models.

CoRR, January, 2025

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model.

CoRR, January, 2025

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization.

CoRR, January, 2025

2024

E$^{3}$TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

IEEE ACM Trans. Audio Speech Lang. Process., 2024

Investigating Acoustic-Textual Emotional Inconsistency Information for Automatic Depression Detection.

CoRR, 2024

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective.

CoRR, 2024

VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization.

CoRR, 2024

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap.

CoRR, 2024

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec.

CoRR, 2024

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs.

CoRR, 2024

DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning.

CoRR, 2024

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching.

CoRR, 2024

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought.

CoRR, 2024

Exploring SSL Discrete Tokens for Multilingual ASR.

CoRR, 2024

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR.

CoRR, 2024

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders.

CoRR, 2024

Progressive Residual Extraction based Pre-training for Speech Representation Learning.

CoRR, 2024

Language Model Can Listen While Speaking.

CoRR, 2024

On the Effectiveness of Acoustic BPE in Decoder-Only TTS.

CoRR, 2024

TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers.

CoRR, 2024

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement.

CoRR, 2024

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection.

CoRR, 2024

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units.

CoRR, 2024

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark.

CoRR, 2024

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR.

CoRR, 2024

MaLa-ASR: Multimedia-Assisted LLM-Based ASR.

CoRR, 2024

GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting.

CoRR, 2024

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge.

CoRR, 2024

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity.

CoRR, 2024

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech.

CoRR, 2024

ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering.

CoRR, 2024

CTC-Assisted LLM-Based Contextual ASR.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

Attention-Constrained Inference For Robust Decoder-Only Text-to-Speech.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

NDVQ: Robust Neural Audio Codec With Normal Distribution-Based Vector Quantization.

Proceedings of the IEEE Spoken Language Technology Workshop, 2024

1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem.

Proceedings of the Odyssey 2024: The Speaker and Language Recognition Workshop, 2024

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition.

Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding.

Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Improving Emotion Recognition with Pre-Trained Models, Multimodality, and Contextual Information.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

The X-Lance Technical Report for Interspeech 2024 Speech Processing using Discrete Speech Unit Challenge.

Proceedings of the 14th IEEE International Symposium on Chinese Spoken Language Processing, 2024

EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

BAT: Learning to Reason about Spatial Sounds with Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Improving Acoustic Scene Classification via Self-Supervised and Semi-Supervised Learning with Efficient Audio Transformer.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Semi-Supervised Acoustic Scene Classification with Test-Time Adaptation.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS.

Proceedings of the IEEE International Conference on Acoustics, 2024

Acoustic BPE for Speech Generation with Discrete Tokens.

Proceedings of the IEEE International Conference on Acoustics, 2024

Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2024

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations.

Proceedings of the IEEE International Conference on Acoustics, 2024

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention.

Proceedings of the IEEE International Conference on Acoustics, 2024

VoiceFlow: Efficient Text-To-Speech with Rectified Flow Matching.

Proceedings of the IEEE International Conference on Acoustics, 2024

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation.

Proceedings of the Findings of the Association for Computational Linguistics, 2024

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023

Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature.

IEEE ACM Trans. Audio Speech Lang. Process., 2023

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations.

CoRR, 2023

Improved Factorized Neural Transducer Model For text-only Domain Adaptation.

Junzhe Liu

Jianwei Yu

Xie Chen

CoRR, 2023

Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer.

CoRR, 2023

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation.

CoRR, 2023

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Blank-regularized CTC for Frame Skipping in Neural Transducer.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

An Adapter Based Multi-Label Pre-Training for Speech Separation and Enhancement.

Proceedings of the IEEE International Conference on Acoustics, 2023

EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance.

Proceedings of the IEEE International Conference on Acoustics, 2023

Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR.

Proceedings of the IEEE International Conference on Acoustics, 2023

LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

Proceedings of the IEEE International Conference on Acoustics, 2023

Front-End Adapter: Adapting Front-End Input of Speech Based Self-Supervised Learning for Speech Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2023

Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation.

Proceedings of the IEEE International Conference on Acoustics, 2023

Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022

Exploring Effective Fusion Algorithms for Speech Based Self-Supervised Learning Models.

CoRR, 2022

Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Factorized Neural Transducer for Efficient Language Model Adaptation.

Xie Chen

Zhong Meng

Sarangarajan Parthasarathy

Jinyu Li

Proceedings of the IEEE International Conference on Acoustics, 2022

2021

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition.

Zhong Meng

Sarangarajan Parthasarathy

Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Memory-Efficient Pipeline-Parallel DNN Training.

Proceedings of the 38th International Conference on Machine Learning, 2021

Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition.

Zhong Meng

Naoyuki Kanda

Yashesh Gaur

Sarangarajan Parthasarathy

Proceedings of the IEEE International Conference on Acoustics, 2021

Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset.

Proceedings of the IEEE International Conference on Acoustics, 2021

2020

LSTM-LM with Long-Term History for First-Pass Decoding in Conversational Speech Recognition.

Xie Chen

Sarangarajan Parthasarathy

William Gale

Shuangyu Chang

Michael Zeng

CoRR, 2020

Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers.

Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019

Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition.

Jeremy Heng Meng Wong

Mark J. F. Gales

IEEE ACM Trans. Audio Speech Lang. Process., 2019

Long-span language modeling for speech recognition.

Sarangarajan Parthasarathy

CoRR, 2019

Recurrent Neural Network Language Model Training Using Natural Gradient.

Proceedings of the IEEE International Conference on Acoustics, 2019

Gaussian Process LSTM Recurrent Neural Network Language Models for Speech Recognition.

Proceedings of the IEEE International Conference on Acoustics, 2019

Investigation of Sampling Techniques for Maximum Entropy Language Modeling Training.

Proceedings of the IEEE International Conference on Acoustics, 2019

2018

Active Memory Networks for Language Modeling.

Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Neural Network Language Modeling with Letter-Based Features and Importance Sampling.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Phonetic and Graphemic Systems for Multi-Genre Broadcast Transcription.

Jeremy Heng Meng Wong

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Limited-Memory BFGS Optimization of Recurrent Neural Network Language Models for Speech Recognition.

Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

The Effect of Adding Authorship Knowledge in Automated Text Scoring.

Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications@NAACL-HLT 2018, 2018

2017

Future Word Contexts in Neural Network Language Models.

CoRR, 2017

Investigating Bidirectional Recurrent Neural Network Language Models for Speech Recognition.

Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Exploiting the Tibetan Radicals in Recurrent Neural Network for Low-Resource Language Models.

Proceedings of the Neural Information Processing - 24th International Conference, 2017

Recurrent neural network language models for keyword search.

Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Future word contexts in neural network language models.

Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

2016

Two Efficient Lattice Rescoring Methods Using Recurrent Neural Network Language Models.

IEEE ACM Trans. Audio Speech Lang. Process., 2016

Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition.

IEEE ACM Trans. Audio Speech Lang. Process., 2016

Multi-Language Neural Network Language Models.

Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

CUED-RNNLM - An open-source toolkit for efficient training and evaluation of recurrent neural network language models.

Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

2015

Recurrent neural network language model adaptation for multi-genre broadcast speech recognition.

Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Paraphrastic recurrent neural network language models.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Robust excitation-based features for Automatic Speech Recognition.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Recurrent neural network language model training with noise contrastive estimation for speech recognition.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Improving the training and evaluation efficiency of recurrent neural network language models.

Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Investigation of back-off based interpolation between recurrent neural network and n-gram language models.

Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

2014

Efficient GPU-based training of recurrent neural network language models using spliced sentence bunch.

Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

An initial investigation of long-term adaptation for meeting transcription.

Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Impact of single-microphone dereverberation on DNN-based meeting transcription systems.

Takuya Yoshioka

Xie Chen

Mark J. F. Gales

Proceedings of the IEEE International Conference on Acoustics, 2014

Efficient lattice rescoring using recurrent neural network language models.

Proceedings of the IEEE International Conference on Acoustics, 2014

2012

Pipelined Back-Propagation for Context-Dependent Deep Neural Networks.

Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

2011

Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription.

Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011
