Chao Zhang

Orcid: 0000-0002-7730-5131

Affiliations:
  • Tsinghua University, Department of Electronic Engineering, Beijing, China
  • University of Cambridge, Department of Engineering, UK (PhD 2017)


According to our database, Chao Zhang authored at least 113 papers between 2011 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Knowledge-aware audio-grounded generative slot filling for limited annotated data.
Comput. Speech Lang., 2025

2024
Graph Neural Networks for Contextual ASR With the Tree-Constrained Pointer Generator.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

Cross-Utterance Conditioned VAE for Speech Generation.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events.
CoRR, 2024

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation.
CoRR, 2024

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition.
CoRR, 2024

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models.
CoRR, 2024

Speaker Adaptation for Quantised End-to-End ASR Models.
CoRR, 2024

Confidence Estimation for Automatic Detection of Depression and Alzheimer's Disease Based on Clinical Interviews.
CoRR, 2024

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization.
CoRR, 2024

SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR.
CoRR, 2024

Can Large Language Models Understand Spatial Audio?
CoRR, 2024

An Improved Empirical Fisher Approximation for Natural Gradient Descent.
CoRR, 2024

Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models.
CoRR, 2024

Bayesian WeakS-to-Strong from Text Classification to Generation.
CoRR, 2024

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback.
CoRR, 2024

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models.
CoRR, 2024

CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models.
CoRR, 2024

M³AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset.
CoRR, 2024

Affect Recognition in Conversations Using Large Language Models.
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2024

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

SALMONN: Towards Generic Hearing Abilities for Large Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Enhancing Quantised End-to-End ASR Models Via Personalisation.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Connecting Speech Encoder and Large Language Model for ASR.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Can Whisper Perform Speech-Based In-Context Learning?
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Extending Large Language Models for Speech and Audio Captioning.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Bridging the Gap: Integrating Pre-Trained Speech Enhancement and Recognition Models for Robust Speech Recognition.
Proceedings of the 32nd European Signal Processing Conference, 2024

Bayesian Example Selection Improves In-Context Learning for Speech, Text and Visual Modalities.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Modelling Variability in Human Annotator Simulation.
Findings of the Association for Computational Linguistics, 2024

Speech-based Slot Filling using Large Language Models.
Findings of the Association for Computational Linguistics, 2024

2023
Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring.
Speech Commun., February, 2023

Prosody Modelling With Pre-Trained Cross-Utterance Representations for Improved Speech Synthesis.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

Minimising Biasing Word Errors for Contextual ASR With the Tree-Constrained Pointer Generator.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

Estimating the Uncertainty in Emotion Class Labels With Utterance-Specific Dirichlet Priors.
IEEE Trans. Affect. Comput., 2023

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models.
CoRR, 2023

Conditional Diffusion Model for Target Speaker Extraction.
CoRR, 2023

It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation.
CoRR, 2023

Affect Recognition in Conversations Using Large Language Models.
CoRR, 2023

Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition.
CoRR, 2023

Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Can Contextual Biasing Remain Effective with Whisper and GPT-2?
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

A Neural Time Alignment Module for End-to-End Automatic Speech Recognition.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

UML: A Universal Monolingual Output Layer For Multilingual ASR.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Self-Supervised Representations in Speech-Based Depression Detection.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

End-to-End Spoken Language Understanding with Tree-Constrained Pointer Generator.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Spectral Clustering-Aware Learning of Embeddings for Speaker Diarisation.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Context-Aware end-to-end ASR Using Self-Attentive Embedding and Tensor Fusion.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Improving Speech Enhancement Using Audio Tagging Knowledge From Pre-Trained Representations and Multi-Task Learning.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Transferring Speech-Generic and Depression-Specific Knowledge for Alzheimer's Disease Detection.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Estimating the Uncertainty in Emotion Attributes using Deep Evidential Regression.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

2022
On the similarities of representations in artificial and brain neural networks for speech recognition.
Frontiers Comput. Neurosci., 2022

Distribution-Based Emotion Recognition in Conversation.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

A Truly Multilingual First Pass and Monolingual Second Pass Streaming on-Device ASR System.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Turn-Taking Prediction for Natural Conversational Speech.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Improving the Fusion of Acoustic and Text Representations in RNN-T.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

2021
Combination of deep speaker embeddings for diarisation.
Neural Networks, 2021

A distributed optimisation framework combining natural gradient with Hessian-free for discriminative sequence training.
Neural Networks, 2021

Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition.
CoRR, 2021

Discriminative Neural Clustering for Speaker Diarisation.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association (Interspeech 2021), Brno, Czechia, 2021

Neural Kalman Filtering for Speech Enhancement.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Emotion Recognition by Fusing Time Synchronous and Time Asynchronous Representations.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Content-Aware Speaker Embeddings for Speaker Diarisation.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Transformer Language Models with LSTM-Based Cross-Utterance Information Representation.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Dian: Duration Informed Auto-Regressive Network for Voice Cloning.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

2020
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications.
IEEE J. Sel. Top. Signal Process., 2020

Introduction to the Special Issue on Deep Learning for Multi-Modal Intelligence Across Speech, Language, Vision, and Heterogeneous Signals.
IEEE J. Sel. Top. Signal Process., 2020

Cross-Utterance Language Models with Acoustic Error Sampling.
CoRR, 2020

Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi-Task Learning.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

The JD AI Speaker Verification System for the FFSVC 2020 Challenge.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Improved Large-Margin Softmax Loss for Speaker Diarisation.
Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

2019
Direct-Path Signal Cross-Correlation Estimation for Sound Source Localization in Reverberation.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Multi-Span Acoustic Modelling Using Raw Waveform Signals.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

PyHTK: Python Library and ASR Pipelines for HTK.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019

Speaker Diarisation Using 2D Self-attentive Combination of Embeddings.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019

Integrating Source-Channel and Attention-Based Sequence-to-Sequence Models for Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2018
Semi-tied Units for Efficient Gating in LSTM and Highway Networks.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Speaker Adaptation and Adaptive Training for Jointly Optimised Tandem Systems.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

High Order Recurrent Neural Networks for Acoustic Modelling.
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018

Improved TDNNs Using Deep Kernels and Frequency Dependent Grid-RNNs.
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018

2017
Joint training methods for tandem and hybrid speech recognition systems using deep neural networks.
PhD thesis, 2017

Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem.
PLoS Comput. Biol., 2017

Joint optimisation of tandem systems using Gaussian mixture density neural network discriminative sequence training.
Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, 2017

2016
Selection of Multi-Genre Broadcast Data for the Training of Automatic Speech Recognition Systems.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

DNN speaker adaptation using parameterised sigmoid and ReLU hidden activation functions.
Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016

System combination with log-linear models.
Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016

Improved DNN-based segmentation for multi-genre broadcast audio.
Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016

2015
A general artificial neural network extension for HTK.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Parameterised sigmoid and ReLU hidden activation functions for DNN acoustic modelling.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

The Cambridge University 2014 BOLT conversational telephone Mandarin Chinese LVCSR system for speech translation.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Cambridge University transcription systems for the multi-genre broadcast challenge.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

The development of the Cambridge University alignment systems for the multi-genre broadcast challenge.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

Speaker diarisation and longitudinal linking in multi-genre broadcast data.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

Structured discriminative models using deep neural-network features.
Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, 2015

2014
Standalone training of context-dependent deep neural network acoustic models.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014

2013
Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition.
IEEE Trans. Speech Audio Process., 2013

Investigation of multilingual deep neural networks for spoken term detection.
Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013

2012
Discriminative dynamic Gaussian mixture selection with enhanced robustness and performance for multi-accent speech recognition.
Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, 2012

2011
Reliable accent specific unit generation with dynamic Gaussian mixture selection for multi-accent speech recognition.
Proceedings of the 2011 IEEE International Conference on Multimedia and Expo, 2011

An In-car Chinese Noise Corpus for Speech Recognition.
Proceedings of the International Conference on Asian Language Processing, 2011

Detection-based accented speech recognition using articulatory features.
Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011
