Yu Zhang

ORCID: 0000-0002-9505-1833

Affiliations:
  • Google
  • Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA (PhD 2017)


According to our database, Yu Zhang authored at least 136 papers between 2013 and 2024.

Bibliography

2024
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study.
Proceedings of the IEEE International Conference on Acoustics, 2024

IG Captioner: Information Gain Captioners Are Strong Zero-Shot Classifiers.
Proceedings of the Computer Vision - ECCV 2024, 2024

2023
SLM: Bridge the thin gap between speech and text foundation models.
CoRR, 2023

Multimodal Modeling For Spoken Language Identification.
CoRR, 2023

AudioPaLM: A Large Language Model That Can Speak and Listen.
CoRR, 2023

Efficient Adapters for Giant Speech Models.
CoRR, 2023

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages.
CoRR, 2023

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations.
Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2023

Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Mixture-of-Expert Conformer for Streaming Multilingual ASR.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

How to Estimate Model Transferability of Pre-Trained Speech Models?
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Mu²SLAM: Multitask, Multilingual Speech and Language Models.
Proceedings of the International Conference on Machine Learning, 2023

A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Understanding Shared Speech-Text Representations.
Proceedings of the IEEE International Conference on Acoustics, 2023

Accelerating RNN-T Training and Inference Using CTC Guidance.
Proceedings of the IEEE International Conference on Acoustics, 2023

Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech.
Proceedings of the IEEE International Conference on Acoustics, 2023

JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Efficient Domain Adaptation for Speech Foundation Models.
Proceedings of the IEEE International Conference on Acoustics, 2023

Comparison of Soft and Hard Target RNN-T Distillation for Large-Scale ASR.
Proceedings of the IEEE International Conference on Acoustics, 2023

Massively Multilingual Shallow Fusion with Large Language Models.
Proceedings of the IEEE International Conference on Acoustics, 2023

SLM: Bridge the Thin Gap Between Speech and Text Foundation Models.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

Improving Multilingual and Code-Switching ASR Using Large Language Model Generated Text.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

E3 TTS: Easy End-to-End Diffusion-Based Text To Speech.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

2022
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition.
IEEE J. Sel. Top. Signal Process., 2022

Ask2Mask: Guided Data Selection for Masked Speech Modeling.
IEEE J. Sel. Top. Signal Process., 2022

Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation.
CoRR, 2022

mSLAM: Massively multilingual joint pre-training for speech and text.
CoRR, 2022

JOIST: A Joint Speech and Text Streaming Model for ASR.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Modular Hybrid Autoregressive Transducer.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Improving Generalizability of Distilled Self-Supervised Speech Processing Models Under Distorted Settings.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Maestro-U: Leveraging Joint Speech-Text Representation Learning for Zero Supervised Speech ASR.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Unsupervised Data Selection via Discrete Speech Representation for ASR.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

XTREME-S: Evaluating Cross-lingual Speech Representations.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

MAESTRO: Matched Speech Text Representations through Modality Matching.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Reducing Domain mismatch in Self-supervised speech pre-training.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Universal Paralinguistic Speech Representations Using Self-Supervised Conformers.
Proceedings of the IEEE International Conference on Acoustics, 2022


Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022

Massively Multilingual ASR: A Lifelong Learning Solution.
Proceedings of the IEEE International Conference on Acoustics, 2022

TTS4Pretrain 2.0: Advancing the use of Text and Speech in ASR Pretraining with Consistency and Contrastive Losses.
Proceedings of the IEEE International Conference on Acoustics, 2022

Joint Unsupervised and Supervised Training for Multilingual ASR.
Proceedings of the IEEE International Conference on Acoustics, 2022

2021
SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training.
CoRR, 2021

Scaling End-to-End Models for Large-Scale Multilingual ASR.
CoRR, 2021

SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network.
CoRR, 2021

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Unsupervised Learning of Disentangled Speech Content and Style Representation.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Residual Energy-Based Models for End-to-End Speech Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

WaveGrad: Estimating Gradients for Waveform Generation.
Proceedings of the 9th International Conference on Learning Representations, 2021

Echo State Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

Learning Word-Level Confidence for Subword End-To-End ASR.
Proceedings of the IEEE International Conference on Acoustics, 2021

Confidence Estimation for Attention-Based Sequence-to-Sequence Models for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

A Better and Faster end-to-end Model for Streaming ASR.
Proceedings of the IEEE International Conference on Acoustics, 2021

Parallel Tacotron: Non-Autoregressive and Controllable TTS.
Proceedings of the IEEE International Conference on Acoustics, 2021

Improving Streaming Automatic Speech Recognition with Non-Streaming Model Distillation on Unsupervised Data.
Proceedings of the IEEE International Conference on Acoustics, 2021

Scaling End-to-End Models for Large-Scale Multilingual ASR.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Injecting Text in Self-Supervised Speech Pretraining.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

2020
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition.
CoRR, 2020

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling.
CoRR, 2020

A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency.
CoRR, 2020

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior.
CoRR, 2020

A Large Scale Speech Sentiment Corpus.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Improved Noisy Student Training for Automatic Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Conformer: Convolution-augmented Transformer for Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Multistate Encoding with End-To-End Speech RNN Transducer Network.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Improving Speech Recognition Using Consistent Predictions on Synthesized Speech.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Fully-Hierarchical Fine-Grained Prosody Modeling For Interpretable Speech Synthesis.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020


SpecAugment on Large Scale Datasets.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Scalability in Perception for Autonomous Driving: Waymo Open Dataset.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
The ASVspoof 2019 database.
CoRR, 2019

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling.
CoRR, 2019

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Hierarchical Generative Modeling for Controllable Speech Synthesis.
Proceedings of the 7th International Conference on Learning Representations, 2019

Bytes Are All You Need: End-to-end Multilingual Speech Recognition and Synthesis with Bytes.
Proceedings of the IEEE International Conference on Acoustics, 2019

Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization.
Proceedings of the IEEE International Conference on Acoustics, 2019

Cycle-consistency Training for End-to-end Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2019

End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds.
Proceedings of the 3rd Annual Conference on Robot Learning, 2019

Speech Recognition with Augmented Synthesized Speech.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

A Comparison of End-to-End Models for Long-Form Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2018
Back-Translation-Style Data Augmentation for end-to-end ASR.
Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
Proceedings of the 35th International Conference on Machine Learning, 2018

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

2017
Exploring neural network architectures for acoustic modeling.
PhD thesis, 2017

Training RNNs as Fast as CNNs.
CoRR, 2017

Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data.
Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017

Learning Latent Representations for Speech Generation and Transformation.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Latent Sequence Decompositions.
Proceedings of the 5th International Conference on Learning Representations, 2017

Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Advanced Recurrent Neural Networks for Automatic Speech Recognition.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning, 2017

Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning, 2017

Sequence-Discriminative Training of Neural Networks.
Proceedings of the New Era for Robust Speech Recognition, Exploiting Deep Learning, 2017

2016
Recurrent Neural Network Encoder with Attention for Community Question Answering.
CoRR, 2016

A prioritized grid long short-term memory RNN for speech recognition.
Proceedings of the 2016 IEEE Spoken Language Technology Workshop, 2016

SLS at SemEval-2016 Task 3: Neural-based Approaches for Ranking in Community Question Answering.
Proceedings of the 10th International Workshop on Semantic Evaluation, 2016

On training bi-directional neural network language model with noise contrastive estimation.
Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

Exploiting Depth and Highway Connections in Convolutional Recurrent Deep Neural Networks for Speech Recognition.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Highway long short-term memory RNNs for distant speech recognition.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Prediction-adaptation-correction recurrent neural networks for low-resource language speech recognition.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Deep beamforming networks for multi-channel speech recognition.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Speaker-aware training of LSTM-RNNs for acoustic modelling.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Integrated adaptation with multi-factor joint-learning for far-field speech recognition.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Multilingual data selection for training stacked bottleneck features.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Neural Attention for Learning to Rank Questions in Community Question Answering.
Proceedings of the COLING 2016, 2016

2015
The Computational Network Toolkit [Best of the Web].
IEEE Signal Process. Mag., 2015

Speaker adaptation using the i-vector technique for bottleneck features.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Speech recognition with prediction-adaptation-correction recurrent neural networks.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

2014
Spoken language understanding using long short-term memory neural networks.
Proceedings of the 2014 IEEE Spoken Language Technology Workshop, 2014

Graph-based re-ranking using acoustic feature similarity between search results for spoken term detection on low-resource languages.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Language ID-based training of multilingual stacked bottleneck features.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Recent advances in ASR applied to an Arabic transcription system for Al-Jazeera.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Extracting deep neural network bottleneck features using low-rank matrix factorization.
Proceedings of the IEEE International Conference on Acoustics, 2014

2013
Joint Learning of Phonetic Units and Word Pronunciations for ASR.
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013

