Zhiyong Wu

Orcid: 0000-0001-8533-0524

Affiliations:
  • Tsinghua University, Joint Research Center for Media Sciences, Beijing, China (PhD)
  • Chinese University of Hong Kong, Hong Kong


According to our database, Zhiyong Wu authored at least 242 papers between 2000 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing.
IEEE ACM Trans. Audio Speech Lang. Process., 2024

MuCodec: Ultra Low-Bitrate Music Codec.
CoRR, 2024

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions.
CoRR, 2024

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis.
CoRR, 2024

An End-to-End Approach for Chord-Conditioned Song Generation.
CoRR, 2024

RobustSVC: HuBERT-based Melody Extractor and Adversarial Learning for Robust Singing Voice Conversion.
CoRR, 2024

SongCreator: Lyrics-based Universal Song Generation.
CoRR, 2024

Comparing Discrete and Continuous Space LLMs for Speech Recognition.
CoRR, 2024

Foundation Models for Music: A Survey.
CoRR, 2024

MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement.
CoRR, 2024

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models.
CoRR, 2024

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction.
CoRR, 2024

Multimodal Emotion Captioning Using Large Language Model with Prompt Engineering.
Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

Robust Representation Learning for Multimodal Emotion Recognition with Contrastive Learning and Mixup.
Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing, 2024

SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Representation Space Maintenance: Against Forgetting in Continual Learning.
Proceedings of the International Joint Conference on Neural Networks, 2024

NRAdapt: Noise-Robust Adaptive Text to Speech Using Untranscribed Data.
Proceedings of the International Joint Conference on Neural Networks, 2024

Hydraformer: One Encoder for All Subsampling Rates.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2024

FreeTalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness.
Proceedings of the IEEE International Conference on Acoustics, 2024

Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models.
Proceedings of the IEEE International Conference on Acoustics, 2024

Consistent and Relevant: Rethink the Query Embedding in General Sound Separation.
Proceedings of the IEEE International Conference on Acoustics, 2024

SCNet: Sparse Compression Network for Music Source Separation.
Proceedings of the IEEE International Conference on Acoustics, 2024

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion.
Proceedings of the IEEE International Conference on Acoustics, 2024

Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations.
Proceedings of the IEEE International Conference on Acoustics, 2024

Multi-View Midivae: Fusing Track- and Bar-View Representations for Long Multi-Track Symbolic Music Generation.
Proceedings of the IEEE International Conference on Acoustics, 2024

Generating Stereophonic Music with Single-Stage Language Models.
Proceedings of the IEEE International Conference on Acoustics, 2024

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts.
Proceedings of the IEEE International Conference on Acoustics, 2024

Enhancing Expressiveness in Dance Generation Via Integrating Frequency and Music Style Information.
Proceedings of the IEEE International Conference on Acoustics, 2024

Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2024

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.
Proceedings of the IEEE International Conference on Acoustics, 2024

Collaboration of Digital Human Gestures and Teaching Materials for Enhanced Integration in MOOC Teaching Scenarios.
Proceedings of the HCI International 2024 Posters, 2024

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

SECap: Speech Emotion Captioning with Large Language Model.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

SimCalib: Graph Neural Network Calibration Based on Similarity between Nodes.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis.
IEEE ACM Trans. Audio Speech Lang. Process., 2023

Lite-RTSE: Exploring a Cost-Effective Lite DNN Model for Real-Time Speech Enhancement in RTC Scenarios.
IEEE Signal Process. Lett., 2023

Stable Score Distillation for High-Quality 3D Generation.
CoRR, 2023

AdaMesh: Personalized Facial Expressions and Head Poses for Speech-Driven 3D Facial Animation.
CoRR, 2023

A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis.
CoRR, 2023

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis.
CoRR, 2023

Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing.
CoRR, 2023

First-order Multi-label Learning with Cross-modal Interactions for Multimodal Emotion Recognition.
Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing, 2023

UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Prosody Modeling with 3D Visual Information for Expressive Video Dubbing.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Gesper: A Restoration-Enhancement Framework for General Speech Reconstruction.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

The DiffuseStyleGesture+ entry to the GENEA Challenge 2023.
Proceedings of the 25th International Conference on Multimodal Interaction, 2023

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation-based Voice Conversion.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2023

SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2023

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

GTN-Bailando: Genre Consistent Long-Term 3D Dance Generation Based on Pre-Trained Genre Token Network.
Proceedings of the IEEE International Conference on Acoustics, 2023

Enhancing the Vocal Range of Single-Speaker Singing Voice Synthesis with Melody-Unsupervised Pre-Training.
Proceedings of the IEEE International Conference on Acoustics, 2023

Keyword-Specific Acoustic Model Pruning for Open-Vocabulary Keyword Spotting.
Proceedings of the IEEE International Conference on Acoustics, 2023

CB-Conformer: Contextual Biasing Conformer for Biased Word Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification.
Proceedings of the IEEE International Conference on Acoustics, 2023

A Synthetic Corpus Generation Method for Neural Vocoder Training.
Proceedings of the IEEE International Conference on Acoustics, 2023

TFCnet: Time-Frequency Domain Corrector for Speech Separation.
Proceedings of the IEEE International Conference on Acoustics, 2023

TrimTail: Low-Latency Streaming ASR with Simple But Effective Spectrogram-Level Length Penalty.
Proceedings of the IEEE International Conference on Acoustics, 2023

Av-Sepformer: Cross-Attention Sepformer for Audio-Visual Target Speaker Extraction.
Proceedings of the IEEE International Conference on Acoustics, 2023

Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2023

LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech.
Proceedings of the IEEE International Conference on Acoustics, 2023

Gesper: A Unified Framework for General Speech Restoration.
Proceedings of the IEEE International Conference on Acoustics, 2023

Inter-Subnet: Speech Enhancement with Subband Interaction.
Proceedings of the IEEE International Conference on Acoustics, 2023

Wavsyncswap: End-To-End Portrait-Customized Audio-Driven Talking Face Generation.
Proceedings of the IEEE International Conference on Acoustics, 2023

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Robust Representation Learning for Speech Emotion Recognition with Moment Exchange.
Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2023

What Does Your Face Sound Like? 3D Face Shape towards Voice.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning.
IEEE ACM Trans. Audio Speech Lang. Process., 2022

FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition.
CoRR, 2022

Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using β-VAE.
CoRR, 2022

Ordinal Regression via Binary Preference vs Simple Regression: Statistical and Experimental Perspectives.
CoRR, 2022

Disentangling Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion.
CoRR, 2022

Disentangled Speech Representation Learning for One-Shot Cross-Lingual Voice Conversion Using β-VAE.
Proceedings of the IEEE Spoken Language Technology Workshop, 2022

Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion.
Proceedings of the Odyssey 2022: The Speaker and Language Recognition Workshop, 28 June, 2022

Inferring Speaking Styles from Multi-modal Conversational Context by Multi-scale Relational Graph Convolutional Networks.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

Boosting the Performance of SpEx+ by Attention and Contextual Mechanism.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

HILvoice: Human-in-the-Loop Style Selection for Elder-Facing Speech Synthesis.
Proceedings of the 13th International Symposium on Chinese Spoken Language Processing, 2022

Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Towards Cross-speaker Reading Style Transfer on Audiobook Dataset.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Speech Enhancement with Fullband-Subband Cross-Attention Network.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Speaker Characteristics Guided Speech Synthesis.
Proceedings of the International Joint Conference on Neural Networks, 2022

The ReprGesture entry to the GENEA Challenge 2022.
Proceedings of the International Conference on Multimodal Interaction, 2022

Learning from Designers: Fashion Compatibility Analysis Via Dataset Distillation.
Proceedings of the 2022 IEEE International Conference on Image Processing, 2022

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion.
Proceedings of the IEEE International Conference on Acoustics, 2022

An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings.
Proceedings of the IEEE International Conference on Acoustics, 2022

Neural Architecture Search for Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022

Adversarial Sample Detection for Speaker Verification by Neural Vocoders.
Proceedings of the IEEE International Conference on Acoustics, 2022

Neufa: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism.
Proceedings of the IEEE International Conference on Acoustics, 2022

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-Based Multi-Modal Context Modeling.
Proceedings of the IEEE International Conference on Acoustics, 2022

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2022

An End-to-End Chinese Text Normalization Model Based on Rule-Guided Flat-Lattice Transformer.
Proceedings of the IEEE International Conference on Acoustics, 2022

FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement.
Proceedings of the IEEE International Conference on Acoustics, 2022

Transformer-S2A: Robust and Efficient Speech-to-Animation.
Proceedings of the IEEE International Conference on Acoustics, 2022

A Character-Level Span-Based Model for Mandarin Prosodic Structure Prediction.
Proceedings of the IEEE International Conference on Acoustics, 2022

Unsupervised Multi-scale Expressive Speaking Style Modeling with Hierarchical Context Information for Audiobook Speech Synthesis.
Proceedings of the 29th International Conference on Computational Linguistics, 2022

2021
Speech Emotion Recognition Using Sequential Capsule Networks.
IEEE ACM Trans. Audio Speech Lang. Process., 2021

Exemplar-Based Emotive Speech Synthesis.
IEEE ACM Trans. Audio Speech Lang. Process., 2021

Spotting adversarial samples for speaker verification by neural vocoders.
CoRR, 2021

Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis.
CoRR, 2021

Dependency Parsing based Semantic Representation Learning with Graph Neural Network for Enhancing Expressiveness of Text-to-Speech.
CoRR, 2021

Adversarially learning disentangled speech representations for robust multi-factor voice conversion.
CoRR, 2021

Controllable Emphatic Speech Synthesis based on Forward Attention for Expressive Speech Synthesis.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Voting for the Right Answer: Adversarial Defense for Speaker Verification.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Towards Multi-Scale Style Control for Expressive Speech Synthesis.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Adversarial Defense for Automatic Speaker Verification by Cascaded Self-Supervised Learning Models.
Proceedings of the IEEE International Conference on Acoustics, 2021

The Huya Multi-Speaker and Multi-Style Speech Synthesis System for M2voc Challenge 2020.
Proceedings of the IEEE International Conference on Acoustics, 2021

Improving Pronunciation Assessment Via Ordinal Regression with Anchored Reference Samples.
Proceedings of the IEEE International Conference on Acoustics, 2021

Syntactic Representation Learning For Neural Network Based TTS with Syntactic Parse Tree Traversal.
Proceedings of the IEEE International Conference on Acoustics, 2021

Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input.
Proceedings of the IEEE International Conference on Acoustics, 2021

Emotion Controllable Speech Synthesis Using Emotion-Unlabeled Dataset with the Assistance of Cross-Domain Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021

Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

PTeacher: a Computer-Aided Personalized Pronunciation Training System with Exaggerated Audio-Visual Corrective Feedback.
Proceedings of the CHI '21: CHI Conference on Human Factors in Computing Systems, 2021

Reconstructing Dual Learning for Neural Voice Conversion Using Relatively Few Samples.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2021

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2021

Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation based Deep Learning Approach.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021

2020
Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network.
CoRR, 2020

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement.
CoRR, 2020

Re-Weighted Interval Loss for Handling Data Imbalance Problem of End-to-End Keyword Spotting.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Enhancing Monotonicity for Robust Autoregressive Transformer TTS.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

FERNet: Fine-grained Extraction and Reasoning Network for Emotion Recognition in Dialogues.
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020

Channel-Wise Dense Connection Graph Convolutional Network for Skeleton-Based Action Recognition.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

End-To-End Accent Conversion Without Using Native Utterances.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Code-Switched Speech Synthesis Using Bilingual Phonetic Posteriorgram with Only Monolingual Corpora.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks.
CoRR, 2019

One-Shot Voice Conversion with Global Speaker Embeddings.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Knowledge-Based Linguistic Encoding for End-to-End Mandarin Text-to-Speech Synthesis.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-Trained BERT.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Towards Discriminative Representation Learning for Speech Emotion Recognition.
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

Modeling Emotion Influence Using Attention-based Graph Convolutional Recurrent Network.
Proceedings of the International Conference on Multimodal Interaction, 2019

Speech Emotion Recognition Using Capsule Networks.
Proceedings of the IEEE International Conference on Acoustics, 2019

Quasi-fully Convolutional Neural Network with Variational Inference for Speech Synthesis.
Proceedings of the IEEE International Conference on Acoustics, 2019

NN-based Ordinal Regression for Assessing Fluency of ESL Speech.
Proceedings of the IEEE International Conference on Acoustics, 2019

A Compact Framework for Voice Conversion Using Wavenet Conditioned on Phonetic Posteriorgrams.
Proceedings of the IEEE International Conference on Acoustics, 2019

Dilated Residual Network with Multi-head Self-attention for Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2019

End-to-end Code-switched TTS with Mix of Monolingual Recordings.
Proceedings of the IEEE International Conference on Acoustics, 2019

Query-by-Example Spoken Term Detection using Attentive Pooling Networks.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

Learning Contextual Representation with Convolution Bank and Multi-head Self-attention for Speech Emphasis Detection.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

Automatic Prosodic Structure Labeling using DNN-BGRU-CRF Hybrid Neural Network.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

Prosodic Structure Prediction using Deep Self-attention Neural Network.
Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2019

Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection.
Proceedings of the 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, 2019

2018
Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks.
Speech Commun., 2018

Inferring User Emotive State Changes in Realistic Human-Computer Conversational Dialogs.
Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference, 2018

Speech Super-Resolution Using Parallel WaveNet.
Proceedings of the 11th International Symposium on Chinese Spoken Language Processing, 2018

Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Detection of Glottal Closure Instants from Speech Signals: A Convolutional Neural Network Based Method.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Integrating Articulatory Features into Acoustic-Phonemic Model for Mispronunciation Detection and Diagnosis in L2 English Speech.
Proceedings of the 2018 IEEE International Conference on Multimedia and Expo, 2018

Feature Based Adaptation for Speaking Style Synthesis.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Applying Multitask Learning to Acoustic-Phonemic Model for Mispronunciation Detection and Diagnosis in L2 English Speech.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Unsupervised Discovery of an Extended Phoneme Set in L2 English Speech for Mispronunciation Detection and Diagnosis.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMS for Expressive Speech Synthesis.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices.
Proceedings of the Artificial Intelligence and Mobile Services - AIMS 2018, 2018

Multi-modal Multi-scale Speech Expression Evaluation in Computer-Assisted Language Learning.
Proceedings of the Artificial Intelligence and Mobile Services - AIMS 2018, 2018

2017
Movie Recommendation via BLSTM.
Proceedings of the MultiMedia Modeling - 23rd International Conference, 2017

Speech Emotion Recognition with Emotion-Pair Based Framework Considering Emotion Distribution Information in Dimensional Emotion Space.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Learning cross-lingual knowledge with multilingual BLSTM for emphasis detection with limited training data.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Multi-task learning of structured output layer bidirectional LSTMS for speech synthesis.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016
Study on Feature Subspace of Archetypal Emotions for Speech Emotion Recognition.
CoRR, 2016

A Real-Time Gesture-Based Unmanned Aerial Vehicle Control System.
Proceedings of the Advances in Multimedia Information Processing - PCM 2016, 2016

Video Inpainting Based on Joint Gradient and Noise Minimization.
Proceedings of the Advances in Multimedia Information Processing - PCM 2016, 2016

3D modeling based on multiple Unmanned Aerial Vehicles with the optimal paths.
Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems, 2016

DBLSTM-based multi-task learning for pitch transformation in voice conversion.
Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, 2016

Analysis on Gated Recurrent Unit Based Question Detection Approach.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Combining CNN and BLSTM to Extract Textual and Acoustic Features for Recognizing Stances in Mandarin Ideological Debate Competition.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Heterogeneity-entropy based unsupervised feature learning for personality prediction with cross-media data.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2016

Recognizing stances in Mandarin social ideological debates with text and acoustic features.
Proceedings of the 2016 IEEE International Conference on Multimedia & Expo Workshops, 2016

Learning cross-lingual information with multilingual BLSTM for speech synthesis of low-resource languages.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Question detection from acoustic features using recurrent neural network with gated recurrent unit.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

2015
Acoustic to articulatory mapping with deep neural network.
Multim. Tools Appl., 2015

Generating emphatic speech with hidden Markov model for expressive speech synthesis.
Multim. Tools Appl., 2015

Polyphonic Music Modelling with LSTM-RTRBM.
Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15, Brisbane, Australia, October 26, 2015

Using tilt for automatic emphasis detection with Bayesian networks.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Modelling High-Dimensional Sequences with LSTM-RTRBM: Application to Polyphonic Music Generation.
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015

HMM-based emphatic speech synthesis for corrective feedback in computer-aided pronunciation training.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

A deep recurrent approach for acoustic-to-articulatory inversion.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Understanding speaking styles of internet speech data with LSTM and low-resource training.
Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, 2015

2014
Synthesizing English emphatic speech for multimodal corrective feedback in computer-aided pronunciation training.
Multim. Tools Appl., 2014

Head and facial gestures synthesis using PAD model for an expressive talking avatar.
Multim. Tools Appl., 2014

Automatic speech data clustering with human perception based weighted distance.
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, 2014

Multi-channel speech enhancement using sparse coding on local time-frequency structures.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Using conditional random fields to predict focus word pair in spontaneous spoken English.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Contrastive auto-encoder for phoneme recognition.
Proceedings of the IEEE International Conference on Acoustics, 2014

Learning dynamic features with neural networks for phoneme recognition.
Proceedings of the IEEE International Conference on Acoustics, 2014

Automatic Emotion Variation Detection in continuous speech.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2014

2013
Feature Learning with Gaussian Restricted Boltzmann Machine for Robust Speech Recognition.
CoRR, 2013

Investigation of tandem deep belief network approach for phoneme recognition.
Proceedings of the IEEE International Conference on Acoustics, 2013

A real-time speech driven talking avatar based on deep neural network.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

Frequency-domain dereverberation on speech signal using surround retinex.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

Sparse coding for sound event classification.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

Comparing feature dimension reduction algorithms for GMM-SVM based speech emotion recognition.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

2012
Comparison of adaptation methods for GMM-SVM based speech emotion recognition.
Proceedings of the 2012 IEEE Spoken Language Technology Workshop (SLT), 2012

Adaptive named entity recognition based on conditional random fields with automatic updated dynamic gazetteers.
Proceedings of the 8th International Symposium on Chinese Spoken Language Processing, 2012

Detection and emphatic realization of contrastive word pairs for expressive text-to-speech synthesis.
Proceedings of the 8th International Symposium on Chinese Spoken Language Processing, 2012

Perceptual clustering based unit selection optimization for concatenative text-to-speech synthesis.
Proceedings of the 8th International Symposium on Chinese Spoken Language Processing, 2012

Hierarchical English Emphatic Speech Synthesis Based on HMM with Limited Training Data.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Modeling the correlation between modality semantics and facial expressions.
Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2012

2011
Combining Active and Semi-Supervised Learning for Homograph Disambiguation in Mandarin Text-to-Speech Synthesis.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

2010
Modeling prosody patterns for Chinese expressive text-to-speech synthesis.
Proceedings of the 7th International Symposium on Chinese Spoken Language Processing, 2010

Comparison of Syllable/Phone HMM Based Mandarin TTS.
Proceedings of the 20th International Conference on Pattern Recognition, 2010

Facial Expression Synthesis Based on Emotion Dimensions for Affective Talking Avatar.
Proceedings of the Modeling Machine Emotions for Realizing Intelligence, 2010

2009
Modeling the Expressivity of Input Text Semantics for Chinese Text-to-Speech Synthesis in a Spoken Dialog System.
IEEE Trans. Speech Audio Process., 2009

2008
The Use of Dynamic Deformable Templates for Lip Tracking in an Audio-Visual Corpus with Large Variations in Head Pose, Face Illumination and Lip Shapes.
Proceedings of the 6th International Symposium on Chinese Spoken Language Processing, 2008

A New Prosodic Strength Calculation Method for Prosody Reduction Modeling.
Proceedings of the 6th International Symposium on Chinese Spoken Language Processing, 2008

2007
Head Movement Synthesis Based on Semantic and Prosodic Features for a Chinese Expressive Avatar.
Proceedings of the IEEE International Conference on Acoustics, 2007

Facial Expression Synthesis Using PAD Emotional Parameters for a Chinese Expressive Avatar.
Proceedings of the Affective Computing and Intelligent Interaction, 2007

2006
Modelling the Global Acoustic Correlates of Expressivity for Chinese Text-to-speech Synthesis.
Proceedings of the 2006 IEEE ACL Spoken Language Technology Workshop, 2006

A Corpus-Based Approach for Cooperative Response Generation in a Dialog System.
Proceedings of the Chinese Spoken Language Processing, 5th International Symposium, 2006

Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar.
Proceedings of the Ninth International Conference on Spoken Language Processing, 2006

Multi-level Fusion of Audio and Visual Features for Speaker Identification.
Proceedings of the Advances in Biometrics, International Conference, 2006

2000
Research on dynamic characters of Chinese pitch contours.
Proceedings of the Sixth International Conference on Spoken Language Processing, 2000

