Zejun Ma

Orcid: 0009-0009-6731-0541

According to our database, Zejun Ma authored at least 105 papers between 2010 and 2024.

Bibliography

2024
BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance.
IEEE Trans. Neural Networks Learn. Syst., August, 2024

Video Instruction Tuning With Synthetic Data.
CoRR, 2024

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models.
CoRR, 2024

Can Large Language Models Understand Spatial Audio?
CoRR, 2024

MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning.
CoRR, 2024

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

SALMONN: Towards Generic Hearing Abilities for Large Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

PolyVoice: Language Models for Speech to Speech Translation.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Connecting Speech Encoder and Large Language Model for ASR.
Proceedings of the IEEE International Conference on Acoustics, 2024

Extending Large Language Models for Speech and Audio Captioning.
Proceedings of the IEEE International Conference on Acoustics, 2024

Extending Multilingual ASR to New Languages Using Supplementary Encoder and Decoder Components.
Proceedings of the IEEE International Conference on Acoustics, 2024

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR.
Proceedings of the IEEE International Conference on Acoustics, 2024

2023
Adaptive Transfer Kernel Learning for Transfer Gaussian Process Regression.
IEEE Trans. Pattern Anal. Mach. Intell., June, 2023

Transfer Kernel Learning for Multi-Source Transfer Gaussian Process Regression.
IEEE Trans. Pattern Anal. Mach. Intell., March, 2023

Graph contrastive learning with implicit augmentations.
Neural Networks, 2023

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models.
CoRR, 2023

Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts.
CoRR, 2023

Language-specific Acoustic Boundary Learning for Mandarin-English Code-switching Speech Recognition.
CoRR, 2023

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias.
CoRR, 2023

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis.
CoRR, 2023

PolyVoice: Language Models for Speech to Speech Translation.
CoRR, 2023

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation.
CoRR, 2023

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation.
CoRR, 2023

Unleashing Infinite-Length Input Capacity for Large-scale Language Models with Self-Controlled Memory System.
CoRR, 2023

Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions and Prospects.
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023

Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

S2CD: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Language-specific Boundary Learning for Improving Mandarin-English Code-switching Speech Recognition.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Knowledge Distillation Approach for Efficient Internal Language Model Estimation.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

AudioQR: Deep Neural Audio Watermarks For QR Code.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Dynamics Analysis of Large-Scale Transmission Tower-Line Coupled System under Measured Typhoon Load.
Proceedings of the 6th International Conference on Information Technologies and Electrical Engineering, 2023

Virtual Try-On with Pose-Garment Keypoints Guided Inpainting.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

LiteG2P: A Fast, Light and High Accuracy Model for Grapheme-to-Phoneme Conversion.
Proceedings of the IEEE International Conference on Acoustics, 2023

Internal Language Model Estimation Based Adaptive Language Model Fusion for Domain Adaptation.
Proceedings of the IEEE International Conference on Acoustics, 2023

An ASR-Free Fluency Scoring Approach with Self-Supervised Learning.
Proceedings of the IEEE International Conference on Acoustics, 2023

Leveraging Phone-Level Linguistic-Acoustic Similarity For Utterance-Level Pronunciation Scoring.
Proceedings of the IEEE International Conference on Acoustics, 2023

Bytecover3: Accurate Cover Song Identification On Short Queries.
Proceedings of the IEEE International Conference on Acoustics, 2023

Improving Large-Scale Deep Biasing With Phoneme Features and Text-Only Data in Streaming Transducer.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2022
Sequence-Level Speaker Change Detection With Difference-Based Continuous Integrate-and-Fire.
IEEE Signal Process. Lett., 2022

Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features.
CoRR, 2022

Improving short-video speech recognition using random utterance concatenation.
CoRR, 2022

Unsupervised Video Domain Adaptation: A Disentanglement Perspective.
CoRR, 2022

A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation.
CoRR, 2022

Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information.
CoRR, 2022

S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification.
CoRR, 2022

Improving Contextual Representation with Gloss Regularized Pre-training.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, 2022

GIO: A Timbre-informed Approach for Pitch Tracking in Highly Noisy Environments.
Proceedings of the ICMR '22: International Conference on Multimedia Retrieval, Newark, NJ, USA, June 27, 2022

Synthesising Audio Adversarial Examples for Automatic Speech Recognition.
Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

Importance Prioritized Policy Distillation.
Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

Latent feature augmentation for chorus detection.
Proceedings of the 23rd International Society for Music Information Retrieval Conference, 2022

Towards high-fidelity singing voice conversion with acoustic reference and contrastive predictive coding.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

A Transfer and Multi-Task Learning based Approach for MOS Prediction.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Bring dialogue-context into RNN-T for streaming ASR.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Using Fluency Representation Learned from Sequential Raw Features for Improving Non-native Fluency Scoring.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

BiFSMN: Binary Neural Network for Keyword Spotting.
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022

S3T: Self-Supervised Pre-Training with Swin Transformer For Music Classification.
Proceedings of the IEEE International Conference on Acoustics, 2022

Towards Using Clothes Style Transfer for Scenario-Aware Person Video Generation.
Proceedings of the IEEE International Conference on Acoustics, 2022

The Volcspeech System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.
Proceedings of the IEEE International Conference on Acoustics, 2022

Language Adaptive Cross-Lingual Speech Representation Learning with Sparse Sharing Sub-Networks.
Proceedings of the IEEE International Conference on Acoustics, 2022

Improving Pseudo-Label Training For End-To-End Speech Recognition Using Gradient Mask.
Proceedings of the IEEE International Conference on Acoustics, 2022

Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection.
Proceedings of the IEEE International Conference on Acoustics, 2022

Bytecover2: Towards Dimensionality Reduction of Latent Embedding for Efficient Cover Song Identification.
Proceedings of the IEEE International Conference on Acoustics, 2022

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection.
Proceedings of the IEEE International Conference on Acoustics, 2022

Dynamic Transfer Gaussian Process Regression.
Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022

Zero-Shot Audio Source Separation through Query-Based Learning from Weakly-Labeled Data.
Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021
Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech.
CoRR, 2021

Towards Realistic Visual Dubbing with Heterogeneous Sources.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders.
Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

HMM-Free Encoder Pre-Training for Streaming RNN Transducer.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Improving RNN Transducer Modeling for Small-Footprint Keyword Spotting.
Proceedings of the IEEE International Conference on Acoustics, 2021

A Chapter-Wise Understanding System for Text-To-Speech in Chinese Novels.
Proceedings of the IEEE International Conference on Acoustics, 2021

PPG-Based Singing Voice Conversion with Adversarial Representation Learning.
Proceedings of the IEEE International Conference on Acoustics, 2021

Rule-Embedded Network for Audio-Visual Voice Activity Detection in Live Musical Video Streams.
Proceedings of the IEEE International Conference on Acoustics, 2021

An Hrnet-Blstm Model With Two-Stage Training For Singing Melody Extraction.
Proceedings of the IEEE International Conference on Acoustics, 2021

Singing Melody Extraction from Polyphonic Music based on Spectral Correlation Modeling.
Proceedings of the IEEE International Conference on Acoustics, 2021

Bytecover: Cover Song Identification Via Multi-Loss Training.
Proceedings of the IEEE International Conference on Acoustics, 2021

2020
Improving RNN transducer with normalized jointer network.
CoRR, 2020

Dynamic latency speech recognition with asynchronous revision.
CoRR, 2020

Contrastive Unsupervised Learning for Audio Fingerprinting.
CoRR, 2020

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech.
CoRR, 2020

A Hybrid Text Normalization System Using Multi-Head Self-Attention For Mandarin.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

2019
Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2017
Deep LSTM for Large Vocabulary Continuous Speech Recognition.
CoRR, 2017

Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model.
CoRR, 2017

Exponential Moving Average Model in Parallel Speech Recognition Training.
CoRR, 2017

2012
Unsupervised training of subspace gaussian mixture models for conversational telephone speech recognition.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

2011
Fusing Multiple Confidence Measures for Chinese Spoken Term Detection.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

An Empirical Study of Multilingual Spoken Term Detection.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

2010
Distributed link-aware rate allocation for R-D optimal multiple video streaming over wireless networks.
Proceedings of the International Conference on Wireless Communications and Signal Processing, 2010

