Zejun Ma

Orcid: 0009-0009-6731-0541

According to our database¹, Zejun Ma authored at least 105 papers between 2010 and 2024.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2024

BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance.

IEEE Trans. Neural Networks Learn. Syst., August, 2024

Video Instruction Tuning With Synthetic Data.

CoRR, 2024

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models.

CoRR, 2024

Can Large Language Models Understand Spatial Audio?

CoRR, 2024

MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning.

CoRR, 2024

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.

Proceedings of the Forty-first International Conference on Machine Learning, 2024

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

SALMONN: Towards Generic Hearing Abilities for Large Language Models.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

PolyVoice: Language Models for Speech to Speech Translation.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

Connecting Speech Encoder and Large Language Model for ASR.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Extending Large Language Models for Speech and Audio Captioning.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Extending Multilingual ASR to New Languages Using Supplementary Encoder and Decoder Components.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

2023

Adaptive Transfer Kernel Learning for Transfer Gaussian Process Regression.

IEEE Trans. Pattern Anal. Mach. Intell., June, 2023

Transfer Kernel Learning for Multi-Source Transfer Gaussian Process Regression.

IEEE Trans. Pattern Anal. Mach. Intell., March, 2023

Graph contrastive learning with implicit augmentations.

Neural Networks, 2023

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models.

CoRR, 2023

Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts.

CoRR, 2023

Language-specific Acoustic Boundary Learning for Mandarin-English Code-switching Speech Recognition.

CoRR, 2023

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias.

CoRR, 2023

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis.

CoRR, 2023

PolyVoice: Language Models for Speech to Speech Translation.

CoRR, 2023

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation.

CoRR, 2023

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation.

CoRR, 2023

Unleashing Infinite-Length Input Capacity for Large-scale Language Models with Self-Controlled Memory System.

CoRR, 2023

Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions and Prospects.

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023

Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

S2CD: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Language-specific Boundary Learning for Improving Mandarin-English Code-switching Speech Recognition.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Knowledge Distillation Approach for Efficient Internal Language Model Estimation.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition.

Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

AudioQR: Deep Neural Audio Watermarks For QR Code.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

Dynamics Analysis of Large-Scale Transmission Tower-Line Coupled System under Measured Typhoon Load.

Proceedings of the 6th International Conference on Information Technologies and Electrical Engineering, 2023

Virtual Try-On with Pose-Garment Keypoints Guided Inpainting.

Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

LiteG2P: A Fast, Light and High Accuracy Model for Grapheme-to-Phoneme Conversion.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Internal Language Model Estimation Based Adaptive Language Model Fusion for Domain Adaptation.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

An ASR-Free Fluency Scoring Approach with Self-Supervised Learning.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Leveraging Phone-Level Linguistic-Acoustic Similarity For Utterance-Level Pronunciation Scoring.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Bytecover3: Accurate Cover Song Identification On Short Queries.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Improving Large-Scale Deep Biasing With Phoneme Features and Text-Only Data in Streaming Transducer.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2023

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training.

Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

2022

Sequence-Level Speaker Change Detection With Difference-Based Continuous Integrate-and-Fire.

IEEE Signal Process. Lett., 2022

Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features.

CoRR, 2022

Improving short-video speech recognition using random utterance concatenation.

CoRR, 2022

Unsupervised Video Domain Adaptation: A Disentanglement Perspective.

CoRR, 2022

A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation.

CoRR, 2022

Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information.

CoRR, 2022

S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification.

CoRR, 2022

Improving Contextual Representation with Gloss Regularized Pre-training.

Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, 2022

GIO: A Timbre-informed Approach for Pitch Tracking in Highly Noisy Environments.

Proceedings of the ICMR '22: International Conference on Multimedia Retrieval, Newark, NJ, USA, June 27, 2022

Synthesising Audio Adversarial Examples for Automatic Speech Recognition.

Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

Importance Prioritized Policy Distillation.

Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

Latent feature augmentation for chorus detection.

Proceedings of the 23rd International Society for Music Information Retrieval Conference, 2022

Towards high-fidelity singing voice conversion with acoustic reference and contrastive predictive coding.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

A Transfer and Multi-Task Learning based Approach for MOS Prediction.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Bring dialogue-context into RNN-T for streaming ASR.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Using Fluency Representation Learned from Sequential Raw Features for Improving Non-native Fluency Scoring.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire.

Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

BiFSMN: Binary Neural Network for Keyword Spotting.

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022

S3T: Self-Supervised Pre-Training with Swin Transformer For Music Classification.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Towards Using Clothes Style Transfer for Scenario-Aware Person Video Generation.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

The Volcspeech System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Language Adaptive Cross-Lingual Speech Representation Learning with Sparse Sharing Sub-Networks.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Improving Pseudo-Label Training For End-To-End Speech Recognition Using Gradient Mask.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Bytecover2: Towards Dimensionality Reduction of Latent Embedding for Efficient Cover Song Identification.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection.

Taylor Berg-Kirkpatrick, Shlomo Dubnov

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

Dynamic Transfer Gaussian Process Regression.

Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022

Zero-Shot Audio Source Separation through Query-Based Learning from Weakly-Labeled Data.

Taylor Berg-Kirkpatrick, Shlomo Dubnov

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech.

CoRR, 2021

Towards Realistic Visual Dubbing with Heterogeneous Sources.

Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders.

Proceedings of the 12th International Symposium on Chinese Spoken Language Processing, 2021

Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

HMM-Free Encoder Pre-Training for Streaming RNN Transducer.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition.

Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Improving RNN Transducer Modeling for Small-Footprint Keyword Spotting.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

A Chapter-Wise Understanding System for Text-To-Speech in Chinese Novels.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

PPG-Based Singing Voice Conversion with Adversarial Representation Learning.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Rule-Embedded Network for Audio-Visual Voice Activity Detection in Live Musical Video Streams.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

An Hrnet-Blstm Model With Two-Stage Training For Singing Melody Extraction.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Singing Melody Extraction from Polyphonic Music based on Spectral Correlation Modeling.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Bytecover: Cover Song Identification Via Multi-Loss Training.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

2020

Improving RNN transducer with normalized jointer network.

CoRR, 2020

Dynamic latency speech recognition with asynchronous revision.

CoRR, 2020

Contrastive Unsupervised Learning for Audio Fingerprinting.

CoRR, 2020

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech.

CoRR, 2020

A Hybrid Text Normalization System Using Multi-Head Self-Attention For Mandarin.

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis.

Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

2019

Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis.

Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

2017

Deep LSTM for Large Vocabulary Continuous Speech Recognition.

CoRR, 2017

Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model.

CoRR, 2017

Exponential Moving Average Model in Parallel Speech Recognition Training.

CoRR, 2017

2012

Unsupervised training of subspace gaussian mixture models for conversational telephone speech recognition.

Zejun Ma, Xiaorui Wang, Bo Xu

Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, 2012

2011

Fusing Multiple Confidence Measures for Chinese Spoken Term Detection.

Zejun Ma, Xiaorui Wang, Bo Xu

Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

An Empirical Study of Multilingual Spoken Term Detection.

Zejun Ma, Xiaorui Wang, Bo Xu

Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

2010

Distributed link-aware rate allocation for R-D optimal multiple video streaming over wireless networks.

Proceedings of the International Conference on Wireless Communications and Signal Processing, 2010