Samuel Thomas

Orcid: 0000-0001-7573-0620

  • IBM Research AI, Thomas J. Watson Research Center, NY, USA
  • Johns Hopkins University, USA (former)

According to our database, Samuel Thomas authored at least 108 papers between 2006 and 2024.

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation.
CoRR, 2024

What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

ConvKT: Conversation-Level Knowledge Transfer for Context Aware End-to-End Spoken Language Understanding.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

Multi-Speaker Data Augmentation for Improved end-to-end Automatic Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2023

Fine-Grained Textual Knowledge Transfer to Improve RNN Transducers for Speech Recognition and Understanding.
Proceedings of the IEEE International Conference on Acoustics, 2023

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval.
Proceedings of the IEEE International Conference on Acoustics, 2023

Effective Training of RNN Transducer Models on Diverse Sources of Speech and Text Data.
Proceedings of the IEEE International Conference on Acoustics, 2023

Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Extending RNN-T-based speech recognition systems with emotion and language classification.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Global RNN Transducer Models For Multi-dialect Speech Recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Integrating Text Inputs for Training and Adapting RNN Transducer ASR Models.
Proceedings of the IEEE International Conference on Acoustics, 2022

Towards Reducing the Need for Speech Training Data to Build Spoken Language Understanding Systems.
Proceedings of the IEEE International Conference on Acoustics, 2022

Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding.
Proceedings of the IEEE International Conference on Acoustics, 2022

Improving End-to-end Models for Set Prediction in Spoken Language Understanding.
Proceedings of the IEEE International Conference on Acoustics, 2022

A New Data Augmentation Method for Intent Classification Enhancement and its Application on Spoken Conversation Datasets.
Proceedings of the IEEE International Conference on Acoustics, 2022

Everything at Once - Multi-modal Fusion Transformer for Video Retrieval.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection.
IEEE ACM Trans. Audio Speech Lang. Process., 2021

Routing with Self-Attention for Multimodal Capsule Networks.
CoRR, 2021

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Cascaded Multilingual Audio-Visual Learning from Videos.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Integrating Dialog History into End-to-End Spoken Language Understanding Systems.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Knowledge Distillation Based Training of Universal ASR Source Models for Cross-Lingual Transfer.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

End-to-End Spoken Language Understanding Using Transformer Networks and Self-Supervised Pre-Trained Features.
Proceedings of the IEEE International Conference on Acoustics, 2021

RNN Transducer Models for Spoken Language Understanding.
Proceedings of the IEEE International Conference on Acoustics, 2021

Resource-efficient TDNN Architectures for Audio-visual Speech Recognition.
Proceedings of the 29th European Signal Processing Conference, 2021

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos.
CoRR, 2020

End-to-End Spoken Language Understanding Without Full Transcripts.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Resource-Adaptive Deep Learning for Visual Speech Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Training Spoken Language Understanding Systems with Non-Parallel Speech and Text.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Audio-Assisted Image Inpainting for Talking Faces.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Detection and Recovery of OOVs for Improved English Broadcast News Captioning.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Learning Speaker Aware Offsets for Speaker Adaptation of Neural Networks.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

English Broadcast News Speech Recognition by Humans and Machines.
Proceedings of the IEEE International Conference on Acoustics, 2019

Improvements to N-gram Language Model Using Text Generated from Neural Language Model.
Proceedings of the IEEE International Conference on Acoustics, 2019

Pre-training of Speaker Embeddings for Low-latency Speaker Change Detection in Broadcast News.
Proceedings of the IEEE International Conference on Acoustics, 2019

Grounding Spoken Words in Unlabeled Video.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019

Simplified LSTMs for Speech Recognition.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Semi-Supervised Training and Data Augmentation for Adaptation of Automatic Broadcast News Captioning Systems.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Mixed Bandwidth Acoustic Modeling Leveraging Knowledge Distillation.
Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2019

Understanding Unequal Gender Classification Accuracy from Face Images.
CoRR, 2018

SimplerVoice: A Key Message & Visual Description Generator System for Illiteracy.
CoRR, 2018

A Recorded Debating Dataset.
Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2018

Inference-Invariant Transformation of Batch Normalization for Domain Adaptation of Acoustic Models.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Data Augmentation Improves Recognition of Foreign Accented Speech.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

English Conversational Telephone Speech Recognition by Humans and Machines.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Efficient Knowledge Distillation from an Ensemble of Teachers.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Effective joint training of denoising feature space transforms and Neural Network based acoustic models.
Proceedings of the 2017 IEEE International Conference on Acoustics, 2017

Invariant Representations for Noisy Speech Recognition.
CoRR, 2016

Multilingual Data Selection for Low Resource Speech Recognition.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

Domain Adaptation of CNN Based Acoustic Models Under Limited Resource Settings.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

An Investigation on the Use of i-Vectors for Robust ASR.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

CNMF-based acoustic features for noise-robust ASR.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

On the importance of event detection for ASR.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

The IBM BOLT speech transcription system.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Investigating factor analysis features for deep neural networks in noisy speech recognition.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Improvements to the IBM speech activity detection system for the DARPA RATS program.
Proceedings of the 2015 IEEE International Conference on Acoustics, 2015

Annealed dropout training of deep networks.
Proceedings of the 2014 IEEE Spoken Language Technology Workshop, 2014

Deep Order Statistic Networks.
Proceedings of the 2014 IEEE Spoken Language Technology Workshop, 2014

Robust language identification using convolutional neural network features.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions.
Proceedings of the IEEE International Conference on Acoustics, 2014

The IBM speech activity detection system for the DARPA RATS program.
Proceedings of the 14th Annual Conference of the International Speech Communication Association, 2013

Deep neural network features and semi-supervised training for low resource speech recognition.
Proceedings of the IEEE International Conference on Acoustics, 2013

Developing a speaker identification system for the DARPA RATS project.
Proceedings of the IEEE International Conference on Acoustics, 2013

Weak top-down constraints for unsupervised acoustic model training.
Proceedings of the IEEE International Conference on Acoustics, 2013

Adaptation transforms of auto-associative neural networks as features for speaker verification.
Proceedings of the Odyssey 2012: The Speaker and Language Recognition Workshop, 2012

Feature extraction using 2-d autoregressive models for speaker recognition.
Proceedings of the Odyssey 2012: The Speaker and Language Recognition Workshop, 2012

Acoustic and Data-driven Features for Robust Speech Activity Detection.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Data-driven Posterior Features for Low Resource Speech Recognition Applications.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Exploiting Discriminative Point Process Models for Spoken Term Detection.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Intrinsic Spectral Analysis for Zero and High Resource Speech Recognition.
Proceedings of the 13th Annual Conference of the International Speech Communication Association, 2012

Multilingual MLP features for low-resource LVCSR systems.
Proceedings of the 2012 IEEE International Conference on Acoustics, 2012

The subspace Gaussian mixture model - A structured model for speech recognition.
Comput. Speech Lang., 2011

Performance monitoring for robustness in automatic recognition of speech.
Proceedings of the 2011 Symposium on Machine Learning in Speech and Language Processing, 2011

Mixture of Auto-Associative Neural Networks for Speaker Verification.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

Adaptive Stream Fusion in Multistream Recognition of Speech.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

Rapid Evaluation of Speech Representations for Spoken Term Discovery.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 Summer Workshop.
Proceedings of the IEEE International Conference on Acoustics, 2011

MLP based phoneme detectors for Automatic Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2011

A phoneme recognition framework based on auditory spectro-temporal receptive fields.
Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

Cross-lingual and multi-stream posterior features for low resource LVCSR systems.
Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

A multistream multiresolution framework for phoneme recognition.
Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

Subspace Gaussian Mixture Models for speech recognition.
Proceedings of the IEEE International Conference on Acoustics, 2010

Approaches to automatic lexicon learning with limited training examples.
Proceedings of the IEEE International Conference on Acoustics, 2010

A novel estimation of feature-space MLLR for full-covariance models.
Proceedings of the IEEE International Conference on Acoustics, 2010

Comparison of modulation features for phoneme recognition.
Proceedings of the IEEE International Conference on Acoustics, 2010

Robust spectro-temporal features based on autoregressive models of Hilbert envelopes.
Proceedings of the IEEE International Conference on Acoustics, 2010

Multilingual acoustic modeling for speech recognition based on subspace Gaussian Mixture Models.
Proceedings of the IEEE International Conference on Acoustics, 2010

Applications of signal analysis using autoregressive models for amplitude modulation.
Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009

Tandem representations of spectral envelope and modulation frequency features for ASR.
Proceedings of the 10th Annual Conference of the International Speech Communication Association, 2009

Static and dynamic modulation spectrum for speech recognition.
Proceedings of the 10th Annual Conference of the International Speech Communication Association, 2009

Phoneme recognition using spectral envelope and modulation frequency features.
Proceedings of the IEEE International Conference on Acoustics, 2009

Temporal envelope subtraction for robust speech recognition using modulation spectrum.
Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009

Recognition of Reverberant Speech Using Frequency Domain Linear Prediction.
IEEE Signal Process. Lett., 2008

Hilbert Envelope Based Features for Far-Field Speech Recognition.
Proceedings of the Machine Learning for Multimodal Interaction, 5th International Workshop, 2008

Hilbert envelope based spectro-temporal features for phoneme recognition in telephone speech.
Proceedings of the 9th Annual Conference of the International Speech Communication Association, 2008

Front-end for far-field speech recognition based on frequency domain linear prediction.
Proceedings of the 9th Annual Conference of the International Speech Communication Association, 2008

Spectro-temporal features for Automatic Speech Recognition using Linear Prediction in spectral domain.
Proceedings of the 2008 16th European Signal Processing Conference, 2008

Language identification of person names using CF-IOF based weighing function.
Proceedings of the 8th Annual Conference of the International Speech Communication Association, 2007

Natural sounding TTS based on syllable-like units.
Proceedings of the 14th European Signal Processing Conference, 2006
