Alexey Karpov

Orcid: 0000-0003-3424-652X

  • Russian Academy of Sciences, St. Petersburg Institute for Informatics and Automation, Russia

According to our database, Alexey Karpov authored at least 122 papers between 2005 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.







Gated Siamese Fusion Network based on multimodal deep and hand-crafted features for personality traits assessment.
Pattern Recognit. Lett., 2024

OCEAN-AI framework with EmoFormer cross-hemiface attention approach for personality traits assessment.
Expert Syst. Appl., 2024

Audio-visual speech recognition based on regulated transformer and spatio-temporal fusion strategy for driver assistive systems.
Expert Syst. Appl., 2024

Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision.
CoRR, 2024

SUN Team's Contribution to ABAW 2024 Competition: Audio-visual Valence-Arousal Estimation and Expression Recognition.
CoRR, 2024

Cross-Cultural Automatic Depression Detection Based on Audio Signals.
Proceedings of the Speech and Computer - 26th International Conference, 2024

OpenAV: Bilingual Dataset for Audio-Visual Voice Control of a Computer for Hand Disabled People.
Proceedings of the Speech and Computer - 26th International Conference, 2024

A Cross-Multi-modal Fusion Approach for Enhanced Engagement Recognition.
Proceedings of the Speech and Computer - 26th International Conference, 2024

Audio-Visual Speech Recognition In-The-Wild: Multi-Angle Vehicle Cabin Corpus and Attention-Based Method.
Proceedings of the IEEE International Conference on Acoustics, 2024

Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Multi-modal Arousal and Valence Estimation under Noisy Conditions.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

EMO-AVSR: Two-Level Approach for Audio-Visual Emotional Speech Recognition.
Proceedings of the Speech and Computer - 25th International Conference, 2023

Multimodal Personality Traits Assessment (MuPTA) Corpus: The Impact of Spontaneous and Read Speech.
Proceedings of the 24th Annual Conference of the International Speech Communication Association, 2023

End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild.
Multimodal Technol. Interact., 2022

In search of a robust facial expressions recognition model: A large-scale visual cross-corpus study.
Neurocomputing, 2022

Editorial: Towards Omnipresent and Smart Speech Assistants.
Frontiers Comput. Sci., 2022

Physical Training Reverses the Impaired Cardiac Autonomic Control and Exercise Tolerance Induced by Right-Side Vagal Denervation.
IEEE Access, 2022

Self-Configuring Genetic Programming Feature Generation in Affect Recognition Tasks.
Proceedings of the Speech and Computer - 24th International Conference, 2022

Multi-level Fusion of Fisher Vector Encoded BERT and Wav2vec 2.0 Embeddings for Native Language Identification.
Proceedings of the Speech and Computer - 24th International Conference, 2022

DyCoDa: A Multi-modal Data Collection of Multi-user Remote Survival Game Recordings.
Proceedings of the Speech and Computer - 24th International Conference, 2022

RUSAVIC Corpus: Russian Audio-Visual Speech in Cars.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

Complex Paralinguistic Analysis of Speech: Predicting Gender, Emotions and Deception in a Hierarchical Framework.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

DAVIS: Driver's Audio-Visual Speech recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

MIDriveSafely: Multimodal Interaction for Drive Safely.
Proceedings of the International Conference on Multimodal Interaction, 2022

Medical exoskeleton "Remotion" with an intelligent control system: Modeling, implementation, and testing.
Simul. Model. Pract. Theory, 2021

A Bimodal Approach for Speech Emotion Recognition using Audio and Text.
J. Internet Serv. Inf. Secur., 2021

Heterogeneous Face Recognition from Facial Sketches.
Informatica (Slovenia), 2021

Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin.
IEEE Access, 2021

X-Bridge: Image-to-Image Translation with Reconstruction Capabilities.
Proceedings of the Speech and Computer - 23rd International Conference, 2021

Deep Learning Based Engagement Recognition in Highly Imbalanced Data.
Proceedings of the Speech and Computer - 23rd International Conference, 2021

Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Using Complexity-Identical Human- and Machine-Directed Utterances to Investigate Addressee Detection for Spoken Dialogue Systems.
Sensors, 2020

An Audio-Video Deep and Transfer Learning Framework for Multimodal Emotion Recognition in the wild.
CoRR, 2020

Lipreading with LipsID.
Proceedings of the Speech and Computer - 22nd International Conference, 2020

Emotion Recognition and Sentiment Analysis of Extemporaneous Speech Transcriptions in Russian.
Proceedings of the Speech and Computer - 22nd International Conference, 2020

Class-based LSTM Russian Language Model with Linguistic Information.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

TheRuSLan: Database of Russian Sign Language.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Combining Clustering and Functionals based Acoustic Feature Representations for Classification of Baby Sounds.
Proceedings of the Companion Publication of the 2020 International Conference on Multimodal Interaction, 2020

Speaking Style Based Apparent Personality Recognition.
Proceedings of the Speech and Computer - 21st International Conference, 2019

Time-Continuous Emotion Recognition Using Spectrogram Based CNN-RNN Modelling.
Proceedings of the Speech and Computer - 21st International Conference, 2019

Cross-Corpus Data Augmentation for Acoustic Addressee Detection.
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, 2019

Human-Robot Interaction with Smart Shopping Trolley Using Sign Language: Data Collection.
Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops, 2019

Predicting Depression and Emotions in the Cross-roads of Cultures, Para-linguistics, and Non-linguistics.
Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, 2019

Applying Ensemble Learning Techniques and Neural Networks to Deceptive and Truthful Information Detection Task in the Flow of Speech.
Proceedings of the Intelligent Distributed Computing XIII, 2019

Lower Limbs Exoskeleton Control System Based on Intelligent Human-Machine Interface.
Proceedings of the Intelligent Distributed Computing XIII, 2019

Smartphone-Based Driver Support in Vehicle Cabin: Human-Computer Interaction Interface.
Proceedings of the Interactive Collaborative Robotics - 4th International Conference, 2019

Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems.
Proceedings of the IEEE International Conference on Acoustics, 2019

Speech communication integrated with other modalities.
J. Multimodal User Interfaces, 2018

Multimodal speech recognition: increasing accuracy using high speed video data.
J. Multimodal User Interfaces, 2018

Efficient and effective strategies for cross-corpus acoustic emotion recognition.
Neurocomputing, 2018

Comparative Analysis of Classification Methods for Automatic Deception Detection in Speech.
Proceedings of the Speech and Computer - 20th International Conference, 2018

LipsID Using 3D Convolutional Neural Networks.
Proceedings of the Speech and Computer - 20th International Conference, 2018

Context Modeling for Cross-Corpus Dimensional Acoustic Emotion Recognition: Challenges and Mixup.
Proceedings of the Speech and Computer - 20th International Conference, 2018

LSTM Based Cross-corpus and Cross-task Acoustic Emotion Recognition.
Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018

Sign Language Numeral Gestures Recognition Using Convolutional Neural Network.
Proceedings of the Interactive Collaborative Robotics - Third International Conference, 2018

Signal Processing Platforms and Algorithms for Real-Life Communications and Listening to Digital Audio.
J. Electr. Comput. Eng., 2017

Emotion, age, and gender classification in children's speech by humans and machines.
Comput. Speech Lang., 2017

A study of neural network Russian language models for automatic continuous speech recognition systems.
Autom. Remote. Control., 2017

Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions.
Proceedings of the Speech and Computer - 19th International Conference, 2017

Semi-automatic Facial Key-Point Dataset Creation.
Proceedings of the Speech and Computer - 19th International Conference, 2017

Are You Addressing Me? Multimodal Addressee Detection in Human-Human-Computer Conversations.
Proceedings of the Speech and Computer - 19th International Conference, 2017

Introducing Weighted Kernel Classifiers for Handling Imbalanced Paralinguistic Corpora: Snoring, Addressee and Cold.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Speech and Text Analysis for Multimodal Addressee Detection in Human-Human-Computer Interaction.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Facing Face Recognition with ResNet: Round One.
Proceedings of the Interactive Collaborative Robotics - Second International Conference, 2017

Towards Automatic Recognition of Sign Language Gestures Using Kinect 2.0.
Proceedings of the Universal Access in Human-Computer Interaction. Designing Novel Interactions, 2017

HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech.
Proceedings of the Speech and Computer - 18th International Conference, 2016

DNN-Based Acoustic Modeling for Russian Speech Recognition Using Kaldi.
Proceedings of the Speech and Computer - 18th International Conference, 2016

Automatic Technologies for Processing Spoken Sign Languages.
Proceedings of the SLTU-2016, 2016

Language Models with RNNs for Rescoring Hypotheses of Russian ASR.
Proceedings of the Advances in Neural Networks - ISNN 2016, 2016

Robust Acoustic Emotion Recognition Based on Cascaded Normalization and Extreme Learning Machines.
Proceedings of the Advances in Neural Networks - ISNN 2016, 2016

Fusing Acoustic Feature Representations for Computational Paralinguistics Tasks.
Proceedings of the 17th Annual Conference of the International Speech Communication Association, 2016

An Analysis of Visual Faces Datasets.
Proceedings of the Interactive Collaborative Robotics - First International Conference, 2016

Multimodal Information Coding System for Wearable Devices of Advanced Uniform.
Proceedings of the Human Interface and the Management of Information: Information, Design and Interaction, 2016

Bimodal Speech Recognition Fusing Audio-Visual Modalities.
Proceedings of the Human-Computer Interaction. Interaction Platforms and Techniques, 2016

EmoChildRu: Emotional Child Russian Speech Corpus.
Proceedings of the Speech and Computer - 17th International Conference, 2015

A Comparison of RNN LM and FLM for Russian Speech Recognition.
Proceedings of the Speech and Computer - 17th International Conference, 2015

Fisher vectors with cascaded normalization for paralinguistic analysis.
Proceedings of the 16th Annual Conference of the International Speech Communication Association, 2015

Automatic Analysis of Speech and Acoustic Events for Ambient Assisted Living.
Proceedings of the Universal Access in Human-Computer Interaction. Access to Interaction, 2015

Large vocabulary Russian speech recognition using syntactico-statistical language modeling.
Speech Commun., 2014

Automatic speech recognition for under-resourced languages: A survey.
Speech Commun., 2014

Introduction to the special issue on processing under-resourced languages.
Speech Commun., 2014

An automatic multimodal speech recognition system with audio and video information.
Autom. Remote. Control., 2014

Study of Morphological Factors of Factored Language Models for Russian ASR.
Proceedings of the Speech and Computer - 16th International Conference, 2014

A Framework for Recording Audio-Visual Speech Corpora with a Microphone and a High-Speed Camera.
Proceedings of the Speech and Computer - 16th International Conference, 2014

Rescoring n-best lists for Russian speech recognition using factored language models.
Proceedings of the 4th Workshop on Spoken Language Technologies for Under-resourced Languages, 2014

Audio-visual signal processing in a multimodal assisted living environment.
Proceedings of the 15th Annual Conference of the International Speech Communication Association, 2014

A Universal Assistive Technology with Multimodal Input and Multimedia Output Interfaces.
Proceedings of the Universal Access in Human-Computer Interaction. Design and Development Methods for Universal Access, 2014

Modeling of Pronunciation, Language and Nonverbal Units at Conversational Russian Speech Recognition.
Int. J. Comput. Sci. Appl., 2013

Lexicon Size and Language Model Order Optimization for Russian LVCSR.
Proceedings of the Speech and Computer - 15th International Conference, 2013

Multimodal Synthesizer for Russian and Czech Sign Languages and Audio-Visual Speech.
Proceedings of the Universal Access in Human-Computer Interaction. User and Context Diversity, 2013

Analysis of the Quotation Corpus of the Russian Wiktionary.
Res. Comput. Sci., 2012

Speech recognition for east Slavic languages: the case of Russian.
Proceedings of the Third Workshop on Spoken Language Technologies for Under-resourced Languages, 2012

State-of-the-art speech recognition technologies for Russian language.
Proceedings of the Joint International Conference on Human-Centered Computer Environments, 2012

Analysis of Long-distance Word Dependencies and Pronunciation Variability at Conversational Russian Speech Recognition.
Proceedings of the Federated Conference on Computer Science and Information Systems, 2012

Automatic fingersign-to-speech translation system.
J. Multimodal User Interfaces, 2011

Very Large Vocabulary ASR for Spoken Russian with Syntactic and Morphemic Analysis.
Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011

Influence of Phone-Viseme Temporal Correlations on Audiovisual STT and TTS Performance.
Proceedings of the 17th International Congress of Phonetic Sciences, 2011

An Assistive Bi-modal User Interface Integrating Multi-channel Speech Recognition and Computer Vision.
Proceedings of the Human-Computer Interaction. Interaction Techniques and Environments, 2011

Client and Speech Detection System for Intelligent Infokiosk.
Proceedings of the Text, Speech and Dialogue, 13th International Conference, 2010

A Video Monitoring Model with a Distributed Camera System for the Smart Space.
Proceedings of the Smart Spaces and Next Generation Wired/Wireless Networking, 2010

Multichannel System of Audio-Visual Support of Remote Mobile Participant at E-Meeting.
Proceedings of the Smart Spaces and Next Generation Wired/Wireless Networking, 2010

Viseme-dependent weight optimization for CHMM-based audio-visual speech recognition.
Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

Multimodal human-robot interaction.
Proceedings of the International Conference on Ultra Modern Telecommunications, 2010

Multimodal Human Computer Interaction with MIDAS Intelligent Infokiosk.
Proceedings of the 20th International Conference on Pattern Recognition, 2010

Audio-visual speech asynchrony modeling in a talking head.
Proceedings of the 10th Annual Conference of the International Speech Communication Association, 2009

Speech activity and speaker novelty detection methods for meeting processing.
Proceedings of the International Conference on Ultra Modern Telecommunications, 2009

Multimodal control via heterogeneous devices.
Proceedings of the International Conference on Ultra Modern Telecommunications, 2009

Designing Cognition-Centric Smart Room Predicting Inhabitant Activities.
Proceedings of the Foundations of Augmented Cognition. Neuroergonomics and Operational Neuroscience, 2009

Multimodal user interface for the communication of the disabled.
J. Multimodal User Interfaces, 2008

A Semi-automatic Wizard of Oz Technique for Let'sFly Spoken Dialogue System.
Proceedings of the Text, Speech and Dialogue, 11th International Conference, 2008

Comparison of two different similar speech and gestures multimodal interfaces.
Proceedings of the 2008 16th European Signal Processing Conference, 2008

ICANDO: Low cost multimodal interface for hand disabled people.
J. Multimodal User Interfaces, 2007

Multi-modal system ICANDO: intellectual computer assistant for disabled operators.
Proceedings of the Ninth International Conference on Spoken Language Processing, 2006

Combined Gesture-Speech Analysis and Speech Driven Gesture Synthesis.
Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, 2006

ICanDo: Intellectual Computer AssistaNt for Disabled Operators.
Proceedings of the 14th European Signal Processing Conference, 2006

Speech Interface for Internet Service 'Yellow Pages'.
Proceedings of the Intelligent Information Processing and Web Mining, 2005

Assistive multimodal system based on speech recognition and head tracking.
Proceedings of the 13th European Signal Processing Conference, 2005

Multimodal system for hands-free PC control.
Proceedings of the 13th European Signal Processing Conference, 2005
