Sami Virpioja

Orcid: 0000-0002-3568-150X

According to our database, Sami Virpioja authored at least 58 papers between 2005 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.




Democratizing neural machine translation with OPUS-MT.
Lang. Resour. Evaluation, June, 2024

Uncertainty-Aware Natural Language Inference with Stochastic Weight Averaging.
Proceedings of the 24th Nordic Conference on Computational Linguistics, 2023

Unsupervised Feature Selection for Effective Parallel Corpus Filtering.
Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 2023

Democratizing Machine Translation with OPUS-MT.
CoRR, 2022

Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian.
Comput. Speech Lang., 2021

Advances in subword-based HMM-DNN speech recognition across languages.
Comput. Speech Lang., 2021

Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages.
Proceedings of the 23rd Nordic Conference on Computational Linguistics, 2021

Boosting Neural Machine Translation from Finnish to Northern Sámi with Rule-Based Backtranslation.
Proceedings of the 23rd Nordic Conference on Computational Linguistics, 2021

Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation.
Mach. Transl., 2020

The University of Helsinki and Aalto University submissions to the WMT 2020 news and low-resource translation tasks.
Proceedings of the Fifth Conference on Machine Translation, 2020

Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models.
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages and Collaboration and Computing for Under-Resourced Languages, 2020

Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

OpusTools and Parallel Corpus Diagnostics.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics.
Proceedings of the 21st Annual Conference of the International Speech Communication Association, 2020

Controlling the Imprint of Passivization and Negation in Contextualized Representations.
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020

OpusFilter: A Configurable Parallel Corpus Filtering Toolbox.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020

The University of Helsinki Submissions to the WMT19 News Translation Task.
Proceedings of the Fourth Conference on Machine Translation, 2019

The University of Helsinki Submissions to the WMT19 Similar Language Translation Task.
Proceedings of the Fourth Conference on Machine Translation, 2019

Subword RNNLM Approximations for Out-Of-Vocabulary Keyword Search.
Proceedings of the 20th Annual Conference of the International Speech Communication Association, 2019

Using Statistical Models of Morphology in the Search for Optimal Units of Representation in the Human Mental Lexicon.
Cogn. Sci., 2018

Cognate-aware morphological segmentation for multilingual neural translation.
Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018

First-Pass Techniques for Very Large Vocabulary Speech Recognition of Morphologically Rich Languages.
Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018

Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies.
IEEE ACM Trans. Audio Speech Lang. Process., 2017

Extending hybrid word-character neural machine translation with multi-task learning of morphological analysis.
Proceedings of the Second Conference on Machine Translation, 2017

Improved Subword Modeling for WFST-Based Speech Recognition.
Proceedings of the 18th Annual Conference of the International Speech Communication Association, 2017

Aalto system for the 2017 Arabic multi-genre broadcast challenge.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

Character-based units for unlimited vocabulary continuous speech recognition.
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, 2017

A Comparative Study of Minimally Supervised Morphological Segmentation.
Comput. Linguistics, 2016

Hybrid Morphological Segmentation for Phrase-Based Machine Translation.
Proceedings of the First Conference on Machine Translation, 2016

Class n-Gram Models for Very Large Vocabulary Speech Recognition of Finnish and Estonian.
Proceedings of the Statistical Language and Speech Processing, 2016

LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages.
Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015

Tuning Phrase-Based Segmented Translation for a Morphologically Complex Target Language.
Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015

Morfessor 2.0: Toolkit for statistical morphological segmentation.
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014

Painless Semi-Supervised Morphological Segmentation using Conditional Random Fields.
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014

Morfessor FlatCat: An HMM-Based Method for Unsupervised and Semi-Supervised Learning of Morphology.
Proceedings of the COLING 2014, 2014

Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields.
Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013

Learning a subword vocabulary based on unigram likelihood.
Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013

Evaluating vector space models with canonical correlation analysis.
Nat. Lang. Eng., 2012

Empirical Comparison of Evaluation Methods for Unsupervised Learning of Morphology.
Trait. Autom. des Langues, 2011

Evaluating the effect of word frequencies in a probabilistic generative model of morphology.
Proceedings of the 18th Nordic Conference of Computational Linguistics, 2011

Predicting Reaction Times in Word Recognition by Unsupervised Learning of Morphology.
Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2011, 2011

Applying Morphological Decompositions to Statistical Machine Translation.
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, 2010

Morpho Challenge 2005-2010: Evaluations and Results.
Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, 2010

Semi-Supervised Learning of Concatenative Morphology.
Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, 2010

Language Identification of Short Text Segments with N-gram Models.
Proceedings of the International Conference on Language Resources and Evaluation, 2010

Morpho Challenge - Evaluation of algorithms for unsupervised learning of morphology in various tasks and languages.
Proceedings of the Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31, 2009

Minimum Bayes Risk Combination of Translation Hypotheses from Alternative Morphological Decompositions.
Proceedings of the Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31, 2009

Web Augmentation of Language Models for Continuous Speech Recognition of SMS Text Messages.
Proceedings of the EACL 2009, 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, Athens, Greece, March 30, 2009

Unsupervised Morpheme Analysis with Allomorfessor.
Proceedings of the Multilingual Information Access Evaluation I. Text Retrieval Experiments, 2009

Unsupervised Morpheme Discovery with Allomorfessor.
Proceedings of the Working Notes for CLEF 2009 Workshop co-located with the 13th European Conference on Digital Libraries (ECDL 2009), Corfù, Greece, September 30, 2009

Overview and Results of Morpho Challenge 2009.
Proceedings of the Multilingual Information Access Evaluation I. Text Retrieval Experiments, 2009

Adaptive Translation: Finding Interlingual Mappings Using Self-Organizing Maps.
Proceedings of the Artificial Neural Networks, 2008

Allomorfessor: Towards Unsupervised Morpheme Analysis.
Proceedings of the Working Notes for CLEF 2008 Workshop co-located with the 12th European Conference on Digital Libraries (ECDL 2008), 2008

On Growing and Pruning Kneser-Ney Smoothed N-Gram Models.
IEEE Trans. Speech Audio Process., 2007

Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner.
Proceedings of Machine Translation Summit XI: Papers, 2007

Unlimited vocabulary speech recognition with morph language models applied to Finnish.
Comput. Speech Lang., 2006

Compact n-gram models by incremental growing and clustering of histories.
Proceedings of the Ninth International Conference on Spoken Language Processing, 2006

Unsupervised Morphology Induction Using Morfessor.
Proceedings of the Finite-State Methods and Natural Language Processing, 2005
