APSIPA Transactions on Signal and Information Processing > Vol 8 > Issue 1

Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling

Prashanth Gurunath Shivakumar (University of Southern California, USA); Haoqi Li (University of Southern California, USA); Kevin Knight (University of Southern California, USA); Panayiotis Georgiou (University of Southern California, USA; georgiou@sipi.usc.edu)
 
Suggested Citation
Prashanth Gurunath Shivakumar, Haoqi Li, Kevin Knight and Panayiotis Georgiou (2019), "Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling", APSIPA Transactions on Signal and Information Processing: Vol. 8: No. 1, e8. http://dx.doi.org/10.1017/ATSIP.2018.31

Publication Date: 01 Feb 2019
© 2019 Prashanth Gurunath Shivakumar, Haoqi Li, Kevin Knight and Panayiotis Georgiou
 
Keywords
Error correction; Speech recognition; Phrase-based context modeling; Noise channel estimation; Neural Network Language Model
 

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. HYPOTHESES 
III. METHODOLOGY 
IV. EXPERIMENTAL SETUP 
V. RESULTS AND DISCUSSION 
VI. CONCLUSIONS AND FUTURE WORK 

Abstract

Automatic speech recognition (ASR) systems often make unrecoverable errors due to pruning within their subsystems (acoustic, language, and pronunciation models); for example, words may be pruned on acoustic evidence using short-term context before the hypotheses are rescored with long-term linguistic context. In this work, we model the ASR as a phrase-based noisy transformation channel and propose an error correction system that learns from the aggregate errors of all the independent modules constituting the ASR and attempts to invert them. The proposed system can exploit long-term context using a neural network language model, can better choose between existing ASR output possibilities, and can re-introduce previously pruned or unseen (out-of-vocabulary) phrases. It provides corrections under poorly performing ASR conditions without degrading any accurate transcriptions; the corrections are larger when the ASR is applied to out-of-domain or mismatched data. Our system consistently provides improvements over the baseline ASR, even when the baseline is further optimized through recurrent neural network (RNN) language model rescoring. This demonstrates that any ASR improvements can be exploited independently and that our proposed system can potentially still provide benefits on highly optimized ASR. Finally, we present an extensive analysis of the types of errors corrected by our system.
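
To make the phrase-based noisy-channel idea concrete, the following is a minimal toy sketch and not the system described in the article. It assumes a tiny hand-written phrase table mapping noisy ASR phrases to clean phrases, and it uses a unigram language model as a stand-in for the neural network language model. All names and probabilities (`PHRASE_TABLE`, `LM`, `COPY_PROB`, `LM_FLOOR`, `correct`) are hypothetical and chosen only for illustration.

```python
import math
from functools import lru_cache

# Hypothetical phrase table: noisy ASR phrase -> {clean phrase: channel probability}.
PHRASE_TABLE = {
    ("wreck", "a", "nice"): {("recognize",): 0.6},
    ("beach",): {("speech",): 0.5, ("beach",): 0.5},
}

# Hypothetical unigram language model (stand-in for a neural LM).
LM = {"recognize": 0.05, "speech": 0.04, "beach": 0.01,
      "wreck": 0.001, "a": 0.1, "nice": 0.02}

COPY_PROB = 0.5      # assumed channel probability of leaving an unlisted word unchanged
LM_FLOOR = 1e-4      # assumed probability for words missing from the toy LM
MAX_PHRASE_LEN = 3   # longest noisy phrase considered


def candidates(span):
    """Return clean-phrase candidates for a noisy span with channel probabilities."""
    options = dict(PHRASE_TABLE.get(span, {}))
    if len(span) == 1 and span not in PHRASE_TABLE:
        options[span] = COPY_PROB  # fall back to copying the single word
    return options


def lm_logprob(words):
    """Log-probability of a clean phrase under the toy unigram LM."""
    return sum(math.log(LM.get(w, LM_FLOOR)) for w in words)


def correct(noisy_words):
    """Pick the clean sequence maximizing channel score plus LM score."""
    noisy = tuple(noisy_words)

    @lru_cache(maxsize=None)
    def best(i):
        # Best (score, clean words) for the suffix of the input starting at i.
        if i == len(noisy):
            return 0.0, ()
        best_score, best_out = -math.inf, ()
        for k in range(1, MAX_PHRASE_LEN + 1):
            if i + k > len(noisy):
                break
            span = noisy[i:i + k]
            for clean, prob in candidates(span).items():
                tail_score, tail_out = best(i + k)
                score = math.log(prob) + lm_logprob(clean) + tail_score
                if score > best_score:
                    best_score, best_out = score, clean + tail_out
        return best_score, best_out

    return list(best(0)[1])


print(correct("wreck a nice beach".split()))
```

Because the toy language model is a unigram model, the dynamic program over segmentations is exact here. A language model with longer context, such as the neural model mentioned in the abstract, would generally require carrying history through the search or rescoring an n-best list instead.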

DOI:10.1017/ATSIP.2018.31