now publishers - End-to-end recognition of streaming Japanese speech using CTC and local attention

APSIPA Transactions on Signal and Information Processing > Vol 9 > Issue 1

End-to-end recognition of streaming Japanese speech using CTC and local attention

Jiahao Chen, Tokushima University, Japan; Ryota Nishimura, Tokushima University, Japan; Norihide Kitaoka, Toyohashi University of Technology, Japan, kitaoka@tut.jp

Suggested Citation

Jiahao Chen, Ryota Nishimura and Norihide Kitaoka (2020), "End-to-end recognition of streaming Japanese speech using CTC and local attention", APSIPA Transactions on Signal and Information Processing: Vol. 9: No. 1, e25. http://dx.doi.org/10.1017/ATSIP.2020.23

Publication Date: 23 Nov 2020

Keywords

CTC, Local attention, Speech recognition, Streaming recognition

Open Access

This is published under the terms of the Creative Commons Attribution licence.

Abstract

Many end-to-end, large-vocabulary continuous speech recognition systems now achieve better recognition performance than conventional systems. However, most of these approaches are based on bidirectional networks and sequence-to-sequence modeling, so automatic speech recognition (ASR) systems that use them must wait for an entire segment of voice input before they can begin processing. The resulting time lag can be a serious drawback in some applications. An obvious solution is to develop a speech recognition algorithm capable of processing streaming data. In this paper, we therefore explore the possibility of a streaming, online ASR system for Japanese using a model based on unidirectional LSTMs with local attention, trained using the connectionist temporal classification (CTC) criterion. Such an approach has not been well investigated for Japanese, as most Japanese-language ASR systems employ bidirectional networks. In experimental evaluation, the best result for our proposed system was a character error rate of 9.87%.

DOI:10.1017/ATSIP.2020.23
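
The abstract reports its headline result as a character error rate (CER). As a rough illustration of how such a figure is commonly computed, the sketch below divides the character-level Levenshtein distance between a reference transcript and a hypothesis by the reference length. The function names and the example strings are hypothetical, and the paper's actual scoring procedure (for example, any text normalization applied before scoring) is not described on this page, so this is an assumption rather than the authors' method.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (hypothetical helper)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the paper's evaluation data.
    reference = "今日は晴れ"
    hypothesis = "今日は雨"
    print(f"CER: {character_error_rate(reference, hypothesis):.2%}")
```

Character-level scoring suits Japanese because the written language has no whitespace word boundaries, so a word error rate would depend on the choice of tokenizer.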

I. INTRODUCTION
II. E2E SPEECH RECOGNITION
III. DETAILS OF OUR APPROACH
IV. EXPERIMENTAL SETUP
V. RESULTS
VI. CONCLUSIONS
