now publishers - End-to-end Japanese Multi-dialect Speech Recognition and Dialect Identification with Multi-task Learning

APSIPA Transactions on Signal and Information Processing > Vol 11 > Issue 1

End-to-end Japanese Multi-dialect Speech Recognition and Dialect Identification with Multi-task Learning

Ryo Imaizumi, Tokyo Metropolitan University, Japan, sayaka@tmu.ac.jp; Ryo Masumura, NTT Media Intelligence Laboratories, NTT Corporation, Japan; Sayaka Shiota, Tokyo Metropolitan University, Japan; Hitoshi Kiya, Tokyo Metropolitan University, Japan

Suggested Citation

Ryo Imaizumi, Ryo Masumura, Sayaka Shiota and Hitoshi Kiya (2022), "End-to-end Japanese Multi-dialect Speech Recognition and Dialect Identification with Multi-task Learning", APSIPA Transactions on Signal and Information Processing: Vol. 11: No. 1, e4. http://dx.doi.org/10.1561/116.00000045

Publication Date: 29 Mar 2022

Keywords

Japanese multi-dialect automatic speech recognition, Japanese dialect identification, multi-task learning, transformer-based encoder-decoder, end-to-end model

Open Access

This is published under the terms of CC BY-NC.

In this article:

Abstract

End-to-end systems have demonstrated state-of-the-art performance on many tasks related to automatic speech recognition (ASR) and dialect identification (DID). In this paper, we propose multi-task learning of Japanese DID and multi-dialect ASR (MD-ASR) systems with end-to-end models. Because Japanese dialects differ from one another both linguistically and acoustically, Japanese DID must consider linguistic and acoustic features simultaneously. One way to realize Japanese DID with both kinds of features is to use ASR transcriptions when performing DID. However, transcribing Japanese multi-dialect speech into text is regarded as a challenging ASR task because of the large linguistic and acoustic gaps between each dialect and standard Japanese. One way to address this is dialect-aware ASR modeling, in which DID is performed together with ASR. We therefore propose a multi-task learning framework for Japanese DID and ASR that represents the dependency between the two tasks. Within this framework, we explore three systems that differ in the order in which DID and ASR are performed. In the experiments, Japanese multi-dialect ASR and DID tests were conducted on our in-house Japanese multi-dialect database and a standard Japanese database. The proposed transformer-based systems outperformed conventional single-task systems on both the DID and ASR tests.
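To make the idea of changing the order of DID and ASR more concrete, the following is a minimal sketch of one common way an encoder-decoder model can be trained on both tasks at once: the dialect label is placed as a special token in the decoder's target sequence, either before or after the transcript. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not describe the three systems in detail, so only two orderings are shown, and the function name, the tag strings, and the dialect names are all hypothetical.

```python
from typing import List

# Hypothetical dialect tags; the abstract does not list the actual label set.
DIALECT_TAGS = {
    "standard": "<standard>",
    "tohoku": "<tohoku>",
    "kansai": "<kansai>",
}


def build_target(tokens: List[str], dialect: str, order: str) -> List[str]:
    """Build a decoder target that carries both a transcript and a dialect tag.

    order="did_first": the tag precedes the transcript, so the decoder
        predicts the dialect first and the transcript is conditioned on it.
    order="asr_first": the tag follows the transcript, so the dialect
        prediction is conditioned on the decoded text.
    """
    tag = DIALECT_TAGS[dialect]
    if order == "did_first":
        return ["<sos>", tag, *tokens, "<eos>"]
    if order == "asr_first":
        return ["<sos>", *tokens, tag, "<eos>"]
    raise ValueError(f"unknown order: {order!r}")


if __name__ == "__main__":
    transcript = ["お", "お", "き", "に"]
    print(build_target(transcript, "kansai", "did_first"))
    print(build_target(transcript, "kansai", "asr_first"))
```

In an autoregressive decoder, each output token is conditioned on the tokens before it, so the position of the tag determines which task can inform the other.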

DOI:10.1561/116.00000045

Introduction
Challenges with Multi-dialect Japanese
Transformer-based Network Architecture
Multi-task Learning of Japanese DID and MD-ASR
Experiments
Conclusion
References
