This is published under the terms of CC BY-NC.
End-to-end systems have demonstrated state-of-the-art performance on many tasks related to automatic speech recognition (ASR) and dialect identification (DID). In this paper, we propose multi-task learning of Japanese DID and multi-dialect ASR (MD-ASR) with end-to-end models. Because Japanese dialects differ from one another both linguistically and acoustically, Japanese DID must consider linguistic and acoustic features simultaneously. One way to exploit both kinds of features is to use ASR transcriptions when performing DID. However, transcribing Japanese multi-dialect speech is itself a challenging ASR task, because there are large linguistic and acoustic gaps between each dialect and standard Japanese. One remedy is dialect-aware ASR modeling, in which DID is performed together with ASR. DID and ASR thus depend on each other, and we propose a multi-task learning framework for Japanese DID and ASR that represents this dependency. Within the framework, we explore three systems that differ in the order in which DID and ASR are performed. In the experiments, Japanese multi-dialect ASR and DID tests were conducted on our in-house Japanese multi-dialect database and a standard Japanese database. The proposed transformer-based systems outperformed conventional single-task systems on both the DID and ASR tests.