now publishers - Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition

APSIPA Transactions on Signal and Information Processing > Vol 12 > Issue 1

Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition

Qinyi Wang, National University of Singapore, Singapore, qinyi@u.nus.edu; Xinyuan Zhou, Shanghai Normal University, China; Haizhou Li, National University of Singapore, Singapore and The Chinese University of Hong Kong (Shenzhen), China

Suggested Citation

Qinyi Wang, Xinyuan Zhou and Haizhou Li (2023), "Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition", APSIPA Transactions on Signal and Information Processing: Vol. 12: No. 1, e27. http://dx.doi.org/10.1561/116.00000001

Publication Date: 08 May 2023

Keywords

Speech recognition, language model, unpaired data, transformer, semi-supervised learning

Open Access

This is published under the terms of CC BY-NC.

Downloaded: 2656 times

In this article:

Abstract

End-to-end automatic speech recognition (ASR) models are typically data-hungry: they depend on a large paired speech-text dataset to be effective. How to increase the linguistic competence of such ASR models with unpaired text data remains an active research question. Conventional techniques that employ an external language model (LM) suffer from high decoding complexity. Pre-training methods suffer from catastrophic forgetting and from a model capacity gap between the pre-trained modules and the actual tasks. This paper introduces a speech-and-text Transformer that leverages unpaired text and addresses these issues. The decoder of the proposed speech-and-text Transformer contains three parallel branches that learn strong text representations from unpaired text and reduce the mismatch between the speech and text representations. An on-demand dual-modality attention mechanism is proposed to automatically select one or two modalities to learn from. We also introduce a novel alternate training algorithm that loads speech and text batches alternately and accumulates their gradients. The proposed model is trained with an auxiliary language modeling task. Intra-domain and cross-domain speech recognition experiments are conducted on the AISHELL-1, LibriSpeech, and WenetSpeech corpora. Results show performance competitive with the conventional shallow fusion method, with negligible computational overhead during inference.
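
The alternate training idea mentioned in the abstract (load a speech batch, then a text batch, accumulate both gradients, then update) can be illustrated with a toy sketch. This is not the authors' implementation: the single scalar parameter `w`, the squared-error stand-ins for the ASR and language modeling losses, the equal weighting of the two gradients, and the names `grad_squared_error` and `alternate_train` are all assumptions made here for illustration.

```python
from itertools import cycle


def mean(xs):
    return sum(xs) / len(xs)


def grad_squared_error(w, batch):
    """Gradient of mean((w - x)^2) over the batch with respect to w."""
    return 2.0 * (w - mean(batch))


def alternate_train(w, speech_batches, text_batches, lr=0.1, steps=50):
    """Toy alternate training: one speech batch, then one text batch,
    with gradients accumulated before a single parameter update."""
    speech_iter = cycle(speech_batches)
    text_iter = cycle(text_batches)
    for _ in range(steps):
        accumulated = 0.0
        # Speech batch: hypothetical stand-in for the paired ASR loss.
        accumulated += grad_squared_error(w, next(speech_iter))
        # Text batch: hypothetical stand-in for the auxiliary LM loss.
        accumulated += grad_squared_error(w, next(text_iter))
        # One update using the gradients accumulated from both modalities.
        w -= lr * accumulated
    return w


# Toy data: speech batches pull w toward 1.0, text batches toward 3.0.
speech_batches = [[0.5, 1.5], [1.0, 1.0]]
text_batches = [[2.5, 3.5], [3.0, 3.0]]
w_final = alternate_train(0.0, speech_batches, text_batches)
print(w_final)
```

Because both gradients are summed before each update, the shared parameter settles between the two objectives rather than oscillating toward whichever modality was seen last. In a real model the two batches would pass through different branches of the network, and the relative loss weights would be a design choice not specified in the abstract.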

DOI:10.1561/116.00000001

Introduction
Preliminaries
Speech-and-Text Transformer
Relations to Prior Work
Experimental Setup
Results
Analysis
Conclusion
References
