APSIPA Transactions on Signal and Information Processing > Vol 12 > Issue 1

Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition

Qinyi Wang, National University of Singapore, Singapore, qinyi@u.nus.edu, Xinyuan Zhou, Shanghai Normal University, China, Haizhou Li, National University of Singapore, Singapore and The Chinese University of Hong Kong (Shenzhen), China
Suggested Citation
Qinyi Wang, Xinyuan Zhou and Haizhou Li (2023), "Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition", APSIPA Transactions on Signal and Information Processing: Vol. 12: No. 1, e27. http://dx.doi.org/10.1561/116.00000001

Publication Date: 08 May 2023
© 2023 Q. Wang, X. Zhou and H. Li
Keywords: Speech recognition, language model, unpaired data, transformer, semi-supervised learning


Open Access

This is published under the terms of CC BY-NC.




End-to-end automatic speech recognition (ASR) models are typically data-hungry: they depend on a large paired speech-text dataset to be effective. How to increase the linguistic competence of such ASR models with unpaired text data remains an active research question. Conventional techniques that employ an external language model (LM) suffer from high decoding complexity. Pre-training methods suffer from catastrophic forgetting and from a model capacity gap between the pre-trained modules and the actual tasks. This paper introduces a speech-and-text Transformer that leverages unpaired text to address these issues. The decoder of the proposed speech-and-text Transformer contains three parallel branches that learn strong text representations from unpaired text and reduce the mismatch between the speech and text representations. An on-demand dual-modality attention mechanism is proposed to automatically select one or two modalities to learn from. In addition, we introduce a novel alternate training algorithm that loads speech and text batches alternately and accumulates their gradients. The proposed model is trained with an auxiliary language modeling task. Intra-domain and cross-domain speech recognition experiments are conducted on the AISHELL-1, LibriSpeech, and WenetSpeech corpora. Results show performance competitive with the conventional shallow fusion method, with negligible computational overhead during inference.
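
The alternate training idea mentioned in the abstract (loading speech and text batches in turn and accumulating their gradients before an update) can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation: the function names `alternate_batches`, `train_epoch`, and `toy_grad`, the quadratic toy loss, the averaging of accumulated gradients, and the choice of `accum_steps=2` are all assumptions made for illustration only.

```python
from itertools import zip_longest


def alternate_batches(speech_batches, text_batches):
    """Yield (modality, batch) pairs, alternating speech and text batches."""
    for speech, text in zip_longest(speech_batches, text_batches):
        if speech is not None:
            yield "speech", speech
        if text is not None:
            yield "text", text


def toy_grad(params, batch):
    """Hypothetical gradient of the toy loss (p - mean(batch))**2 per parameter."""
    target = sum(batch) / len(batch)
    return [2.0 * (p - target) for p in params]


def train_epoch(params, speech_batches, text_batches, grad_fns,
                lr=0.1, accum_steps=2):
    """Accumulate gradients over `accum_steps` alternating batches, then update.

    Assumption: one speech batch and one text batch contribute to each update,
    and the accumulated gradient is averaged before the step. Gradients left
    over at the end of the epoch (fewer than `accum_steps`) are discarded.
    """
    accum = [0.0] * len(params)
    updates = 0
    stream = alternate_batches(speech_batches, text_batches)
    for step, (modality, batch) in enumerate(stream, start=1):
        grads = grad_fns[modality](params, batch)
        accum = [a + g for a, g in zip(accum, grads)]
        if step % accum_steps == 0:
            params = [p - lr * a / accum_steps for p, a in zip(params, accum)]
            accum = [0.0] * len(params)
            updates += 1
    return params, updates


if __name__ == "__main__":
    speech = [[1.0], [1.0]]  # stand-ins for paired speech-text batches
    text = [[3.0], [3.0]]    # stand-ins for unpaired text batches
    fns = {"speech": toy_grad, "text": toy_grad}
    print(train_epoch([0.0], speech, text, fns, lr=0.25, accum_steps=2))
```

In a real system the two entries of `grad_fns` would presumably compute gradients of different losses (an ASR loss for speech batches and the auxiliary language modeling loss for text batches) over shared decoder parameters; the sketch only shows the batch interleaving and the delayed parameter update.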