This is published under the terms of CC BY-NC.
End-to-end automatic speech recognition (ASR) models are typically data-hungry: they rely on large paired speech-text datasets to be effective. How to improve the linguistic competence of such ASR models with unpaired text data remains an active research area. Conventional techniques that employ an external language model (LM) suffer from high decoding complexity, while pre-training methods suffer from catastrophic forgetting and a capacity gap between the pre-trained modules and the target task. This paper introduces a speech-and-text Transformer to leverage unpaired text and address these issues. The decoder of the proposed speech-and-text Transformer contains three parallel branches that learn strong text representations from unpaired text and reduce the mismatch between the speech and text representations. An on-demand dual-modality attention mechanism is proposed to automatically select one or both modalities to learn from. In addition, we introduce a novel alternate training algorithm that loads speech and text batches alternately and accumulates their gradients. The proposed model is trained with an auxiliary language modeling task. Intra-domain and cross-domain speech recognition experiments are conducted on the AISHELL-1, LibriSpeech, and WenetSpeech corpora. Results show performance competitive with the conventional shallow fusion method, with negligible computational overhead during inference.
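The alternate training algorithm mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function `alternate_train`, the gradient callback `grad_fn`, and the update callback `apply_update` are hypothetical names introduced here purely to show the pattern the abstract describes — a speech batch and a text batch are loaded in turn, their gradients are accumulated, and a single parameter update is applied per pair.

```python
from itertools import islice

def alternate_train(speech_loader, text_loader, grad_fn, apply_update, steps):
    """Hypothetical sketch of alternate training with gradient accumulation.

    grad_fn(batch, modality)  -> dict mapping parameter name to gradient value
    apply_update(grads)       -> applies one optimizer step with the given grads
    """
    n_updates = 0
    for speech_batch, text_batch in islice(zip(speech_loader, text_loader), steps):
        grads = grad_fn(speech_batch, modality="speech")   # ASR-loss gradients
        text_grads = grad_fn(text_batch, modality="text")  # LM-loss gradients
        for name, value in text_grads.items():             # accumulate per parameter
            grads[name] = grads.get(name, 0.0) + value
        apply_update(grads)                                # one update per batch pair
        n_updates += 1
    return n_updates

# Toy demonstration with scalar "gradients": each batch contributes the sum
# of its values as the gradient of a single parameter "w".
speech = [{"x": 1.0}, {"x": 2.0}]
text = [{"y": 1.0}, {"y": 2.0}]
applied = []
n_updates = alternate_train(speech, text,
                            lambda batch, modality: {"w": sum(batch.values())},
                            applied.append, steps=2)
```

In a real implementation the accumulation would happen on framework-level gradient buffers (e.g. by calling backward on both losses before a single optimizer step), but the control flow is the same.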