
Heterogeneous Convolutional Recurrent Neural Network with Attention Mechanism and Feature Aggregation for Voice Activity Detection

YingWei Tan, Volkswagen-Mobvoi (Beijing) Information Technology Co., Ltd., China, ywtan@vw-mobvoi.com, and XueFeng Ding, Volkswagen-Mobvoi (Beijing) Information Technology Co., Ltd., China
Suggested Citation
YingWei Tan and XueFeng Ding (2024), "Heterogeneous Convolutional Recurrent Neural Network with Attention Mechanism and Feature Aggregation for Voice Activity Detection", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 1, e6. http://dx.doi.org/10.1561/116.00000158

Publication Date: 29 Feb 2024
© 2024 Y. Tan and X. Ding


Open Access

This article is published under the terms of the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

Voice activity detection (VAD) is a fundamental prerequisite for speech processing tasks, particularly automatic speech recognition (ASR). Traditional supervised VAD systems employ a single type of network to acquire frame-level labels from the ASR pipeline, yet their detection performance often falls short, limiting these systems' ability to identify high-quality speech. In this study, we present a novel heterogeneous convolutional recurrent neural network (HCRNN) with an attention mechanism and feature aggregation for voice activity detection. This approach effectively integrates the advantages of distinct networks to achieve superior detection performance. We begin by presenting our detection framework, which employs a convolutional neural network (CNN) as the front end of a long short-term memory (LSTM) or gated recurrent unit (GRU) architecture. The feature map produced by this front-end CNN is then fed into the LSTM or GRU component of the system. The choice of an LSTM or GRU stems from their ability to model long-term dependencies between inputs, a crucial aspect of voice activity detection. To enhance the framework's performance, we introduce two novel attention mechanisms. The first fuses spatial and channel-wise information within local receptive fields: given an intermediate feature map, our module generates attention maps along two independent dimensions, channel and spatial, which are then multiplied with the input feature map to achieve adaptive feature refinement. The second attention mechanism discovers contextual features from embedded sequences using a multi-head self-attention (MHSA) layer, which allows the model to capture relationships between elements of a sequence and further enhances the representational power of the system.
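The channel- and spatial-attention refinement described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pooling choices, the two-layer gating MLP (`w1`, `w2`), and the reduction ratio are assumptions, and the spatial gate is simplified to a sum of the pooled maps rather than a learned convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fmap, w1, w2):
    """Channel gate: squeeze the spatial dims by average- and max-pooling,
    pass both through a shared two-layer bottleneck MLP, sum, and squash."""
    c = fmap.shape[0]
    avg = fmap.reshape(c, -1).mean(axis=1)        # (C,)
    mx = fmap.reshape(c, -1).max(axis=1)          # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # ReLU bottleneck MLP
    gate = sigmoid(mlp(avg) + mlp(mx))            # (C,), each value in (0, 1)
    return fmap * gate[:, None, None]

def spatial_attention(fmap):
    """Spatial gate: pool across channels with mean and max, then weight
    each time-frequency position (a learned conv is omitted for brevity)."""
    gate = sigmoid(fmap.mean(axis=0) + fmap.max(axis=0))  # (H, W)
    return fmap * gate[None, :, :]

# Toy intermediate feature map: 8 channels over a 4x5 time-frequency patch.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 5))
w1 = rng.standard_normal((2, 8)) * 0.1   # assumed reduction 8 -> 2 -> 8
w2 = rng.standard_normal((8, 2)) * 0.1
refined = spatial_attention(channel_attention(fmap, w1, w2))
print(refined.shape)  # (8, 4, 5): same shape, adaptively re-weighted
```

Because both gates lie in (0, 1), the refinement can only attenuate, never amplify, each feature map entry; training shapes which channels and positions are kept.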
Finally, the refined features from the LSTM or GRU back end are aggregated using either trainable scalar weights or vector-based attention weights. This aggregation step emphasizes the most relevant features, contributing to more accurate voice activity detection. To evaluate the efficacy of the proposed method, we conducted experiments on synthetic VAD datasets, Kaggle VAD datasets, and the AVA-Speech dataset. The results demonstrate that the proposed method outperforms the baseline CRNN in low signal-to-noise-ratio and noisy scenarios and is robust against various noise types. In summary, our framework effectively integrates the strengths of CNN and RNN (LSTM or GRU) architectures to improve detection performance, while the attention mechanisms and feature aggregation further optimize the system, making it a promising approach for voice activity detection.
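The two aggregation variants can be sketched as follows, assuming K parallel feature streams of shape (frames, dims) from the recurrent back end. The names `scalars` and `query` stand in for the trainable parameters and are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_scalar(feats, scalars):
    """Scalar aggregation: one trainable weight per stream, normalized
    with softmax and applied uniformly across frames and dimensions."""
    w = softmax(scalars)                   # (K,)
    return np.tensordot(w, feats, axes=1)  # (T, D)

def aggregate_vector(feats, query):
    """Vector attention: score each stream per frame against a trainable
    query vector, so the mixing weights can vary over time."""
    scores = feats @ query                     # (K, T)
    w = softmax(scores, axis=0)                # normalize over the K streams
    return (w[..., None] * feats).sum(axis=0)  # (T, D)

rng = np.random.default_rng(1)
K, T, D = 3, 10, 16              # 3 feature streams, 10 frames, 16 dims
feats = rng.standard_normal((K, T, D))
out_s = aggregate_scalar(feats, rng.standard_normal(K))
out_v = aggregate_vector(feats, rng.standard_normal(D))
print(out_s.shape, out_v.shape)  # (10, 16) (10, 16)
```

The scalar variant learns one global mixing ratio per stream, while the vector variant lets the mixture change frame by frame, which is the more expressive of the two.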