now publishers - A multi-branch ResNet with discriminative features for detection of replay speech signals

APSIPA Transactions on Signal and Information Processing > Vol 9 > Issue 1

A multi-branch ResNet with discriminative features for detection of replay speech signals

Xingliang Cheng, Tsinghua University, China; Mingxing Xu, Tsinghua University, China; Thomas Fang Zheng, Tsinghua University, China, fzheng@tsinghua.edu.cn

Suggested Citation

Xingliang Cheng, Mingxing Xu and Thomas Fang Zheng (2020), "A multi-branch ResNet with discriminative features for detection of replay speech signals", APSIPA Transactions on Signal and Information Processing: Vol. 9: No. 1, e28. http://dx.doi.org/10.1017/ATSIP.2020.26

Publication Date: 29 Dec 2020

Keywords

Anti-spoofing, Presentation attack detection, Replay attack, Speaker verification

Open Access

This is published under the terms of the Creative Commons Attribution licence.

Abstract

The security of automatic speaker verification (ASV) systems is gaining increasing attention. Replay attacks, one of the most common spoofing methods, are easy to implement but difficult to detect. Many researchers focus on designing features that capture the distortion introduced by replay attack attempts. Constant-Q cepstral coefficients (CQCC), based on the magnitude of the constant-Q transform (CQT), are one of the most prominent features in the field of replay detection. However, CQCC ignores phase information, which may also be distorted in the replay process. In this work, we propose a CQT-based modified group delay feature (CQTMGD), which captures the phase information of the CQT. Furthermore, a multi-branch residual convolution network, ResNeWt, is proposed to distinguish replay attacks from bona fide attempts. We evaluated our proposal on the ASVspoof 2019 physical access dataset. Results show that CQTMGD outperformed the traditional MGD feature, and that fusion with other magnitude-based and phase-based features achieved a further improvement. Our best fusion system achieved a min-tDCF of 0.0096 and an EER of 0.39% on the evaluation set, outperforming all the other state-of-the-art methods in the ASVspoof 2019 physical access challenge.
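The equal error rate (EER) quoted in the abstract is the operating point at which the false acceptance rate on spoofed trials equals the false rejection rate on bona fide trials. The following is a minimal illustrative sketch of that definition, not the scoring code used by the authors or by the challenge. The function name `equal_error_rate` and the convention that a higher score means "more likely bona fide" are assumptions made for this example.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumption (illustrative): a higher score means "more likely bona fide".
    Returns (eer, threshold).
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER lies where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    eer, threshold = equal_error_rate(
        [0.1, 0.6, 0.8, 0.9],  # bona fide trials
        [0.2, 0.3, 0.4, 0.7],  # replayed (spoofed) trials
    )
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch averages them at the closest threshold. Evaluation toolkits may interpolate differently, which can change the result slightly.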

DOI:10.1017/ATSIP.2020.26

I. INTRODUCTION
II. RELATED WORK
III. CQT-BASED MODIFIED GROUP DELAY FEATURE
IV. MULTI-BRANCH RESIDUAL NEURAL NETWORK
V. EXPERIMENTAL SETUP
VI. EXPERIMENTAL RESULTS
VII. DISCUSSION
VIII. CONCLUSION