APSIPA Transactions on Signal and Information Processing > Vol 12 > Issue 2

Speaker-Specific Articulatory Feature Extraction Based on Knowledge Distillation for Speaker Recognition

Qian-Bei Hong, Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, Taiwan
Chung-Hsien Wu, Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, and Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, chunghsienwu@gmail.com
Hsin-Min Wang, Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, Taiwan
 
Suggested Citation
Qian-Bei Hong, Chung-Hsien Wu and Hsin-Min Wang (2023), "Speaker-Specific Articulatory Feature Extraction Based on Knowledge Distillation for Speaker Recognition", APSIPA Transactions on Signal and Information Processing: Vol. 12: No. 2, e10. http://dx.doi.org/10.1561/116.00000150

Publication Date: 03 Apr 2023
© 2023 Q.-B. Hong, C.-H. Wu and H.-M. Wang
 
Keywords
Speaker recognition, articulatory feature, knowledge distillation
 

Open Access

This is published under the terms of CC BY-NC.

In this article:
Introduction 
Speaker Embedding Extraction 
Articulatory Feature Extraction 
Knowledge Distillation for Speaker Recognition 
Experimental Results 
Conclusion 
References 

Abstract

This paper proposes a novel speaker-specific articulatory feature (AF) extraction model based on knowledge distillation (KD) for speaker recognition. First, an AF extractor is trained as a teacher model for extracting the AF profiles of the input speaker dataset. Next, a KD-based speaker embedding extraction method is proposed to distill the speaker-specific information from the AF profiles in the teacher model to a student model based on multi-task learning, in which the lower layers not only capture the speaker characteristics from acoustic features, but also learn the speaker-specific features from the AF profiles for robust speaker representation. Finally, speaker embeddings are extracted from the high-level layer, and the obtained speaker embeddings are further used to train a probabilistic linear discriminant analysis (PLDA) model for speaker recognition. In the experiments, speaker embedding models were trained using the VoxCeleb2 dataset and the AF extractor was trained based on the LibriSpeech dataset, and the performance was evaluated using the VoxCeleb1 dataset. The experiments showed that the proposed KD-based models outperformed the baseline models without KD. Furthermore, feature concatenation of multimodal results can further improve the performance.
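The multi-task idea in the abstract, where a student model is trained on speaker labels while also learning from the teacher's AF profiles, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the loss functions, so the use of cross-entropy for the speaker task, mean squared error for the distillation term, and the names `multitask_loss` and `kd_weight` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Speaker-classification loss for one utterance (assumed form)."""
    return -math.log(softmax(logits)[label])


def mse(student_vec, teacher_vec):
    """Distillation term (assumed form): mean squared error between the
    student's AF prediction and the teacher's AF profile."""
    n = len(teacher_vec)
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / n


def multitask_loss(speaker_logits, speaker_label,
                   student_af, teacher_af, kd_weight=0.5):
    """Weighted sum of the speaker loss and the distillation loss.
    kd_weight is a hypothetical hyperparameter, not taken from the paper."""
    speaker_term = cross_entropy(speaker_logits, speaker_label)
    distill_term = mse(student_af, teacher_af)
    return speaker_term + kd_weight * distill_term


# Toy values: three candidate speakers, a four-dimensional AF profile.
speaker_logits = [2.0, 0.5, -1.0]
teacher_af = [0.9, 0.1, 0.0, 0.0]   # produced by the frozen teacher
student_af = [0.7, 0.2, 0.1, 0.0]   # predicted by the student's lower layers
loss = multitask_loss(speaker_logits, 0, student_af, teacher_af)
```

Setting `kd_weight` to zero reduces the objective to plain speaker classification, which corresponds to a baseline trained without KD in this simplified picture.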

DOI:10.1561/116.00000150

Companion

APSIPA Transactions on Signal and Information Processing Special Issue - Learning, Security, AIoT for Emerging Communication/Networking Systems