Speech emotion recognition based on listener-dependent emotion perception models

Atsushi Ando, NTT Corporation, Japan, atsushi.ando.hd@hco.ntt.co.jp; Takeshi Mori, NTT Corporation, Japan; Satoshi Kobashikawa, NTT Corporation, Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Atsushi Ando, Takeshi Mori, Satoshi Kobashikawa and Tomoki Toda (2021), "Speech emotion recognition based on listener-dependent emotion perception models", APSIPA Transactions on Signal and Information Processing: Vol. 10: No. 1, e6. http://dx.doi.org/10.1017/ATSIP.2021.7

Publication Date: 20 Apr 2021
© 2021 Atsushi Ando, Takeshi Mori, Satoshi Kobashikawa and Tomoki Toda
 
Keywords
Speech emotion recognition, perceived emotion, adaptation
 

Open Access

This article is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. RELATED WORK 
III. EMOTION RECOGNITION BY MAJORITY-VOTED MODEL 
IV. EMOTION RECOGNITION BY LD MODELS 
V. EXPERIMENTS 
VI. CONCLUSION

Abstract

This paper presents a novel speech emotion recognition scheme that leverages the individuality of emotion perception. Most conventional methods simply poll multiple listeners and directly model the majority decision as the perceived emotion. However, emotion perception varies from listener to listener, which forces the single models of conventional methods to learn a complex mixture of emotion perception criteria. To mitigate this problem, we propose a majority-voted emotion recognition framework that constructs listener-dependent (LD) emotion recognition models. The LD models can estimate not only listener-wise perceived emotions but also the majority decision, the latter obtained by averaging the outputs of multiple LD models. Three LD models, based on fine-tuning, auxiliary input, and sub-layer weighting, are introduced, all inspired by successful domain-adaptation frameworks in various speech processing tasks. Experiments on two emotional speech datasets demonstrate that the proposed approach outperforms conventional emotion recognition frameworks in both majority-voted and listener-wise perceived emotion recognition.
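To make the averaging step concrete, the sketch below shows one plausible reading of the idea in PyTorch: one classifier per listener produces a posterior over emotion classes, and the majority-voted emotion is estimated from the mean of those posteriors. This is a minimal illustration only; the network architecture and all names (LDEmotionClassifier, majority_voted_posterior, feature and class dimensions) are hypothetical and are not taken from the paper.

    import torch
    import torch.nn as nn

    # Hypothetical listener-dependent (LD) classifier: any network mapping an
    # utterance-level feature vector to posteriors over emotion classes.
    class LDEmotionClassifier(nn.Module):
        def __init__(self, feat_dim: int = 128, num_emotions: int = 4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 64),
                nn.ReLU(),
                nn.Linear(64, num_emotions),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Per-listener posterior P(emotion | utterance, listener).
            return torch.softmax(self.net(x), dim=-1)

    def majority_voted_posterior(ld_models, features: torch.Tensor) -> torch.Tensor:
        # Average the listener-wise posteriors; the argmax of the average
        # serves as the estimate of the majority-voted emotion.
        posteriors = torch.stack([m(features) for m in ld_models], dim=0)
        return posteriors.mean(dim=0)

    if __name__ == "__main__":
        torch.manual_seed(0)
        listeners = [LDEmotionClassifier() for _ in range(5)]  # one model per listener
        utterance = torch.randn(1, 128)                        # dummy acoustic features
        avg = majority_voted_posterior(listeners, utterance)
        print("Majority-voted emotion class:", avg.argmax(dim=-1).item())

Under this reading, each LD model can be queried on its own for listener-wise perceived emotion, while the averaged posterior approximates the majority decision over the listener pool.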

DOI: 10.1017/ATSIP.2021.7