now publishers - Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning

APSIPA Transactions on Signal and Information Processing > Vol 9 > Issue 1

Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning

Bagus Tris Atmaja, Japan Advanced Institute of Science and Technology, Japan, and Sepuluh Nopember Institute of Technology, Indonesia, bagus@jaist.ac.jp; Masato Akagi, Japan Advanced Institute of Science and Technology, Japan

Suggested Citation

Bagus Tris Atmaja and Masato Akagi (2020), "Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning", APSIPA Transactions on Signal and Information Processing: Vol. 9: No. 1, e17. http://dx.doi.org/10.1017/ATSIP.2020.14

Publication Date: 27 May 2020

Keywords

Speech emotion recognition, Multitask learning, Feature fusion, Dimensional emotion, Affective computing

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:

Abstract

Most research in speech emotion recognition (SER) focuses on recognizing emotion categories. Recognizing dimensional emotion attributes is also important, however, and has several advantages over categorical emotion. In this research, we investigate dimensional SER using both speech features and word embeddings. A concatenation network joins the acoustic and text networks built on these bimodal features. We demonstrate that these bimodal features, both extracted from speech, improve the performance of dimensional SER over unimodal SER using either acoustic features or word embeddings. Adding word embeddings to the SER system contributes a significant improvement on the valence dimension, while the arousal and dominance dimensions also improve. We propose a multitask learning (MTL) approach for predicting all emotional attributes. This MTL simultaneously maximizes the concordance correlation between predicted emotion degrees and true emotion labels. The findings suggest that MTL with two parameters is better than the other evaluated methods at representing the interrelation of emotional attributes. In the unimodal results, speech features attain higher performance on arousal and dominance, while word embeddings are better for predicting valence. The overall evaluation uses the concordance correlation coefficient score of the three emotional attributes. We also discuss some differences between categorical and dimensional emotion results from psychological and engineering perspectives.
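To make the objective named in the abstract concrete, the following is a minimal sketch of the concordance correlation coefficient (CCC) and of one way a two-parameter multitask loss over valence, arousal, and dominance could be combined. The CCC formula is the standard one. The weighting scheme (`alpha`, `beta`, and `1 - alpha - beta`), the default weight values, and the function names `ccc` and `mtl_ccc_loss` are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D sequences.

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred)^2)
    Undefined (division by zero) if both inputs are the same constant.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()  # population variances
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def mtl_ccc_loss(true, pred, alpha=0.5, beta=0.25):
    """Weighted sum of (1 - CCC) over the three emotional attributes.

    `true` and `pred` are dicts keyed by "valence", "arousal", "dominance".
    alpha and beta are hypothetical weights; the third weight is derived
    as 1 - alpha - beta so that only two parameters are free.
    """
    gamma = 1.0 - alpha - beta
    loss_v = 1.0 - ccc(true["valence"], pred["valence"])
    loss_a = 1.0 - ccc(true["arousal"], pred["arousal"])
    loss_d = 1.0 - ccc(true["dominance"], pred["dominance"])
    return alpha * loss_v + beta * loss_a + gamma * loss_d


# Toy labels for illustration only (not data from the article).
labels = {
    "valence": [1.0, 2.0, 3.0, 4.0],
    "arousal": [2.0, 2.5, 3.5, 4.0],
    "dominance": [3.0, 3.0, 4.0, 5.0],
}
perfect_loss = mtl_ccc_loss(labels, labels)
```

Minimizing `1 - CCC` is equivalent to maximizing CCC, so a loss of zero corresponds to perfect agreement on all three attributes. In a training setup the same expression would be written with the tensor operations of the chosen deep learning framework so that gradients can flow through it.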

DOI:10.1017/ATSIP.2020.14

I. INTRODUCTION
II. RELATED WORK
III. FEATURE SETS
IV. DIMENSIONAL SPEECH EMOTION RECOGNITION SYSTEM
V. MULTITASK LEARNING
VI. EXPERIMENTAL RESULTS AND DISCUSSION
VII. CONCLUSIONS
