now publishers - 3D skeletal movement-enhanced emotion recognition networks

APSIPA Transactions on Signal and Information Processing > Vol 10 > Issue 1

3D skeletal movement-enhanced emotion recognition networks

Jiaqi Shi, Graduate School of Engineering Science, Osaka University, Japan AND Guardian Robot Project, RIKEN, Japan, shi.jiaqi@irl.sys.es.osaka-u.ac.jp
Chaoran Liu, Advanced Telecommunications Research Institute International, Japan
Carlos Toshinori Ishi, Guardian Robot Project, RIKEN, Japan AND Advanced Telecommunications Research Institute International, Japan
Hiroshi Ishiguro, Graduate School of Engineering Science, Osaka University, Japan AND Advanced Telecommunications Research Institute International, Japan

Suggested Citation

Jiaqi Shi, Chaoran Liu, Carlos Toshinori Ishi and Hiroshi Ishiguro (2021), "3D skeletal movement-enhanced emotion recognition networks", APSIPA Transactions on Signal and Information Processing: Vol. 10: No. 1, e12. http://dx.doi.org/10.1017/ATSIP.2021.11

Publication Date: 05 Aug 2021

Keywords

Deep learning, emotion recognition, gesture, skeleton

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:

Abstract

Automatic emotion recognition has become an important trend in natural human–computer interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication and has a considerable impact on emotion recognition, it is rarely considered in emotion recognition studies. An important reason is the lack of large open-source emotional databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to the IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network that takes the extracted data as input to predict the speakers’ emotional state. We also propose a graph attention-based fusion method that combines our model with models using other modalities, providing complementary information in the emotion classification task and effectively fusing multimodal cues. The combined model utilizes audio signals, text information, and skeletal data. The model significantly outperforms the bimodal model and other fusion strategies, demonstrating the effectiveness of the method.

DOI:10.1017/ATSIP.2021.11

I. INTRODUCTION
II. RELATED STUDIES
III. METHODOLOGY
IV. EXPERIMENTS AND RESULTS
V. ABLATION STUDY AND DISCUSSION
VI. CONCLUSIONS
FINANCIAL SUPPORT
