APSIPA Transactions on Signal and Information Processing > Vol 14 > Issue 1

Speech Emotion Recognition Using Sequences of Fine-grained Emotion Labels with Phoneme Class Attributes

Ryotaro Nagase, Ritsumeikan University, Japan, rnagase@fc.ritsumei.ac.jp; Takahiro Fukumori, Ritsumeikan University, Japan; Yoichi Yamashita, Ritsumeikan University, Japan
 
Suggested Citation
Ryotaro Nagase, Takahiro Fukumori and Yoichi Yamashita (2025), "Speech Emotion Recognition Using Sequences of Fine-grained Emotion Labels with Phoneme Class Attributes", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e17. http://dx.doi.org/10.1561/116.20240077

Publication Date: 17 Jul 2025
© 2025 R. Nagase, T. Fukumori and Y. Yamashita
 
Subjects
Deep learning, Classification and prediction, Speech and spoken language processing
 
Keywords
Speech emotion recognition, deep learning, emotion label sequence, phoneme class attribute
 

Open Access

This is published under the terms of CC BY-NC.

In this article:
Introduction 
SER Using Emotion Label Sequences 
SER Using Emotion Label Sequences with Phoneme Class Attributes 
Experimental Setup 
Results 
Conclusion 
Biographies 
References 

Abstract

Speech emotion recognition (SER), which predicts the emotions conveyed by speech, has been actively studied using deep learning. Our study focuses on methods that recognize emotions at the frame level. One challenge with this approach is that the emotion label sequences used to train frame-based SER do not sufficiently account for phonemic characteristics. To overcome this limitation, we propose new frame-based SER methods using fine-grained emotion label sequences that consider phoneme class attributes, such as vowels, voiced consonants, unvoiced consonants, and other symbols. We found that the proposed methods improve utterance- and frame-level performance compared with conventional methods.
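
To make the idea of a fine-grained emotion label sequence concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a phoneme alignment given as (phoneme, start frame, end frame) triples and expands one utterance-level emotion into per-frame labels tagged with a phoneme class. The phoneme sets, the class codes (`V`, `VC`, `UC`, `O`), the `emotion_class` label format, and the function names are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical phoneme inventory (assumption, not taken from the article),
# grouped into the four classes named in the abstract.
VOWELS = {"a", "i", "u", "e", "o"}
VOICED_CONSONANTS = {"b", "d", "g", "z", "m", "n", "r", "w", "y"}
UNVOICED_CONSONANTS = {"p", "t", "k", "s", "h"}


def phoneme_class(phoneme: str) -> str:
    """Map a phoneme symbol to a class code; anything unknown is 'O' (other)."""
    if phoneme in VOWELS:
        return "V"
    if phoneme in VOICED_CONSONANTS:
        return "VC"
    if phoneme in UNVOICED_CONSONANTS:
        return "UC"
    return "O"


def fine_grained_labels(
    emotion: str, alignment: List[Tuple[str, int, int]]
) -> List[str]:
    """Expand an utterance-level emotion into one label per frame.

    Each alignment entry is (phoneme, start_frame, end_frame) with the end
    frame exclusive. The 'emotion_class' label format is an assumption.
    """
    labels: List[str] = []
    for phoneme, start, end in alignment:
        label = f"{emotion}_{phoneme_class(phoneme)}"
        labels.extend([label] * (end - start))
    return labels


# Example: two frames of silence, one frame of /k/, two frames of /a/.
example = fine_grained_labels("happy", [("sil", 0, 2), ("k", 2, 3), ("a", 3, 5)])
```

In this sketch a frame-based classifier would be trained on labels such as `happy_V` rather than on a plain `happy` repeated for every frame, which is one way the label sequence could reflect phonemic characteristics.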

DOI:10.1561/116.20240077