Recently, speech emotion recognition (SER), which predicts the emotions conveyed by speech, has been actively studied using deep learning. Our study focuses on a method of recognizing emotions at the frame level. One challenge with this approach is that the emotion label sequences used for training frame-based SER models do not sufficiently account for phonemic characteristics. To overcome this limitation, we propose new frame-based SER methods using fine-grained emotion label sequences that consider phoneme class attributes, such as vowels, voiced consonants, unvoiced consonants, and other symbols. As a result, we found that the proposed methods improve both utterance- and frame-level performance compared with conventional methods.
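To make the idea of fine-grained, phoneme-class-aware labels concrete, the following minimal sketch shows one possible way such a frame-level label sequence could be derived from an utterance-level emotion label and a per-frame phoneme-class alignment. The class names, the labeling rule, and the function itself are illustrative assumptions, not the exact procedure used in this paper.

```python
# Hypothetical sketch: build a fine-grained frame-level emotion label sequence
# from an utterance-level emotion and a per-frame phoneme-class alignment.
# The class set and labeling rule below are assumptions for illustration only.

from typing import List

# Assumed phoneme classes: 'V' (vowel), 'VC' (voiced consonant),
# 'UC' (unvoiced consonant), 'SYM' (other symbols, e.g. silence/pause).
PHONEME_CLASSES = {"V", "VC", "UC", "SYM"}


def fine_grained_labels(frame_phoneme_classes: List[str],
                        utterance_emotion: str) -> List[str]:
    """Assign a label to each frame based on its phoneme class.

    Frames assumed to carry strong voicing cues (vowels, voiced consonants)
    keep the utterance-level emotion; other frames receive class-specific
    tags so a frame-based model can treat them differently.
    """
    labels = []
    for cls in frame_phoneme_classes:
        if cls not in PHONEME_CLASSES:
            raise ValueError(f"unknown phoneme class: {cls}")
        if cls in ("V", "VC"):          # emotion-bearing frames (assumption)
            labels.append(utterance_emotion)
        elif cls == "UC":               # unvoiced consonant frames
            labels.append(f"{utterance_emotion}-unvoiced")
        else:                           # pauses / other symbols
            labels.append("non-speech")
    return labels


if __name__ == "__main__":
    # Example: a short phoneme-class alignment for a "happy" utterance.
    frames = ["SYM", "UC", "V", "V", "VC", "V", "SYM"]
    print(fine_grained_labels(frames, "happy"))
    # ['non-speech', 'happy-unvoiced', 'happy', 'happy', 'happy', 'happy', 'non-speech']
```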