
Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence

Xuan Luo, The University of Tokyo, Japan; Shinnosuke Takamichi, The University of Tokyo, Japan, shinnosuke_takamichi@ipc.i.u-tokyo; Yuki Saito, The University of Tokyo, Japan; Tomoki Koriyama, CyberAgent, Japan; Hiroshi Saruwatari, The University of Tokyo, Japan
 
Suggested Citation
Xuan Luo, Shinnosuke Takamichi, Yuki Saito, Tomoki Koriyama and Hiroshi Saruwatari (2024), "Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 1, e2. http://dx.doi.org/10.1561/116.00000242

Publication Date: 13 Feb 2024
© 2024 X. Luo, S. Takamichi, Y. Saito, T. Koriyama and H. Saruwatari
 
Keywords
Emotion-controllable speech synthesis, Expressive speech synthesis, Controllable speech synthesis, Text to speech, Speech emotion recognition
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Related Work 
Proposed Work 
Experimental Setup 
Evaluation 
Conclusion and Discussion 
Appendix 
References 

Abstract

We propose a two-stage emotion-controllable text-to-speech (TTS) model that increases the diversity of intra-emotion variation while preserving inter-emotion controllability in synthesized speech. Conventional emotion-controllable TTS models increase the diversity of intra-emotion variation by controlling fine-grained emotion strengths; however, such models cannot control prosodic factors such as pitch. Other methods condition TTS models directly on intuitive prosodic factors but cannot control emotions. Our two-stage model extends Tacotron2 with a speech emotion recognizer (SER) and a prosodic factor generator (PFG) to solve this problem. In the first stage, we condition the model on emotion soft labels predicted by the SER model to enable inter-emotion controllability. In the second stage, we fine-condition the model on utterance-level prosodic factors and word-level prominence generated by the PFG model from the emotion soft labels, which provides intra-emotion diversity. This two-stage control design increases intra-emotion diversity at both the utterance and word levels while preserving inter-emotion controllability. Experiments showed 1) 51% emotion-distinguishable accuracy on average when conditioning on soft labels of three emotions, 2) an average linear controllability score of 0.95 when fine-conditioning on prosodic factors and prominence, and 3) audio quality comparable to that of conventional models.

DOI:10.1561/116.00000242
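
The two-stage control design described in the abstract can be summarized as: a speech emotion recognizer produces an emotion soft label (stage 1), and a prosodic factor generator maps that label to utterance-level prosodic factors and word-level prominence (stage 2), both of which condition the acoustic model. Below is a minimal, hypothetical sketch of this data flow in PyTorch; all module names, dimensions, and emotion categories are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two-stage conditioning flow.
# The real model extends Tacotron2 with an SER and a PFG; here both stages
# are stand-in linear layers with made-up dimensions.
import torch
import torch.nn as nn


class ProsodicFactorGenerator(nn.Module):
    """Maps an emotion soft label to utterance-level prosodic factors
    and word-level prominence scores (hypothetical dimensions)."""

    def __init__(self, n_emotions=3, n_utt_factors=4, max_words=20):
        super().__init__()
        self.utt_head = nn.Linear(n_emotions, n_utt_factors)  # e.g. pitch, energy, speed, ...
        self.word_head = nn.Linear(n_emotions, max_words)     # per-word prominence

    def forward(self, soft_label):
        return self.utt_head(soft_label), self.word_head(soft_label)


class TwoStageEmotionTTS(nn.Module):
    """Toy stand-in for a Tacotron2-style acoustic model conditioned on the
    soft label (stage 1) and fine-conditioned on prosodic factors and
    prominence (stage 2)."""

    def __init__(self, text_dim=64, n_emotions=3, n_utt_factors=4,
                 max_words=20, mel_dim=80):
        super().__init__()
        cond_dim = n_emotions + n_utt_factors + max_words
        self.decoder = nn.Linear(text_dim + cond_dim, mel_dim)

    def forward(self, text_embedding, soft_label, utt_factors, prominence):
        cond = torch.cat([soft_label, utt_factors, prominence], dim=-1)
        return self.decoder(torch.cat([text_embedding, cond], dim=-1))


if __name__ == "__main__":
    # Stage 1: an SER model would predict the soft label from reference
    # speech; here we hand-pick an interpolated label for three emotions.
    soft_label = torch.tensor([[0.7, 0.2, 0.1]])  # e.g. happy / neutral / sad

    # Stage 2: derive fine-grained controls from the soft label with the PFG,
    # then condition the acoustic model on both levels of control.
    pfg = ProsodicFactorGenerator()
    utt_factors, prominence = pfg(soft_label)

    tts = TwoStageEmotionTTS()
    mel = tts(torch.randn(1, 64), soft_label, utt_factors, prominence)
    print(mel.shape)  # torch.Size([1, 80])
```

Because the fine-grained controls are generated from (rather than bundled into) the soft label, the soft label can be edited to move between emotions while the prosodic factors and prominence can be adjusted independently within an emotion, which is the source of the intra-emotion diversity claimed in the abstract.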