We investigate hierarchical emotion distribution (ED) for achieving multi-level quantitative control of emotion rendering in text-to-speech (TTS) synthesis. We introduce a novel multi-step hierarchical ED prediction module that quantifies emotion variance at the utterance, word, and phoneme levels. By predicting emotion variance in a multi-step manner, we leverage global emotional context to refine local emotional variations, thereby capturing the intrinsic hierarchical structure of speech emotion. Our approach is validated through its integration into a variance adaptor and through an external-module design compatible with various TTS systems. Both objective and subjective evaluations demonstrate that the proposed framework significantly enhances emotional expressiveness and enables precise control of emotion rendering across multiple speech granularities.
APSIPA Transactions on Signal and Information Processing Special Issue - Invited Papers from APSIPA ASC 2024