
Text- and Speech-style Control for Lecture Speech Generation Focusing on Disfluency

Daiki Yoshioka, Nagoya University, Japan, yoshioka.daiki@g.sp.m.is.nagoya-u.ac.jp, Yuuto Nakata, Tokuyama College, Japan, Yusuke Yasuda, Nagoya University, Japan, Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Daiki Yoshioka, Yuuto Nakata, Yusuke Yasuda and Tomoki Toda (2025), "Text- and Speech-style Control for Lecture Speech Generation Focusing on Disfluency", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e26. http://dx.doi.org/10.1561/116.20250005

Publication Date: 24 Sep 2025
© 2025 D. Yoshioka, Y. Nakata, Y. Yasuda and T. Toda
 
Subjects
Speech and spoken language processing
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Related Works 
Proposed Method 
Experimental Evaluations 
Conclusion 
References

Abstract

In this paper, we propose text style transfer (TST) and text-to-speech synthesis (TTS) using disfluency annotation for the application of spontaneous speech synthesis from written text. TTS technology has progressed significantly, achieving human-like naturalness in reading-style speech generation. However, it remains less mature at producing more spontaneous, human-like speech. Moreover, existing spontaneous speech synthesizers assume that the input text already contains spontaneous elements such as disfluencies. We therefore aim to synthesize spontaneous speech containing disfluencies from written materials that lack them. Specifically, we train TST and TTS systems for lecture speech generation by tagging disfluencies with special symbols, or by converting disfluencies into special symbols, to enhance each model's linguistic and acoustic control over disfluencies. We combine the TST and TTS systems using disfluency annotation into a lecture speech generation system, and we demonstrate the effectiveness of our method by comparing objective and subjective evaluation results with those obtained without disfluency annotation.
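To illustrate the kind of symbol-based disfluency annotation described in the abstract, the minimal sketch below tags filler words in a transcript with a special symbol, or masks them with a placeholder token. This is a sketch under assumed conventions only: the filler lexicon, the <F>/<DISFL> symbols, and the function names are hypothetical and do not reproduce the paper's actual annotation scheme.

# Minimal sketch of tag-based disfluency annotation (assumptions: a toy
# filler lexicon and <F>/<DISFL> symbols; not the paper's actual scheme).
FILLERS = {"uh", "um", "eh", "well"}  # hypothetical filler lexicon

def tag_disfluencies(transcript: str) -> str:
    """Wrap filler words in special tags so a TST/TTS model can treat
    them as explicit disfluency positions rather than ordinary words."""
    tokens = transcript.split()
    tagged = [f"<F> {t} </F>" if t.lower().strip(",.") in FILLERS else t
              for t in tokens]
    return " ".join(tagged)

def mask_disfluencies(transcript: str) -> str:
    """Alternative: replace each filler with a single placeholder symbol,
    leaving fluent text (as in written materials) plus disfluency markers."""
    tokens = transcript.split()
    masked = ["<DISFL>" if t.lower().strip(",.") in FILLERS else t
              for t in tokens]
    return " ".join(masked)

if __name__ == "__main__":
    spoken = "uh today we will, um, talk about speech synthesis"
    print(tag_disfluencies(spoken))
    # <F> uh </F> today we will, <F> um, </F> talk about speech synthesis
    print(mask_disfluencies(spoken))
    # <DISFL> today we will, <DISFL> talk about speech synthesis

Either representation gives the downstream model an explicit handle on where disfluencies occur, which is the general idea behind enhancing linguistic and acoustic control over them.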

DOI: 10.1561/116.20250005