
Text- and Speech-style Control for Lecture Speech Generation Focusing on Disfluency

Daiki Yoshioka, Nagoya University, Japan, yoshioka.daiki@g.sp.m.is.nagoya-u.ac.jp, Yuuto Nakata, Tokuyama College, Japan, Yusuke Yasuda, Nagoya University, Japan, Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Daiki Yoshioka, Yuuto Nakata, Yusuke Yasuda and Tomoki Toda (2025), "Text- and Speech-style Control for Lecture Speech Generation Focusing on Disfluency", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e26. http://dx.doi.org/10.1561/116.20250005

Publication Date: 24 Sep 2025
© 2025 D. Yoshioka, Y. Nakata, Y. Yasuda and T. Toda
 
Subjects
Speech and spoken language processing
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Related Works 
Proposed Method 
Experimental Evaluations 
Conclusion 
References

Abstract

In this paper, we propose text style transfer (TST) and text-to-speech synthesis (TTS) using disfluency annotation for the application of spontaneous speech synthesis from written text. TTS technology has progressed significantly, achieving human-like naturalness in reading-style speech generation. However, it remains less mature at producing more spontaneous, human-like speech. Moreover, existing spontaneous speech synthesizers assume that the input text already contains spontaneous elements such as disfluencies. We therefore aim to synthesize spontaneous speech containing disfluencies from written materials that lack them. Specifically, we train TST and TTS systems for lecture speech generation by tagging disfluencies with special symbols, or by converting disfluencies into special symbols, to enhance each model's linguistic and acoustic control over disfluencies. We combine the TST and TTS systems using disfluency annotation into a lecture speech generation system, and we demonstrate the effectiveness of our method by comparing objective and subjective evaluation results with those obtained without disfluency annotation.
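To illustrate the kind of symbol-based disfluency annotation described in the abstract, the minimal sketch below tags filler words in a transcript with a special symbol, or masks them with a placeholder token. This is a sketch under assumed conventions only: the filler lexicon, the <F>/<DISFL> symbols, and the function names are hypothetical and do not reproduce the paper's actual annotation scheme.

# Minimal sketch of tag-based disfluency annotation (assumptions: a toy
# filler lexicon and <F>/<DISFL> symbols; not the paper's actual scheme).
FILLERS = {"uh", "um", "eh", "well"}  # hypothetical filler lexicon

def tag_disfluencies(transcript: str) -> str:
    """Wrap filler words in special tags so a TST/TTS model can treat
    them as explicit disfluency positions rather than ordinary words."""
    tokens = transcript.split()
    tagged = [f"<F> {t} </F>" if t.lower().strip(",.") in FILLERS else t
              for t in tokens]
    return " ".join(tagged)

def mask_disfluencies(transcript: str) -> str:
    """Alternative: replace each filler with a single placeholder symbol,
    leaving fluent text (as in written materials) plus disfluency markers."""
    tokens = transcript.split()
    masked = ["<DISFL>" if t.lower().strip(",.") in FILLERS else t
              for t in tokens]
    return " ".join(masked)

if __name__ == "__main__":
    spoken = "uh today we will, um, talk about speech synthesis"
    print(tag_disfluencies(spoken))
    # <F> uh </F> today we will, <F> um, </F> talk about speech synthesis
    print(mask_disfluencies(spoken))
    # <DISFL> today we will, <DISFL> talk about speech synthesis

Either representation gives the downstream model an explicit handle on where disfluencies occur, which is the general idea behind enhancing linguistic and acoustic control over them.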

DOI: 10.1561/116.20250005