
Sequence-to-sequence Voice Conversion-based Techniques for Electrolaryngeal Speech Enhancement in Noisy and Reverberant Conditions

Ding Ma, Nagoya University, Japan, ding.ma@g.sp.m.is.nagoya-u.ac.jp; Yeonjong Choi, Nagoya University, Japan; Takuya Fujimura, Nagoya University, Japan; Fengji Li, Beihang University, China; Chao Xie, Nagoya University, Japan; Kazuhiro Kobayashi, Nagoya University, Japan and TARVO, Inc., Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Ding Ma, Yeonjong Choi, Takuya Fujimura, Fengji Li, Chao Xie, Kazuhiro Kobayashi and Tomoki Toda (2025), "Sequence-to-sequence Voice Conversion-based Techniques for Electrolaryngeal Speech Enhancement in Noisy and Reverberant Conditions", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e8. http://dx.doi.org/10.1561/116.20240094

Publication Date: 22 May 2025
© 2025 D. Ma, Y. Choi, T. Fujimura, F. Li, C. Xie, K. Kobayashi and T. Toda
 
Subjects
Speech and spoken language processing,  Audio signal processing,  Biological and biomedical signal processing,  Enhancement,  Deep learning
 
Keywords
Electrolaryngeal speech, sequence-to-sequence voice conversion, real-world scenarios, noisy, reverberant
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Background and Related Works 
Proposed Method 
Experimental Evaluations 
Discussion and Conclusion 
Acknowledgements 
References 

Abstract

Electrolaryngeal (EL) speech is artificial speech produced using an electrolarynx to help laryngectomees communicate without vocal fold vibrations. Compared with normal speech, EL speech lacks essential phonetic features and differs in temporal structure, resulting in poor naturalness, speaker identity, and intelligibility. Sequence-to-sequence (seq2seq) voice conversion (VC) has emerged as a promising technique for overcoming the challenges of EL-speech-to-normal-speech conversion (EL2SP). Nonetheless, most VC studies on EL2SP focus on converting clean EL speech, overlooking real-world scenarios where EL speech is corrupted by background noise and reverberation. To address this, we propose a novel seq2seq VC-based training method. In contrast to approaches that rely on extra augmentation modules to handle such interference, our method requires only a single framework. First, we pretrained a normal-to-normal seq2seq VC model adapted from a text-to-speech model. Then, we applied two-stage fine-tuning in a many-to-one style, leveraging pseudo-noisy and reverberant EL speech data generated from limited clean data. We evaluated several system designs of our method and analyzed their intermediate representations to understand their role in filtering out the interference. Comparative experiments demonstrated that our method significantly outperforms EL2SP baselines, capably handling both clean and noisy-reverberant EL speech, which sheds light on possible directions for improvement.
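
The data-simulation step mentioned in the abstract, generating pseudo-noisy and reverberant EL speech from limited clean recordings, can be illustrated with a minimal sketch. The function name add_noise_and_reverb, the default 10 dB SNR, and the use of NumPy/SciPy are assumptions for illustration only, not the authors' exact recipe (see the Proposed Method section for the actual setup).

```python
# Minimal sketch of pseudo noisy-reverberant data generation, assuming
# clean EL speech, a noise clip, and a room impulse response (RIR) are
# available as 1-D NumPy arrays at the same sampling rate.
import numpy as np
from scipy.signal import fftconvolve


def add_noise_and_reverb(clean: np.ndarray,
                         noise: np.ndarray,
                         rir: np.ndarray,
                         snr_db: float = 10.0) -> np.ndarray:
    """Convolve clean EL speech with an RIR, then mix in noise at snr_db."""
    # Reverberation: convolve with the RIR and trim to the original length.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Tile or crop the noise so it matches the speech length.
    if len(noise) < len(reverberant):
        noise = np.tile(noise, int(np.ceil(len(reverberant) / len(noise))))
    noise = noise[: len(reverberant)]

    # Scale the noise so the mixture reaches the target signal-to-noise ratio.
    speech_power = np.mean(reverberant ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    noisy_reverberant = reverberant + scale * noise

    # Normalize only if needed, to avoid clipping when saved as 16-bit audio.
    peak = np.max(np.abs(noisy_reverberant)) + 1e-12
    return noisy_reverberant / max(1.0, peak)
```

In practice, noise segments and RIRs would be sampled from noise and RIR corpora at randomized SNRs so that the fine-tuning data covers a range of acoustic conditions.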

DOI: 10.1561/116.20240094