
Sequence-to-sequence Voice Conversion-based Techniques for Electrolaryngeal Speech Enhancement in Noisy and Reverberant Conditions

Ding Ma, Nagoya University, Japan, ding.ma@g.sp.m.is.nagoya-u.ac.jp; Yeonjong Choi, Nagoya University, Japan; Takuya Fujimura, Nagoya University, Japan; Fengji Li, Beihang University, China; Chao Xie, Nagoya University, Japan; Kazuhiro Kobayashi, Nagoya University, Japan and TARVO, Inc., Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Ding Ma, Yeonjong Choi, Takuya Fujimura, Fengji Li, Chao Xie, Kazuhiro Kobayashi and Tomoki Toda (2025), "Sequence-to-sequence Voice Conversion-based Techniques for Electrolaryngeal Speech Enhancement in Noisy and Reverberant Conditions", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e8. http://dx.doi.org/10.1561/116.20240094

Publication Date: 22 May 2025
© 2025 D. Ma, Y. Choi, T. Fujimura, F. Li, C. Xie, K. Kobayashi and T. Toda
 
Subjects
Speech and spoken language processing,  Audio signal processing,  Biological and biomedical signal processing,  Enhancement,  Deep learning
 
Keywords
Electrolaryngeal speech, sequence-to-sequence voice conversion, real-world scenarios, noisy, reverberant
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Background and Related Works 
Proposed Method 
Experimental Evaluations 
Discussion and Conclusion 
Acknowledgements 
References 

Abstract

Electrolaryngeal (EL) speech is artificial speech produced using an electrolarynx to help laryngectomees communicate without vocal fold vibrations. Compared with normal speech, EL speech lacks essential phonetic features and differs in temporal structure, resulting in poor naturalness, speaker identity, and intelligibility. Sequence-to-sequence (seq2seq) voice conversion (VC) has emerged as a promising technique for overcoming the challenges of EL-speech-to-normal-speech conversion (EL2SP). Nonetheless, most VC studies on EL2SP focus on converting clean EL speech, overlooking real-world scenarios where EL speech is corrupted by background noise and reverberation. To address this, we propose a novel seq2seq VC-based training method. In contrast to approaches that rely on extra augmentation modules to handle such interference, our method requires only a single framework. First, we pretrained a normal-to-normal seq2seq VC model adapted from a text-to-speech model. Then, we applied two-stage fine-tuning in a many-to-one style, leveraging pseudo-noisy and reverberant EL speech data generated from limited clean data. We evaluated several system designs of our method and analyzed their intermediate representations to understand their role in filtering out the interference. Comparative experiments demonstrated that our method significantly outperforms EL2SP baselines, capably handling both clean and noisy-reverberant EL speech, which sheds light on possible directions for improvement.
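
The data-simulation step mentioned in the abstract, generating pseudo-noisy and reverberant EL speech from limited clean recordings, can be illustrated with a minimal sketch. The function name add_noise_and_reverb, the default 10 dB SNR, and the use of NumPy/SciPy are assumptions for illustration only, not the authors' exact recipe (see the Proposed Method section for the actual setup).

```python
# Minimal sketch of pseudo noisy-reverberant data generation, assuming
# clean EL speech, a noise clip, and a room impulse response (RIR) are
# available as 1-D NumPy arrays at the same sampling rate.
import numpy as np
from scipy.signal import fftconvolve


def add_noise_and_reverb(clean: np.ndarray,
                         noise: np.ndarray,
                         rir: np.ndarray,
                         snr_db: float = 10.0) -> np.ndarray:
    """Convolve clean EL speech with an RIR, then mix in noise at snr_db."""
    # Reverberation: convolve with the RIR and trim to the original length.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Tile or crop the noise so it matches the speech length.
    if len(noise) < len(reverberant):
        noise = np.tile(noise, int(np.ceil(len(reverberant) / len(noise))))
    noise = noise[: len(reverberant)]

    # Scale the noise so the mixture reaches the target signal-to-noise ratio.
    speech_power = np.mean(reverberant ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    noisy_reverberant = reverberant + scale * noise

    # Normalize only if needed, to avoid clipping when saved as 16-bit audio.
    peak = np.max(np.abs(noisy_reverberant)) + 1e-12
    return noisy_reverberant / max(1.0, peak)
```

In practice, noise segments and RIRs would be sampled from noise and RIR corpora at randomized SNRs so that the fine-tuning data covers a range of acoustic conditions.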

DOI: 10.1561/116.20240094