APSIPA Transactions on Signal and Information Processing > Vol 14 > Issue 1

An Investigation of Noisy-to-noisy Voice Conversion Performance in Various Noisy Conditions

Chao Xie, Nagoya University, Japan, xie.chao@g.sp.m.is.nagoya-u.ac.jp, Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Chao Xie and Tomoki Toda (2025), "An Investigation of Noisy-to-noisy Voice Conversion Performance in Various Noisy Conditions", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e10. http://dx.doi.org/10.1561/116.20250008

Publication Date: 10 Jun 2025
© 2025 C. Xie and T. Toda
 
Subjects
Speech and spoken language processing, Denoising, Deep learning
 
Keywords
Voice conversion (VC), noisy-to-noisy VC, noisy speech modeling, mutual information, noise dropout
 

Open Access

This is published under the terms of CC BY-NC.

In this article:
Introduction 
Related Work 
Analysis of N2N-VC Performance Degradation 
Proposed Method 
Experimental Setup 
Experimental Results 
Conclusion 
Appendix: Supplementary Evaluation Results 
References 

Abstract

Voice conversion (VC) in a noisy-to-noisy (N2N) scenario aims to convert the speaker identity of noisy speech to a target speaker while preserving both the linguistic content and background noise. In our previous work, we proposed an N2N framework for this conversion. Notably, our VC approach relies solely on noisy speech data for training without requiring clean speech data from either the source or target speakers. Additionally, the framework enables the retention or removal of the noise component in the converted speech during conversion. However, significant performance degradation was observed in the N2N framework when certain noisy conditions were present in the training data. In this paper, we further investigate adverse noisy conditions affecting our framework’s performance. We identify two key factors contributing to performance degradation: the lack of noise diversity leading to feature entanglement and noise bias during training. To address these issues, we introduce a mutual information approximation and a noise dropout strategy into the N2N framework. Objective and subjective evaluations validate the effectiveness of our approach in improving converted speech quality and mitigating VC performance degradation under adverse noisy conditions.

DOI:10.1561/116.20250008