now publishers - An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder

APSIPA Transactions on Signal and Information Processing > Vol 9 > Issue 1

An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder

Patrick Lumban Tobing, Graduate School of Information Science, Nagoya University, Japan, patrick.lumbantobing@g.sp.m.is.nagoya-u.ac.jp; Yi-Chiao Wu, Graduate School of Information Science, Nagoya University, Japan; Tomoki Hayashi, Graduate School of Information Science, Nagoya University, Japan; Kazuhiro Kobayashi, Information Technology Center, Nagoya University, Japan; Tomoki Toda, Information Technology Center, Nagoya University, Japan

Suggested Citation

Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi and Tomoki Toda (2020), "An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder", APSIPA Transactions on Signal and Information Processing: Vol. 9: No. 1, e26. http://dx.doi.org/10.1017/ATSIP.2020.24

Publication Date: 25 Nov 2020

Keywords

Voice conversion, Neural network, Spectral mapping, WaveNet vocoder, Oversmoothed parameters

Open Access

This is published under the terms of the Creative Commons Attribution licence.

Abstract

This paper presents an evaluation of parallel voice conversion (VC) with neural network (NN)-based statistical models for spectral mapping and waveform generation. The NN-based architectures for spectral mapping include deep NN (DNN), deep mixture density network (DMDN), and recurrent NN (RNN) models. The WaveNet (WN) vocoder is employed as a high-quality NN-based waveform generation model. In VC, though, quality degradation still occurs owing to the oversmoothed characteristics of the estimated speech parameters. To address this problem, we utilize post-conversion for the converted features based on direct waveform modification with spectrum differential and a global variance postfilter. To preserve consistency with the post-conversion, we further propose a spectrum differential loss for the spectral modeling. The experimental results demonstrate that: (1) the RNN-based spectral modeling achieves higher accuracy with a faster convergence rate and better generalization than the DNN-/DMDN-based models; (2) the RNN-based spectral modeling is also capable of producing a less oversmoothed spectral trajectory; (3) the use of the proposed spectrum differential loss improves the performance in same-gender conversions; and (4) the proposed post-conversion on converted features for the WN vocoder in VC yields the best performance in both naturalness and speaker similarity compared to the conventional use of the WN vocoder.

DOI:10.1017/ATSIP.2020.24

I. INTRODUCTION
II. COMPARISON TO PREVIOUS WORK
III. SPECTRAL CONVERSION MODELS WITH NN-BASED ARCHITECTURES
IV. WAVEFORM GENERATION MODELS WITH WN VOCODER
V. EXPERIMENTAL EVALUATION
VI. CONCLUSION
