APSIPA Transactions on Signal and Information Processing > Vol 7 > Issue 1

A technical framework for automatic perceptual evaluation of singing quality

Chitralekha Gupta, NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore, chitralekha@u.nus.edu; Haizhou Li, Electrical and Computer Engineering Department, National University of Singapore, Singapore; Ye Wang, NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore
 
Suggested Citation
Chitralekha Gupta, Haizhou Li and Ye Wang (2018), "A technical framework for automatic perceptual evaluation of singing quality", APSIPA Transactions on Signal and Information Processing: Vol. 7: No. 1, e10. http://dx.doi.org/10.1017/ATSIP.2018.10

Publication Date: 14 Sep 2018
© 2018 Chitralekha Gupta, Haizhou Li and Ye Wang
 
 
Keywords
Singing Vocal; Perceptual Evaluation of Singing Quality; Automatic Evaluation; Human Perception
 


Open Access

This is published under the terms of the Creative Commons Attribution licence.


In this article:
I. INTRODUCTION 
II. FRAMEWORK OF EVALUATION 
III. CHARACTERIZATION OF SINGING QUALITY 
IV. EXPERIMENTS 
V. CONCLUSIONS 

Abstract

Human experts evaluate singing quality based on many perceptual parameters such as intonation, rhythm, and vibrato, with reference to music theory. We previously proposed the Perceptual Evaluation of Singing Quality (PESnQ) framework, which combines acoustic features related to these perceptual parameters with the cognitive modeling concept of the telecommunication standard Perceptual Evaluation of Speech Quality to evaluate singing quality. In this study, we extend the PESnQ framework to approximate human judgments. First, we find that a linear combination of the human scores for the individual perceptual parameters can predict the human judgment of overall singing quality. This provides us with a human parametric judgment equation. Next, the individual perceptual parameter scores predicted from the PESnQ acoustic features show a high correlation with the respective human scores, which enables more meaningful feedback to learners. Finally, we compare early fusion and late fusion of the acoustic features in predicting the overall human scores, and find that late fusion outperforms early fusion. This work underlines the importance of modeling human perception in automatic singing quality assessment.
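
The distinction between early and late fusion drawn in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the article: early fusion is read as one regression from all concatenated acoustic features to the overall score, late fusion is read as one regression per perceptual parameter followed by a linear combination of the predicted parameter scores, ordinary least squares stands in for whatever regression method the authors used, and all function names, feature dimensions, and data are hypothetical and synthetic.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least-squares fit of y ~ X @ w + b; returns (w, b)."""
    A = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def predict_linear(model, X):
    w, b = model
    return X @ w + b

def early_fusion_fit(feature_groups, overall):
    """Assumed early fusion: concatenate all features, fit one model."""
    return fit_linear(np.hstack(feature_groups), overall)

def early_fusion_predict(model, feature_groups):
    return predict_linear(model, np.hstack(feature_groups))

def late_fusion_fit(feature_groups, param_scores, overall):
    """Assumed late fusion: one model per perceptual parameter, then a
    linear combination of the predicted parameter scores."""
    stage1 = [fit_linear(F, param_scores[:, k])
              for k, F in enumerate(feature_groups)]
    preds = np.column_stack([predict_linear(m, F)
                             for m, F in zip(stage1, feature_groups)])
    stage2 = fit_linear(preds, overall)  # analogue of the judgment equation
    return stage1, stage2

def late_fusion_predict(models, feature_groups):
    stage1, stage2 = models
    preds = np.column_stack([predict_linear(m, F)
                             for m, F in zip(stage1, feature_groups)])
    return predict_linear(stage2, preds)

# Synthetic, noise-free demo data (hypothetical, not from the article):
# three parameters (e.g. intonation, rhythm, vibrato), two features each.
rng = np.random.default_rng(0)
groups = [rng.normal(size=(40, 2)) for _ in range(3)]
param_scores = np.column_stack([F @ rng.normal(size=2) for F in groups])
overall = param_scores @ np.array([0.5, 0.3, 0.2])

early_model = early_fusion_fit(groups, overall)
late_models = late_fusion_fit(groups, param_scores, overall)
print("stage-2 weights:", late_models[1][0])
```

On noise-free synthetic data both variants fit the overall score equally well, so this sketch says nothing about which method performs better; it only makes the two pipelines concrete. The late-fusion variant additionally yields per-parameter score predictions, which is the property the abstract associates with more meaningful feedback to learners.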

DOI:10.1017/ATSIP.2018.10