APSIPA Transactions on Signal and Information Processing > Vol 13 > Issue 5

End-to-End Singing Transcription Based on CTC and HSMM Decoding with a Refined Score Representation

Tengyu Deng, Kyoto University, Japan; Eita Nakamura, Kyushu University, Japan, nakamura@inf.kyushu-u.ac.jp; Ryo Nishikimi, NTT Communication Science Laboratories, Japan; Kazuyoshi Yoshii, Kyoto University, Japan
 
Suggested Citation
Tengyu Deng, Eita Nakamura, Ryo Nishikimi and Kazuyoshi Yoshii (2024), "End-to-End Singing Transcription Based on CTC and HSMM Decoding with a Refined Score Representation", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 5, e404. http://dx.doi.org/10.1561/116.20240016

Publication Date: 07 Oct 2024
© 2024 T. Deng, E. Nakamura, R. Nishikimi and K. Yoshii
 
Subjects
Deep learning, Audio signal processing, Statistical/Machine learning
 

Open Access

This is published under the terms of CC BY-NC.

In this article:
Introduction 
Musical Score Representations 
Related Work 
Proposed Method 
Evaluation 
Conclusion and Discussion 
References 

Abstract

This paper describes an end-to-end automatic singing transcription (AST) method that translates a music audio signal containing a vocal part into a symbolic musical score of the sung notes. A common approach to sequence-to-sequence learning for this problem is to use connectionist temporal classification (CTC), where a target score is represented as a sequence of notes with discrete pitches and note values. However, if the value of even one note is estimated incorrectly, the score times of all following notes are also estimated incorrectly and the metrical structure of the estimated score collapses. To solve this problem, we propose a refined score representation that uses the metrical positions of note onsets. To decode a musical score from the output of a deep neural network (DNN), we use a hidden semi-Markov model (HSMM) that incorporates prior knowledge about musical scores and temporal fluctuation in human performance. We show that the proposed method achieves state-of-the-art performance and confirm the efficacy of both the refined score representation and the decoding method.
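The error-propagation argument in the abstract can be illustrated with a toy example. The sketch below is not the representation or decoder used in the paper; it assumes a 4/4 bar, a five-note melody, and a hypothetical rule that a new bar starts whenever the in-bar position does not increase. All function and variable names are made up for illustration. It compares how one wrong symbol affects the recovered onset times under a note-value encoding and under a metrical-position encoding.

```python
from fractions import Fraction

BAR = Fraction(4)  # assumed bar length in beats (4/4 time)


def onsets_from_note_values(values):
    """Accumulate note values (in beats) into absolute onset times."""
    onsets, t = [], Fraction(0)
    for v in values:
        onsets.append(t)
        t += v
    return onsets


def onsets_from_metrical_positions(positions, bar=BAR):
    """Recover absolute onsets from in-bar onset positions.

    Assumption for this toy: a new bar begins whenever the position
    does not increase relative to the previous note.
    """
    onsets, bar_idx, prev = [], 0, None
    for p in positions:
        if prev is not None and p <= prev:
            bar_idx += 1
        onsets.append(bar_idx * bar + p)
        prev = p
    return onsets


def count_wrong(estimated, reference):
    """Number of notes whose onset time differs from the reference."""
    return sum(e != r for e, r in zip(estimated, reference))


# Ground truth: note values in beats, and the onsets they imply.
truth_values = [Fraction(1), Fraction(1), Fraction(2), Fraction(1), Fraction(3)]
truth_onsets = onsets_from_note_values(truth_values)
truth_positions = [t % BAR for t in truth_onsets]

# Introduce a single wrong symbol (the second one) in each encoding.
bad_values = list(truth_values)
bad_values[1] = Fraction(1, 2)
bad_positions = list(truth_positions)
bad_positions[1] = Fraction(3, 2)

errors_note_value = count_wrong(onsets_from_note_values(bad_values), truth_onsets)
errors_metrical = count_wrong(onsets_from_metrical_positions(bad_positions), truth_onsets)

print("wrong onsets, note-value encoding:", errors_note_value)
print("wrong onsets, metrical-position encoding:", errors_metrical)
```

In this toy, the single note-value error shifts every later onset, while the single metrical-position error stays confined to the note it belongs to.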

DOI:10.1561/116.20240016

Companion

APSIPA Transactions on Signal and Information Processing Special Issue - Invited Papers from APSIPA ASC 2023