APSIPA Transactions on Signal and Information Processing > Vol 14 > Issue 1

Learning Separated Representations for Instrument-based Music Similarity

Yuka Hashizume, Nagoya University, Japan, hashizume.yuuka@g.sp.m.is.nagoya-u.ac.jp; Li Li, Nagoya University, Japan; Atsushi Miyashita, Nagoya University, Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Yuka Hashizume, Li Li, Atsushi Miyashita and Tomoki Toda (2025), "Learning Separated Representations for Instrument-based Music Similarity", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e16. http://dx.doi.org/10.1561/116.20250013

Publication Date: 15 Jul 2025
© 2025 Y. Hashizume, L. Li, A. Miyashita and T. Toda
 
Subjects
Deep learning, Audio signal processing
 
Keywords
Music similarity, music information retrieval, music recommendation, representation learning
 

Open Access

This is published under the terms of CC BY-NC.

Abstract

A flexible recommendation and retrieval system needs music similarity defined over multiple partial elements of a musical piece, so that users can select the element they want to focus on. Learning music similarity with multiple networks, each taking an individual instrumental signal as input, is effective but has two practical problems: using clean instrumental signals as queries is impractical for retrieval systems, and using separated instrumental signals instead reduces accuracy owing to artifacts. In this paper, we present instrumental-part-based music similarity learning with a single network that takes mixed signals as input instead of individual instrumental signals. Specifically, we design a single similarity embedding space with a separate subspace for each instrument; the subspaces are extracted by Conditional Similarity Networks, which are trained using the triplet loss with masks. Experimental results showed that (1) in the evaluation of an instrument that had low accuracy, the proposed method obtains more accurate embedding representations than individual networks that take separated signals as input, (2) each sub-embedding space retains the characteristics of the corresponding instrument, and (3) the similar musical pieces selected by the proposed method when focusing on each instrumental sound are accepted by human listeners, especially when focusing on timbre.
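The idea of a triplet loss with masks over per-instrument subspaces can be illustrated with a minimal sketch. It is not the authors' implementation: the instrument names, the subspace size, the margin, and the function names below are all assumptions made for illustration, and the embeddings are plain Python lists rather than network outputs.

```python
import math

# Hypothetical layout (not from the paper): a 6-dim embedding split into
# three 2-dim subspaces, one per instrument.
INSTRUMENTS = ["drums", "bass", "piano"]
SUB_DIM = 2


def make_mask(instrument):
    """Return a 0/1 mask that keeps only the subspace of `instrument`."""
    idx = INSTRUMENTS.index(instrument)
    mask = [0.0] * (SUB_DIM * len(INSTRUMENTS))
    for i in range(idx * SUB_DIM, (idx + 1) * SUB_DIM):
        mask[i] = 1.0
    return mask


def masked_distance(x, y, mask):
    """Euclidean distance over the dimensions the mask keeps."""
    return math.sqrt(sum(m * (a - b) ** 2 for a, b, m in zip(x, y, mask)))


def masked_triplet_loss(anchor, positive, negative, mask, margin=0.2):
    """Hinge-style triplet loss restricted to one instrument's subspace."""
    d_pos = masked_distance(anchor, positive, mask)
    d_neg = masked_distance(anchor, negative, mask)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings of three mixed signals (values chosen by hand).
anchor = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
positive = [0.0, 0.0, 0.1, 0.0, 5.0, 5.0]
negative = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

bass_loss = masked_triplet_loss(anchor, positive, negative, make_mask("bass"))
piano_loss = masked_triplet_loss(anchor, positive, negative, make_mask("piano"))
```

In this toy case the same triplet is already satisfied in the assumed bass subspace (zero loss) but violated in the assumed piano subspace, which shows how a mask lets one shared embedding express different similarity judgments depending on the instrument in focus.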

DOI:10.1561/116.20250013