
Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference

Takehiro Imamura, Nagoya University, Japan, imamura.takehiro@g.sp.m.is.nagoya-u.ac.jp; Yuka Hashizume, Nagoya University, Japan; Wen-Chin Huang, Nagoya University, Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Takehiro Imamura, Yuka Hashizume, Wen-Chin Huang and Tomoki Toda (2025), "Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 4, e303. http://dx.doi.org/10.1561/116.20250016

Publication Date: 28 Oct 2025
© 2025 T. Imamura, Y. Hashizume, W.-C. Huang and T. Toda
 
Subjects
Deep learning,  Classification and prediction,  Information extraction
 
Keywords
Music information retrieval, music similarity representation, music source separation
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Related Works 
Proposed InMSRL Methods Leveraging Multi-task Learning and Human Preference 
Experimental Evaluations 
Conclusion 
References 

Abstract

This paper proposes music similarity representation learning (MSRL) based on individual instruments (InMSRL) that utilizes music source separation (MSS) and human preference without requiring clean instrument stems during inference. We propose three methods that effectively improve performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach, which sequentially performs MSS and music similarity feature extraction. E2E-FT allows the model to minimize the adverse effects of separation errors on feature extraction. Second, we propose multi-task learning for the Direct approach, which extracts disentangled music similarity features directly with a single feature extractor. This multi-task learning combines disentangled music similarity feature extraction with MSS performed by reconstruction from the disentangled features, further enhancing instrument feature disentanglement. Third, we employ perception-aware fine-tuning (PAFT), which utilizes human preference so that the model performs InMSRL aligned with human perceptual similarity. Experimental evaluations demonstrate that 1) E2E-FT for Cascade significantly improves objective InMSRL performance, 2) multi-task learning for Direct also improves disentanglement performance in feature extraction, 3) PAFT significantly enhances perceptual InMSRL performance, and 4) Cascade with E2E-FT and PAFT outperforms Direct with multi-task learning and PAFT.
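As a rough illustration of the Cascade approach described above, the sketch below chains a toy source-separation stage into a per-instrument similarity-feature extractor and compares two mixtures with cosine similarity. All names, shapes, and the random "models" are illustrative assumptions, not the authors' implementation; in the paper both stages would be trained networks fine-tuned end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def separate(mixture, n_instruments=4):
    """Toy MSS stage: split a mixture spectrogram into per-instrument
    estimates via soft masks (placeholder for a trained separator)."""
    masks = rng.random((n_instruments,) + mixture.shape)
    masks /= masks.sum(axis=0, keepdims=True)  # masks sum to 1 per bin
    return masks * mixture                     # (n_instruments, freq, time)

def extract_features(stem, proj):
    """Toy similarity-feature extractor: time-pooled spectrogram
    projected to an embedding (placeholder for a trained encoder)."""
    pooled = stem.mean(axis=-1)           # average over time frames
    emb = proj @ pooled                   # linear projection to feature space
    return emb / np.linalg.norm(emb)      # unit norm for cosine similarity

def instrument_similarity(mix_a, mix_b, proj, inst=0):
    """Cascade InMSRL: separate each mixture, then compare the two
    embeddings of one chosen instrument stem."""
    ea = extract_features(separate(mix_a)[inst], proj)
    eb = extract_features(separate(mix_b)[inst], proj)
    return float(ea @ eb)

freq, time, dim = 64, 100, 16
proj = rng.standard_normal((dim, freq))          # stand-in encoder weights
mix_a = np.abs(rng.standard_normal((freq, time)))  # stand-in spectrograms
mix_b = np.abs(rng.standard_normal((freq, time)))
s = instrument_similarity(mix_a, mix_b, proj)
print(s)
```

E2E-FT would backpropagate a similarity objective (e.g. a triplet or preference-ranking loss, the latter corresponding to PAFT) through both stages jointly, so the separator learns to make errors that matter less for the downstream embedding.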

DOI:10.1561/116.20250016

Companion

APSIPA Transactions on Signal and Information Processing Special Issue - Invited Papers from APSIPA ASC 2024