Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-supervised Training of Sound Events With Partial Labels

Keisuke Imoto, Kyoto University, Japan, keisuke.imoto@ieee.org
 
Suggested Citation
Keisuke Imoto (2025), "Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-supervised Training of Sound Events With Partial Labels", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e31. http://dx.doi.org/10.1561/116.20250080

Publication Date: 06 Nov 2025
© 2025 K. Imoto
 
Subjects
Audio signal processing
 
Keywords
Acoustic scene classification, partial label, sound event detection
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Conventional Methods 
Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-supervised Approach With Partial Labels of Sound Events 
Evaluation Experiments 
Conclusions 
References 

Abstract

Annotating the time boundaries of sound events is labor-intensive, which limits the scalability of strongly supervised learning in sound event detection. To reduce annotation costs, weakly supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach in which a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. Although they reduce annotation costs, weakly supervised and partial label learning often suffer from degraded detection performance because precise event sets and their temporal annotations are unavailable. To better balance annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and improve model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.
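
To make the idea concrete, the following is a minimal sketch (in PyTorch) of how a shared encoder with a clip-level scene head and a frame-level event head could be trained with scene-derived partial labels: event classes outside the scene's candidate set are penalized as negatives, while candidate events are left for weak, strong, or self-distilled supervision. All names here (SCENE_TO_EVENTS, SceneEventNet, partial_label_bce) and the scene-to-event mapping are hypothetical illustrations under assumed settings, not the paper's implementation.

# Hypothetical sketch of multitask training with scene-derived partial labels;
# not the authors' code.
import torch
import torch.nn as nn

NUM_SCENES, NUM_EVENTS, NUM_FRAMES, NUM_MELS = 3, 10, 100, 64

# Assumed mapping: each acoustic scene constrains which sound events may occur.
SCENE_TO_EVENTS = {
    0: [0, 1, 2, 3],   # e.g. "home"   -> candidate events {0, 1, 2, 3}
    1: [3, 4, 5, 6],   # e.g. "office" -> candidate events {3, 4, 5, 6}
    2: [6, 7, 8, 9],   # e.g. "street" -> candidate events {6, 7, 8, 9}
}

class SceneEventNet(nn.Module):
    """Shared recurrent encoder with a clip-level scene-classification head
    and a frame-level sound-event-detection head."""
    def __init__(self, n_mels=NUM_MELS, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.scene_head = nn.Linear(2 * hidden, NUM_SCENES)
        self.event_head = nn.Linear(2 * hidden, NUM_EVENTS)

    def forward(self, x):                                # x: (batch, frames, mels)
        h, _ = self.encoder(x)                           # (batch, frames, 2*hidden)
        scene_logits = self.scene_head(h.mean(dim=1))    # clip-level prediction
        event_logits = self.event_head(h)                # frame-level prediction
        return scene_logits, event_logits

def partial_label_bce(event_logits, candidate_mask):
    """Partial-label loss: events outside the scene's candidate set are treated
    as negatives at every frame; candidate events are left unsupervised here
    (strong labels or self-distilled targets would supervise them instead)."""
    probs = torch.sigmoid(event_logits)                  # (batch, frames, events)
    neg_mask = 1.0 - candidate_mask.unsqueeze(1)         # (batch, 1, events)
    neg_loss = -(torch.log(1.0 - probs + 1e-7) * neg_mask)
    denom = neg_mask.expand_as(probs).sum().clamp(min=1.0)
    return neg_loss.sum() / denom

# Toy forward/backward pass with random features and scene labels.
model = SceneEventNet()
feats = torch.randn(4, NUM_FRAMES, NUM_MELS)
scenes = torch.randint(0, NUM_SCENES, (4,))
cand = torch.zeros(4, NUM_EVENTS)
for i, s in enumerate(scenes.tolist()):
    cand[i, SCENE_TO_EVENTS[s]] = 1.0                    # scene-derived partial label

scene_logits, event_logits = model(feats)
loss = nn.functional.cross_entropy(scene_logits, scenes) \
     + partial_label_bce(event_logits, cand)
loss.backward()

In the semi-supervised setting described in the abstract, the same event head could additionally receive frame-level supervision on the strongly labeled subset, and self-distillation could replace the unsupervised candidate entries with the model's own soft targets.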

DOI:10.1561/116.20250080