APSIPA Transactions on Signal and Information Processing > Vol 14 > Issue 1

Audio Difference Learning Framework for Audio Captioning

Tatsuya Komatsu, LY Corporation, Japan and Nagoya University, Japan, komatsu.tatsuya@lycorp.co.jp; Kazuya Takeda, Nagoya University, Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Tatsuya Komatsu, Kazuya Takeda and Tomoki Toda (2025), "Audio Difference Learning Framework for Audio Captioning", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e34. http://dx.doi.org/10.1561/116.20250021

Publication Date: 20 Nov 2025
© 2025 T. Komatsu, K. Takeda and T. Toda
 
Subjects
Audio signal processing,  Speech and spoken language processing,  Multimodal signal processing,  Statistical/Machine learning,  Pattern recognition and learning,  Deep learning
 
Keywords
Audio captioning, audio difference captioning, audio difference learning
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Related Works 
Audio Captioning 
Proposed Method: Audio Difference Learning 
Experimental Evaluations 
Conclusion 
References 

Abstract

This paper proposes a novel learning method for audio captioning, which we call Audio Difference Learning. The core idea is to construct a feature space where differences between two audio inputs are explicitly represented as feature differences. This method has two main components. First, we introduce a diff block, which is placed between the audio encoder and the text decoder. The diff block computes the difference between the features of an input audio clip and an additional reference audio clip. The text decoder then generates text descriptions based on the difference features. Second, we use a mixture of the original input audio and the reference audio as a new input to eliminate the need for explicit difference annotations. The diff block then calculates the difference between the mixed audio's embeddings and those of the reference audio. This difference embedding effectively cancels out the reference audio, leaving only information from the original audio input. Consequently, the model can learn to caption this difference using the original input audio's caption, thus removing the need for additional difference annotations. In experiments conducted using the Clotho and ESC50 datasets, the proposed method achieved an 8% improvement in the SPIDEr score compared to conventional methods.

DOI:10.1561/116.20250021
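
The training recipe described in the abstract can be summarized in a short sketch. The following is a minimal, PyTorch-style illustration assuming generic encoder and decoder modules; the class and method names (AudioDiffCaptioner, diff_block), the additive waveform mixing, and the plain-subtraction difference are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of audio difference learning, assuming generic
# `encoder` (audio -> embedding) and `decoder` (embedding + tokens -> logits)
# modules supplied by the user. All names here are hypothetical.
import torch
import torch.nn as nn


class AudioDiffCaptioner(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # audio encoder producing embeddings
        self.decoder = decoder  # text decoder conditioned on an embedding

    def diff_block(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # Difference of embeddings; shown here as plain subtraction,
        # though the diff block could be parameterized differently.
        return emb_a - emb_b

    def forward(self,
                input_audio: torch.Tensor,
                reference_audio: torch.Tensor,
                caption_tokens: torch.Tensor) -> torch.Tensor:
        # Mix the original input with the reference audio so that no explicit
        # difference annotation is needed: the original input's caption serves
        # as the target for the difference between the mix and the reference.
        mixed_audio = input_audio + reference_audio

        emb_mix = self.encoder(mixed_audio)
        emb_ref = self.encoder(reference_audio)

        # The reference content is (approximately) cancelled out, leaving an
        # embedding dominated by the original input audio.
        diff_emb = self.diff_block(emb_mix, emb_ref)

        # Decode a caption from the difference embedding; the training loss
        # (e.g., cross-entropy against caption_tokens) is the usual
        # captioning objective, so no difference-specific labels are used.
        return self.decoder(diff_emb, caption_tokens)
```

Under this reading, training requires only ordinary audio-caption pairs plus sampled reference clips; at inference, the same diff block can presumably be fed two genuinely different clips to describe what distinguishes them.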