APSIPA Transactions on Signal and Information Processing > Vol 14 > Issue 1

Audio Difference Learning Framework for Audio Captioning

Tatsuya Komatsu, LY Corporation, Japan and Nagoya University, Japan, komatsu.tatsuya@lycorp.co.jp; Kazuya Takeda, Nagoya University, Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Tatsuya Komatsu, Kazuya Takeda and Tomoki Toda (2025), "Audio Difference Learning Framework for Audio Captioning", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e34. http://dx.doi.org/10.1561/116.20250021

Publication Date: 20 Nov 2025
© 2025 T. Komatsu, K. Takeda and T. Toda
 
Subjects
Audio signal processing,  Speech and spoken language processing,  Multimodal signal processing,  Statistical/Machine learning,  Pattern recognition and learning,  Deep learning
 
Keywords
Audio captioning, audio difference captioning, audio difference learning
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Related Works 
Audio Captioning 
Proposed Method: Audio Difference Learning 
Experimental Evaluations 
Conclusion 
References 

Abstract

This paper proposes a novel learning method for audio captioning, which we call Audio Difference Learning. The core idea is to construct a feature space where differences between two audio inputs are explicitly represented as feature differences. This method has two main components. First, we introduce a diff block, which is placed between the audio encoder and the text decoder. The diff block computes the difference between the features of an input audio clip and an additional reference audio clip. The text decoder then generates text descriptions based on the difference features. Second, we use a mixture of the original input audio and the reference audio as a new input to eliminate the need for explicit difference annotations. The diff block then calculates the difference between the mixed audio's embeddings and those of the reference audio. This difference embedding effectively cancels out the reference audio, leaving only information from the original audio input. Consequently, the model can learn to caption this difference using the original input audio's caption, thus removing the need for additional difference annotations. In experiments conducted using the Clotho and ESC50 datasets, the proposed method achieved an 8% improvement in the SPIDEr score compared to conventional methods.

DOI:10.1561/116.20250021
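
The training recipe described in the abstract can be summarized in a short sketch. The following is a minimal, PyTorch-style illustration assuming generic encoder and decoder modules; the class and method names (AudioDiffCaptioner, diff_block), the additive waveform mixing, and the plain-subtraction difference are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of audio difference learning, assuming generic
# `encoder` (audio -> embedding) and `decoder` (embedding + tokens -> logits)
# modules supplied by the user. All names here are hypothetical.
import torch
import torch.nn as nn


class AudioDiffCaptioner(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # audio encoder producing embeddings
        self.decoder = decoder  # text decoder conditioned on an embedding

    def diff_block(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # Difference of embeddings; shown here as plain subtraction,
        # though the diff block could be parameterized differently.
        return emb_a - emb_b

    def forward(self,
                input_audio: torch.Tensor,
                reference_audio: torch.Tensor,
                caption_tokens: torch.Tensor) -> torch.Tensor:
        # Mix the original input with the reference audio so that no explicit
        # difference annotation is needed: the original input's caption serves
        # as the target for the difference between the mix and the reference.
        mixed_audio = input_audio + reference_audio

        emb_mix = self.encoder(mixed_audio)
        emb_ref = self.encoder(reference_audio)

        # The reference content is (approximately) cancelled out, leaving an
        # embedding dominated by the original input audio.
        diff_emb = self.diff_block(emb_mix, emb_ref)

        # Decode a caption from the difference embedding; the training loss
        # (e.g., cross-entropy against caption_tokens) is the usual
        # captioning objective, so no difference-specific labels are used.
        return self.decoder(diff_emb, caption_tokens)
```

Under this reading, training requires only ordinary audio-caption pairs plus sampled reference clips; at inference, the same diff block can presumably be fed two genuinely different clips to describe what distinguishes them.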