This paper proposes a novel learning method for audio captioning, which we call Audio Difference Learning. The core idea is to construct a feature space in which the difference between two audio inputs is explicitly represented as a difference of features. The method has two main components. First, we introduce a diff block placed between the audio encoder and the text decoder. The diff block computes the difference between the features of an input audio clip and those of an additional reference audio clip, and the text decoder generates a description from the resulting difference features. Second, to eliminate the need for explicit difference annotations, we use a mixture of the original input audio and the reference audio as a new input. The diff block then computes the difference between the mixed audio's embedding and the reference audio's embedding; this difference effectively cancels out the reference audio, leaving only the information from the original input. The model can therefore be trained to caption this difference using the original input audio's caption, removing the need for additional difference annotations. In experiments on the Clotho and ESC50 datasets, the proposed method improved the SPIDEr score by 8% over conventional methods.
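The following is a minimal sketch of this training scheme, assuming a PyTorch-style audio encoder and text decoder; the names DiffBlock, mixed_input_training_step, and the mixing weight alpha are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DiffBlock(nn.Module):
    """Illustrative diff block: subtracts reference embeddings from input embeddings."""
    def forward(self, input_emb, ref_emb):
        # Difference taken in the encoder feature space; the text decoder
        # conditions on this difference rather than on the raw embedding.
        return input_emb - ref_emb

def mixed_input_training_step(encoder, diff_block, decoder,
                              audio, ref_audio, caption_tokens, alpha=0.5):
    """One hypothetical training step that avoids explicit difference annotations.

    The mixture of `audio` and `ref_audio` is encoded, the reference embedding is
    subtracted by the diff block, and the decoder is trained to reproduce the
    caption of the original `audio`.
    """
    mixed = alpha * audio + (1.0 - alpha) * ref_audio        # mix the two waveforms
    diff_emb = diff_block(encoder(mixed), encoder(ref_audio))
    logits = decoder(diff_emb, caption_tokens)               # teacher-forced decoding
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), caption_tokens.view(-1)
    )
    return loss
```

In this sketch the cancellation argument is visible directly: since the reference embedding is subtracted from the mixed embedding, what remains is (approximately) the contribution of the original input audio, so the original caption serves as a valid target for the difference.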