Image-text retrieval is a fundamental task in image understanding: given a query in one modality, the algorithm retrieves the most relevant counterpart in the other. Large visual-language models are trained on paired image and text data to extract joint representations. However, they are computationally expensive and offer little insight into how data from different modalities are aligned. To this end, we propose an efficient, stage-wise alignment of image and text representations, called the Green Explainable Multi-Modal Alignment (GEMMA). GEMMA is computationally efficient, reducing trainable parameters to 3% of those required to fine-tune the full image and text encoders. The intermediate clustering results demonstrate the explainability of the alignment mechanism in our model. Experiments show that GEMMA outperforms state-of-the-art retrieval models on text-to-image and image-to-text retrieval tasks on the Flickr30k and MS-COCO datasets. GEMMA also generalizes to unseen image-text pairs produced by separately pre-trained visual and text encoders.