Image-Text Retrieval via Green Explainable Multi-modal Alignment (GEMMA)

Tsung-Shan Yang, University of Southern California, USA, tsungsha@usc.edu; Yun-Cheng Wang, University of Southern California, USA; Chengwei Wei, University of Southern California, USA; Suya You, DEVCOM Army Research Laboratory, USA; C.-C. Jay Kuo, University of Southern California, USA
 
Suggested Citation
Tsung-Shan Yang, Yun-Cheng Wang, Chengwei Wei, Suya You and C.-C. Jay Kuo (2025), "Image-Text Retrieval via Green Explainable Multi-modal Alignment (GEMMA)", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 4, e300. http://dx.doi.org/10.1561/116.20250009

Publication Date: 28 Oct 2025
© 2025 T.-S. Yang, Y.-C. Wang, C. Wei, S. You and C.-C. J. Kuo
 
Subjects
Image and video retrieval, Multimodal signal processing, Image and video processing, Information retrieval
 
Keywords
Image-text retrieval, multi-modal alignment, green learning, image understanding
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Related Work 
Proposed GEMMA Method 
Experiments 
Conclusion and Future Work 
Acknowledgments 
References 

Abstract

Image-text retrieval is a fundamental task in image understanding: given a query in one modality, an image or a text, the algorithm fetches the most relevant counterpart in the other modality. Large visual-language models are trained on paired image and text data to extract joint representations. However, they are computationally expensive and offer little explainability as to how data from the different modalities are aligned. To this end, we propose an efficient, stage-wise alignment of image and text representations, called Green Explainable Multi-Modal Alignment (GEMMA). GEMMA is computationally efficient, reducing the number of trainable parameters to 3% of that needed to fine-tune the full image and text encoders. Its intermediate clustering results demonstrate the explainability of the alignment mechanism in our model. Experiments show that GEMMA outperforms state-of-the-art retrieval models on text-to-image and image-to-text retrieval tasks on the Flickr30k and MS-COCO datasets. GEMMA also generalizes to unseen image-text pairs produced by separately pre-trained visual and text encoders.
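As a concrete illustration of the retrieval setup described in the abstract, the following is a minimal sketch of text-to-image retrieval by cosine similarity in a shared embedding space. It is not GEMMA's alignment pipeline: the embeddings are random placeholders standing in for the outputs of pre-trained (and aligned) encoders, and the embedding dimension and top-k cutoff are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # illustrative shared-embedding dimension, not GEMMA's actual size

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder gallery: embeddings of 1,000 candidate images. In practice these
# would come from a frozen visual encoder mapped into the aligned space.
image_embeddings = l2_normalize(rng.standard_normal((1000, DIM)))

# Placeholder query: the embedding of one text caption from a text encoder.
text_embedding = l2_normalize(rng.standard_normal((1, DIM)))

# Text-to-image retrieval: rank gallery images by cosine similarity and
# return the indices of the top-5 matches.
scores = (image_embeddings @ text_embedding.T).ravel()  # shape (1000,)
top_k = np.argsort(-scores)[:5]
print("Top-5 retrieved image indices:", top_k)
```

Image-to-text retrieval is symmetric: swap the roles of the gallery and the query so that caption embeddings are ranked against a single image embedding.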

DOI:10.1561/116.20250009

Companion

APSIPA Transactions on Signal and Information Processing Special Issue - Invited Papers from APSIPA ASC 2024
See the other articles that are part of this special issue.