Image-text retrieval is a fundamental task in image understanding: given a query in one modality, the algorithm retrieves the most relevant counterpart in the other. Large visual-language models are trained on paired image and text data to extract joint representations. However, they are computationally expensive and offer little insight into how data from different modalities are aligned. To this end, we propose an efficient, stage-wise alignment of image and text representations, called the Green Explainable Multi-Modal Alignment (GEMMA). GEMMA is computationally efficient, reducing trainable parameters to 3% of those required to fine-tune the full image and text encoders. The intermediate clustering results demonstrate the explainability of the alignment mechanism in our model. Experiments show that GEMMA outperforms state-of-the-art retrieval models on text-to-image and image-to-text retrieval tasks on the Flickr30k and MS-COCO datasets. GEMMA also generalizes to unseen image-text pairs produced by separately pre-trained visual and text encoders.