The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset that is ten times larger than existing ones. Pre-training on LaSCo shows a noteworthy improvement in performance, even in zero-shot settings. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks such as FashionIQ and CIRR.
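The paper's exact redundancy-analysis procedure is not reproduced on this page; the snippet below is only a minimal sketch of one way to probe single-modality redundancy. It assumes a hypothetical embedding model exposing `encode_text` / `encode_image`, and counts how often a query's target image is already retrieved in the top-k using one modality alone:

```python
import torch

@torch.no_grad()
def single_modality_hit_rates(queries, corpus_emb, encoder, k=10):
    """For each query, check whether text alone or image alone already
    retrieves the target among the top-k corpus images. A high hit rate
    suggests the other modality is redundant for those queries.

    queries:    iterable of {'text': str, 'image': Tensor, 'target_idx': int}
    corpus_emb: (N, d) tensor of corpus image embeddings (rows normalized)
    encoder:    hypothetical model with encode_text / encode_image -> (d,)
    """
    hits = {"text_only": 0, "image_only": 0}
    for q in queries:
        embs = {
            "text_only": encoder.encode_text(q["text"]),
            "image_only": encoder.encode_image(q["image"]),
        }
        for key, e in embs.items():
            scores = corpus_emb @ e  # cosine similarity for normalized rows
            if q["target_idx"] in scores.topk(k).indices:
                hits[key] += 1
    n = len(queries)
    return {key: h / n for key, h in hits.items()}
```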
In this study, we introduce a new large-scale dataset for CoIR, dubbed LaSCo (Large Scale Composed Image Retrieval dataset). To construct it with minimal human effort, we employ a simple and effective methodology that rephrases labels from the existing large-scale VQA 2.0 dataset into a form suited for CoIR. LaSCo covers an open and broad domain of natural images and rich text. Compared to CIRR, it has 10× more queries, 2× more unique tokens, and 17× more corpus images. LaSCo also exhibits a significantly smaller bias towards a single modality for retrieval. Furthermore, pre-training our CASE model on LaSCo boosts performance on the CIRR dataset, even in zero-shot settings.
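To illustrate the construction idea, here is a minimal sketch of the data transformation, assuming VQA 2.0's complementary image pairs (the same question annotated with different answers on two images) and a hypothetical `rephrase()` helper that turns a question-answer pair into a declarative modification text; the paper's actual rephrasing pipeline is not reproduced here:

```python
def build_lasco_triplets(vqa_pairs, rephrase):
    """Turn VQA 2.0 complementary pairs into CoIR triplets of the form
    (query image, modification text, target image).

    vqa_pairs: iterable of dicts with a shared 'question' and two
               (image, answer) annotations that differ in answer.
    rephrase:  hypothetical callable mapping (question, answer) to a
               declarative sentence, e.g. via a prompted language model.
    """
    triplets = []
    for p in vqa_pairs:
        q = p["question"]
        (img_a, ans_a), (img_b, ans_b) = p["annotations"]
        # Each direction yields one query: start from one image and ask
        # for the state described by the *other* image's answer.
        triplets.append((img_a, rephrase(q, ans_b), img_b))
        triplets.append((img_b, rephrase(q, ans_a), img_a))
    return triplets

# Example: the question "How many dogs are there?" with answers "1" and
# "2" could yield the modification text "There are two dogs." for the
# query that starts from the one-dog image and targets the two-dog image.
```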
We introduce a new approach for image retrieval with composed vision-language queries (CoIR), named the Cross-Attention driven Shift Encoder (CASE). The CASE architecture consists of two transformer components. The first is our shift encoder, based on an image-grounded text encoder: a BERT encoder with additional intermediate cross-attention layers that model vision-language interactions. The second component is a ViT encoder. The ViT divides an input image into patches and encodes them as a sequence of features, with an additional [CLS] token representing the global image feature. Following this concept, we refer to these encodings as image tokens. The image tokens are fed into our cross-attention layers, allowing interaction between the lingual and visual branches. The output, a bi-modality conditioned sequence (text on image and image on text), is then pooled into a single vector and projected to a 256-D latent space.
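To make the architecture concrete, below is a minimal PyTorch sketch of the query encoder described above. It is an illustration under simplifying assumptions (layer count, dimensions, and [CLS]-style pooling are placeholders), not the authors' implementation; the actual model builds on pretrained BERT and ViT weights:

```python
import torch
import torch.nn as nn

class ShiftEncoderLayer(nn.Module):
    """One transformer layer: self-attention over text tokens plus
    cross-attention to image tokens (a stand-in for a BERT layer
    augmented with intermediate cross-attention)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, txt, img_tokens):
        h = self.n1(txt)
        txt = txt + self.self_attn(h, h, h)[0]
        # Text queries attend to ViT image tokens: early fusion of modalities.
        txt = txt + self.cross_attn(self.n2(txt), img_tokens, img_tokens)[0]
        return txt + self.ffn(self.n3(txt))

class CASE(nn.Module):
    """Sketch of the CASE query encoder: ViT image tokens condition a
    text encoder via cross-attention; the fused sequence is pooled and
    projected into a 256-D retrieval space."""
    def __init__(self, vocab=30522, dim=768, layers=12, out_dim=256):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(ShiftEncoderLayer(dim) for _ in range(layers))
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, token_ids, img_tokens):
        x = self.txt_emb(token_ids)            # (B, T, dim) text tokens
        for blk in self.blocks:
            x = blk(x, img_tokens)             # img_tokens: (B, N, dim) from a ViT
        pooled = x[:, 0]                       # [CLS]-style pooling of the fused sequence
        return nn.functional.normalize(self.proj(pooled), dim=-1)  # 256-D query embedding
```

Retrieval then reduces to nearest-neighbor search between this 256-D query embedding and similarly projected corpus image embeddings.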
@article{Levy_Ben-Ari_Darshan_Lischinski_2024,
  title   = {Data Roaming and Quality Assessment for Composed Image Retrieval},
  author  = {Levy, Matan and Ben-Ari, Rami and Darshan, Nir and Lischinski, Dani},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume  = {38},
  number  = {4},
  pages   = {2991-2999},
  year    = {2024},
  month   = {Mar.},
  url     = {https://ojs.aaai.org/index.php/AAAI/article/view/28081},
  doi     = {10.1609/aaai.v38i4.28081}
}
This webpage code was adapted from this source code.