Data Roaming and Early Fusion for Composed Image Retrieval

Matan Levy¹, Rami Ben-Ari², Nir Darshan², Dani Lischinski¹
¹The Hebrew University of Jerusalem, Israel
²OriginAI, Israel

We study the task of Composed Image Retrieval (CoIR), where a query is composed of two modalities, image and text, extending the user's expression ability. Previous methods typically address this task by encoding each query modality separately, followed by late fusion of the extracted features. In this paper, we propose a new approach, Cross-Attention driven Shift Encoder (CASE), which employs early fusion between modalities through a cross-attention module with an additional auxiliary task. We show that our method outperforms the existing state of the art on established benchmarks (FashionIQ and CIRR) by a large margin. However, CoIR datasets are a few orders of magnitude smaller than other vision and language (V&L) datasets, and some suffer from serious flaws (e.g., queries with a redundant modality). We address these shortcomings by introducing Large Scale Composed Image Retrieval (LaSCo), a new CoIR dataset 10 times larger than current ones. Pre-training on LaSCo yields a further performance boost. We further suggest a new analysis of CoIR datasets and methods for detecting modality redundancy or necessity in queries.

LaSCo Dataset

In this study, we introduce a new large scale dataset for CoIR, dubbed LaSCo (Large Scale Composed Image Retrieval dataset). To construct it with minimal human effort, we employ a simple and effective methodology to rephrase labels from an existing large scale VQA dataset, VQA 2.0, into a form suited for CoIR. LaSCo contains an open and broad domain of natural images and rich text. Compared to CIRR, it has ×10 more queries, ×2 more unique tokens and ×17 more corpus images. LaSCo further shows a significantly smaller bias towards a single modality for retrieval. Furthermore, pre-training our CASE model on LaSCo boosts performance on the CIRR dataset, even in the zero-shot setting.
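
To make the idea of turning VQA labels into CoIR samples more concrete, here is a minimal sketch. It is not the LaSCo pipeline: the page does not describe the rephrasing procedure, so the pairing rule (two images that share a question but differ in answer) and the fixed text template in `rephrase` are assumptions made purely for illustration, as are all class, field and function names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VQAEntry:
    """One VQA-style annotation (hypothetical field names)."""
    image_id: str
    question: str
    answer: str


@dataclass(frozen=True)
class CoIRTriplet:
    """A CoIR sample: query image + text that together point to a target image."""
    query_image: str
    transition_text: str
    target_image: str


def rephrase(question: str, target_answer: str) -> str:
    # Assumption: a fixed template stands in for the rephrasing step,
    # whose actual form is not given on this page.
    return f"Change the image so that the answer to '{question}' is '{target_answer}'."


def make_triplet(source: VQAEntry, target: VQAEntry) -> Optional[CoIRTriplet]:
    """Build a triplet only if both entries share a question but differ in answer."""
    if source.question != target.question or source.answer == target.answer:
        return None
    text = rephrase(target.question, target.answer)
    return CoIRTriplet(source.image_id, text, target.image_id)


# Toy annotations, invented for this example.
a = VQAEntry("img_001", "What color is the bus?", "red")
b = VQAEntry("img_002", "What color is the bus?", "blue")
triplet = make_triplet(a, b)
print(triplet)
```

The point of the sketch is only the output shape: every sample ends up as a (query image, text, target image) triplet, which is the form a CoIR model trains on.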

Retrieval Examples

Composed Image Retrieval (CoIR)

In Composed Image Retrieval (CoIR), the provided query is composed of two modalities, image and text, extending the user’s expression ability.
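
The retrieval step itself can be sketched in a few lines. The example below assumes the composed query and every corpus image have already been embedded into a shared vector space, and that candidates are ranked by cosine similarity. The embeddings, image names and the `recall_at_k` helper are made up for illustration and are not taken from our code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_corpus(query_vec, corpus):
    """Return corpus image ids sorted by decreasing similarity to the query."""
    return sorted(corpus, key=lambda image_id: cosine(query_vec, corpus[image_id]), reverse=True)


def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the target image is among the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


# Hypothetical 3-D embeddings; a real system would use much larger vectors.
corpus = {
    "dog_on_grass": [0.9, 0.1, 0.0],
    "dog_on_beach": [0.7, 0.0, 0.7],
    "cat_on_sofa": [0.0, 1.0, 0.1],
}

# Stand-in for the embedding of a composed query such as
# (image of a dog on grass, text "move the dog to a beach").
query_vec = [0.6, 0.0, 0.8]

ranked = rank_corpus(query_vec, corpus)
print(ranked)
print(recall_at_k(ranked, "dog_on_beach", k=1))
```

Averaging `recall_at_k` over all queries gives a Recall@K score, the kind of metric commonly reported for retrieval benchmarks.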

CoIR task



CASE Method

We introduce a new approach for image retrieval with composed vision-language queries (CoIR), named Cross-Attention driven Shift Encoder (CASE). The CASE architecture consists of two transformer components. The first is our shift encoder, based on an image-grounded text encoder. It is a BERT encoder with additional intermediate cross-attention layers that model vision-language interactions. The second component is a ViT encoder. ViT divides an input image into patches and encodes them as a sequence of features, with an additional [CLS] token to represent the global image feature. Following this concept, we refer to these encodings as image tokens. The image tokens are then fed into our cross-attention layers, allowing interaction between the language and visual branches. The output, a bi-modality conditioned sequence (text on image and image on text), is then pooled to a single vector and projected to a 256-dimensional latent space.
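
The toy sketch below illustrates the early-fusion idea only: text tokens act as queries that attend over image tokens, and the fused sequence is pooled and projected. It is not the CASE implementation. It uses a single attention head with no learned query/key/value weights, mean pooling (the pooling method is an assumption), and a 2-D output instead of 256-D so the numbers stay readable. All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, image_tokens):
    """Each text token (query) attends over all image tokens (keys and values)."""
    dim = len(text_tokens[0])
    fused = []
    for q in text_tokens:
        # Scaled dot-product scores between this text token and every image token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in image_tokens]
        weights = softmax(scores)
        # Weighted sum of image tokens: the visual context for this text token.
        attended = [sum(w * v[i] for w, v in zip(weights, image_tokens)) for i in range(dim)]
        # Residual connection keeps the original text information.
        fused.append([a + b for a, b in zip(q, attended)])
    return fused


def mean_pool(tokens):
    """Average a token sequence into a single vector (assumed pooling choice)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def project(vec, weight):
    """Linear projection; `weight` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


# Toy 4-D tokens; in the real model these come from BERT and ViT encoders.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_tokens = [
    [1.0, 0.0, 0.0, 0.0],  # plays the role of the [CLS] image token
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

fused = cross_attention(text_tokens, image_tokens)
pooled = mean_pool(fused)
# Hypothetical fixed projection to 2-D (the page states a 256-D latent space).
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
query_embedding = project(pooled, weight)
print(query_embedding)
```

The contrast with late fusion is that the text representation is already conditioned on the image before pooling, rather than combining two independently pooled vectors afterwards.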

Citation - BibTeX

  title={Data Roaming and Early Fusion for Composed Image Retrieval},
  author={Levy, Matan and Ben-Ari, Rami and Darshan, Nir and Lischinski, Dani},
  journal={arXiv preprint arXiv:2303.09429},
