We study the task of Composed Image Retrieval (CoIR), where a query is composed of two modalities, image and text, extending the user's expressive ability. Previous methods typically address this task by encoding each query modality separately, followed by late fusion of the extracted features. In this paper, we propose a new approach, Cross-Attention driven Shift Encoder (CASE), which employs early fusion between modalities through a cross-attention module with an additional auxiliary task. We show that our method outperforms the existing state-of-the-art on established benchmarks (FashionIQ and CIRR) by a large margin. However, CoIR datasets are a few orders of magnitude smaller than other vision and language (V&L) datasets, and some suffer from serious flaws (e.g., queries with a redundant modality). We address these shortcomings by introducing Large Scale Composed Image Retrieval (LaSCo), a new CoIR dataset ten times larger than current ones. Pre-training on LaSCo yields a further performance boost. We further suggest a new analysis of CoIR datasets and methods for detecting modality redundancy or necessity in queries.
In this study, we introduce a new large scale dataset for CoIR, dubbed LaSCo (Large Scale Composed Image Retrieval dataset). To construct it with minimal human effort, we employ a simple and effective methodology to rephrase labels from an existing large scale VQA dataset (VQA 2.0) into a form suited for CoIR. LaSCo contains an open and broad domain of natural images and rich text. Compared to CIRR, it has ×10 more queries, ×2 more unique tokens and ×17 more corpus images. LaSCo also shows a significantly smaller bias towards a single modality for retrieval. Furthermore, pre-training our CASE model on LaSCo boosts performance on the CIRR dataset, even in the zero-shot setting.
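To make the idea of turning VQA labels into CoIR queries concrete, here is a minimal sketch. It is not the actual LaSCo pipeline: it assumes, hypothetically, that the VQA annotations are available as pairs of images that share a question but have different answers, and it uses a fixed text template as a stand-in for whatever rephrasing step is really applied. The names `VqaPair`, `CoirTriplet`, `rephrase` and `to_triplets` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VqaPair:
    """Hypothetical record: two images sharing a question but differing in answer."""
    question: str
    image_a: str
    answer_a: str
    image_b: str
    answer_b: str


@dataclass(frozen=True)
class CoirTriplet:
    """A CoIR sample: the query is (query_image, text); target_image is the label."""
    query_image: str
    text: str
    target_image: str


def rephrase(question: str, target_answer: str) -> str:
    # Placeholder template (an assumption); a real pipeline would
    # produce more natural transition text than this.
    return f"Change the image so that the answer to '{question}' is '{target_answer}'."


def to_triplets(pair: VqaPair) -> list:
    # Each pair yields two triplets, one per retrieval direction.
    return [
        CoirTriplet(pair.image_a, rephrase(pair.question, pair.answer_b), pair.image_b),
        CoirTriplet(pair.image_b, rephrase(pair.question, pair.answer_a), pair.image_a),
    ]


pair = VqaPair("What color is the bus?", "img_001.jpg", "red", "img_002.jpg", "blue")
for triplet in to_triplets(pair):
    print(triplet)
```

Under these assumptions every annotated pair yields two queries without extra labeling, which is one way a conversion of this kind could keep human effort low.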
We introduce a new approach for image retrieval with composed vision-language queries (CoIR), named Cross-Attention driven Shift Encoder (CASE). The CASE architecture consists of two transformer components. The first is our shift-encoder, based on an image-grounded text encoder. It is a BERT encoder with additional intermediate cross-attention layers that model vision-language interactions. The second component is a ViT encoder. ViT divides an input image into patches and encodes them as a sequence of features, with an additional [CLS] token to represent the global image feature. Following this concept, we refer to these encodings as image tokens. The image tokens are then fed into our cross-attention layers, allowing interaction between the lingual and visual branches. The output, a bi-modality conditioned sequence (text on image and image on text), is then pooled to a single vector and projected to a 256D latent space.
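The data flow described above (text tokens attending to image tokens, then pooling and projection to 256 dimensions) can be sketched in plain Python. This is a toy illustration, not the CASE implementation: it uses a single attention head and one layer with random weights, and the mean pooling, the residual connection, the L2 normalization and the model width of 32 are all assumptions chosen for brevity. All function and parameter names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(text_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: text tokens are queries, image tokens are keys/values."""
    q = matmul(text_tokens, w_q)            # (T, d)
    k = matmul(image_tokens, w_k)           # (P, d)
    v = matmul(image_tokens, w_v)           # (P, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]    # (d, P)
    scores = matmul(q, k_t)                 # (T, P)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)           # (T, d)
    # Residual connection (assumed): each text token keeps its own features.
    fused = [[t + a for t, a in zip(t_row, a_row)]
             for t_row, a_row in zip(text_tokens, attended)]
    return fused, weights


def encode_query(text_tokens, image_tokens, params):
    fused, _ = cross_attention(text_tokens, image_tokens,
                               params["w_q"], params["w_k"], params["w_v"])
    # Mean pooling over the fused sequence (an assumption; other pooling is possible).
    pooled = [sum(col) / len(fused) for col in zip(*fused)]
    latent = matmul([pooled], params["w_proj"])[0]   # (256,)
    norm = math.sqrt(sum(x * x for x in latent))
    return [x / norm for x in latent]                # unit length, for cosine similarity


rng = random.Random(0)
d_model, latent_dim = 32, 256
params = {
    "w_q": random_matrix(d_model, d_model, rng),
    "w_k": random_matrix(d_model, d_model, rng),
    "w_v": random_matrix(d_model, d_model, rng),
    "w_proj": random_matrix(d_model, latent_dim, rng),
}
text_tokens = random_matrix(5, d_model, rng)        # stand-in for 5 text token features
image_tokens = random_matrix(1 + 16, d_model, rng)  # stand-in for [CLS] + 16 patch features
query_vec = encode_query(text_tokens, image_tokens, params)
print(len(query_vec))  # 256
```

Because the text tokens see the image tokens inside the encoder, the fusion happens before pooling, in contrast to late fusion, where each modality is first reduced to its own vector and only then combined.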