Labeling large image datasets with attributes such as facial age or object type is tedious and sometimes infeasible. Supervised machine learning methods provide a highly accurate solution, but require manual labels which are often unavailable. Zero-shot models (e.g., CLIP) do not require manual labels but are not as accurate as supervised ones, particularly when the attribute is numeric. We propose a new approach, CLIPPR (CLIP with Priors), which adapts zero-shot models for regression and classification on unlabeled datasets. Our method does not use any annotated images. Instead, we assume a prior over the label distribution in the dataset. We then train an adapter network on top of CLIP under two competing objectives: (i) minimal change of predictions from the original CLIP model, and (ii) minimal distance between the predicted and prior distributions of labels. Additionally, we present a novel approach for selecting prompts for Vision & Language models using a distributional prior. Our method is effective and presents a significant improvement over the original model. We demonstrate an improvement of 28% in mean absolute error on the UTK age regression task. We also present promising results for classification benchmarks, improving the classification accuracy on the ImageNet dataset by 2.83%, without using any labels.
We train an adapter module on top of a frozen Vision & Language (V&L) model image encoder, with two competing objectives: (i) predicting labels close to the original V&L model's zero-shot predictions, and (ii) predicting a label distribution similar to the given prior distribution. Together, these two objectives adapt the original zero-shot predictions to the distributional prior, resulting in better performance.
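To make the two objectives concrete, here is a minimal PyTorch sketch for the classification case. The names (Adapter, clippr_style_loss, lam), the MLP architecture, and the use of KL divergence for both terms are illustrative assumptions, not the exact losses used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    # Small MLP head trained on top of frozen CLIP image embeddings (assumed architecture).
    def __init__(self, embed_dim, num_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, image_features):
        return self.net(image_features)  # class logits

def clippr_style_loss(adapter_logits, clip_probs, prior_probs, lam=1.0):
    # Objective (i): stay close to the frozen CLIP zero-shot predictions
    # (per-sample KL between adapter predictions and CLIP pseudo-labels).
    log_p = F.log_softmax(adapter_logits, dim=-1)
    loss_zero_shot = F.kl_div(log_p, clip_probs, reduction="batchmean")

    # Objective (ii): make the batch-averaged predicted label distribution
    # match the assumed prior over labels.
    batch_marginal = log_p.exp().mean(dim=0)
    loss_prior = F.kl_div(batch_marginal.log(), prior_probs, reduction="sum")

    # lam trades off fidelity to CLIP against fidelity to the prior.
    return loss_zero_shot + lam * loss_prior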
Our experimental results show a clear improvement over the CLIP zero-shot predictions, with substantial gains in both regression and classification tasks.
We present a novel method for selecting effective task-specific prompts. Given a set of candidate prompts, we extract CLIP zero-shot predictions for all images in the dataset under each prompt. We then compare the prompts by the Wasserstein distance between (i) the predicted distribution of labels and (ii) the prior distribution of labels. We use this criterion to predict the zero-shot accuracy, both for CLIP and for our method.
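As an illustration of this criterion, here is a minimal sketch for a numeric attribute such as age, using the 1D Wasserstein distance from SciPy. The function name select_prompt and the input format are illustrative assumptions.

from scipy.stats import wasserstein_distance

def select_prompt(per_prompt_predictions, prior_samples):
    # per_prompt_predictions: dict mapping each candidate prompt to a 1D
    # array of zero-shot predicted labels (e.g., ages) for every image.
    # prior_samples: 1D array of label values drawn from the prior.
    distances = {
        prompt: wasserstein_distance(preds, prior_samples)
        for prompt, preds in per_prompt_predictions.items()
    }
    # Select the prompt whose predicted label distribution is closest
    # to the prior (smallest Wasserstein distance).
    best_prompt = min(distances, key=distances.get)
    return best_prompt, distances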
Our method relies on an estimate of the prior distribution of labels, which may be inaccurate. We therefore want our method to be robust to inaccuracies in the prior distribution. Even with highly inaccurate priors, our method achieves a significant performance improvement over CLIP.
@article{kahana2022clippr,
title={Improving Zero-Shot Models with Label Distribution Priors},
author={Kahana, Jonathan and Cohen, Niv and Hoshen, Yedid},
journal={arXiv preprint arXiv:2212.00784},
year={2022}
}