CLIPPR: Improving Zero-Shot Models with Label Distribution Priors

School of Computer Science and Engineering
The Hebrew University of Jerusalem, Israel

Video: a short overview of our method. We train an adapter over a pre-trained CLIP model that adjusts its predictions according to a given estimate of the label distribution.

Abstract

Labeling large image datasets with attributes such as facial age or object type is tedious and sometimes infeasible. Supervised machine learning methods provide a highly accurate solution, but require manual labels which are often unavailable. Zero-shot models (e.g., CLIP) do not require manual labels but are not as accurate as supervised ones, particularly when the attribute is numeric. We propose a new approach, CLIPPR (CLIP with Priors), which adapts zero-shot models for regression and classification on unlabeled datasets. Our method does not use any annotated images. Instead, we assume a prior over the label distribution in the dataset. We then train an adapter network on top of CLIP under two competing objectives: (i) minimal change of predictions relative to the original CLIP model, and (ii) minimal distance between the predicted and prior label distributions. Additionally, we present a novel approach for selecting prompts for Vision & Language models using a distributional prior. Our method is effective and significantly improves over the original model. We demonstrate a 28% reduction in mean absolute error on the UTK age regression task. We also present promising results for classification benchmarks, improving the classification accuracy on the ImageNet dataset by 2.83%, without using any labels.

Our Method

We train an adapter module on top of a frozen Vision & Language (V&L) model image encoder, with two competing objectives: (i) predicting labels close to the original V&L model's zero-shot predictions, and (ii) predicting a label distribution similar to the given prior distribution. Together, these two objectives adapt the original zero-shot predictions to the distributional prior, resulting in better performance.
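To make the two objectives concrete, here is a minimal PyTorch sketch for the regression case. The adapter architecture, the L1 fidelity term, the batch-wise sorted-sample Wasserstein estimate, and all names (Adapter, clippr_regression_loss, lambda_prior) are illustrative assumptions, not necessarily the paper's exact implementation or hyperparameters.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Small MLP head trained on top of frozen CLIP image features.
        (Hypothetical architecture; the paper's adapter may differ.)"""
        def __init__(self, dim=512, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
            )

        def forward(self, features):
            # features: frozen CLIP image embeddings, shape (batch, dim)
            return self.net(features).squeeze(-1)

    def clippr_regression_loss(pred, clip_pred, prior_samples, lambda_prior=1.0):
        """Two competing objectives:
        (i)  stay close to CLIP's zero-shot labels (L1 fidelity term);
        (ii) match the prior label distribution, estimated here by a
             batch-wise 1-D Wasserstein distance between sorted predictions
             and sorted samples drawn from the prior.
        Assumes prior_samples has the same length as the batch."""
        fidelity = nn.functional.l1_loss(pred, clip_pred)
        w1 = (torch.sort(pred).values - torch.sort(prior_samples).values).abs().mean()
        return fidelity + lambda_prior * w1

At each training step, prior_samples would be drawn i.i.d. from the assumed label prior, so the second term pulls the batch of predictions toward the prior while the first keeps them anchored to CLIP's zero-shot outputs.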

Our experimental results show a clear improvement over CLIP's zero-shot predictions, with substantial gains in both regression and classification tasks.

Figure: summary of our findings.

Automatic Prompt Selection

We present a novel method for selecting effective task-specific prompts. Given a set of candidate prompts, we extract CLIP zero-shot predictions for all images in the dataset under each prompt. We then rank the prompts by the Wasserstein distance between (i) the distribution of predicted labels and (ii) the prior label distribution. This criterion predicts zero-shot accuracy, both for CLIP and for our method.
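A minimal sketch of this selection criterion, assuming the per-prompt zero-shot labels have already been extracted; the function name select_prompt, its arguments, and the use of scipy.stats.wasserstein_distance are illustrative choices rather than the paper's implementation.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def select_prompt(zero_shot_labels_per_prompt, prior_samples):
        """Rank candidate prompts by the 1-D Wasserstein distance between
        the labels CLIP predicts under that prompt and samples from the
        label prior; return the prompt with the smallest distance.

        zero_shot_labels_per_prompt: dict mapping prompt -> array of
            predicted labels over the whole dataset
        prior_samples: array of labels sampled from the prior distribution
        """
        distances = {
            prompt: wasserstein_distance(labels, prior_samples)
            for prompt, labels in zero_shot_labels_per_prompt.items()
        }
        best = min(distances, key=distances.get)
        return best, distances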

Robustness to Prior Inaccuracies

Our method relies on an estimate of the prior label distribution, which may itself be inaccurate, so it should be robust to errors in this estimate. Even with highly inaccurate priors, our method achieves a significant performance improvement over CLIP.
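One simple way to probe this robustness is to corrupt the prior before training, for example by blending it with a uniform distribution. The helper below is a hypothetical evaluation sketch for a discretized prior, not necessarily the perturbation protocol used in the paper.

    import numpy as np

    def corrupt_prior(prior_probs, epsilon):
        """Blend a discretized prior with a uniform distribution to simulate
        an inaccurate prior estimate: epsilon=0 keeps the original prior,
        epsilon=1 replaces it entirely with a uniform one."""
        uniform = np.full_like(prior_probs, 1.0 / len(prior_probs))
        return (1.0 - epsilon) * prior_probs + epsilon * uniform

Sweeping epsilon and re-running training then traces how gracefully performance degrades as the assumed prior drifts from the true label distribution.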

BibTeX

@article{kahana2022clippr,
  title={Improving Zero-Shot Models with Label Distribution Priors},
  author={Kahana, Jonathan and Cohen, Niv and Hoshen, Yedid},
  journal={arXiv preprint arXiv:2212.00784},
  year={2022}
}