Contrastive Language-Image Pretraining (CLIP)

Posted on August 23, 2024

5 minute read

Read the CLIP paper the other day and took these high-level notes.

I highly recommend reading the paper yourself if you have the time.

Overview:

  • CLIP is a vision-language model released by OpenAI in 2021 that was a big breakthrough in the field
  • It consists of a text encoder and an image encoder that map text and images into a shared embedding space, so the resulting vectors are semantically meaningful across both modalities

Context

  • The long-standing SOTA approach in computer vision was to train models on hand-labeled datasets like ImageNet. These can be viewed as high-quality labels (strong supervision) without a ton of examples (ImageNet is ~14 million images)
  • Motivation for natural language supervision
    • What had been working really well in NLP at the time (and still today) was pre-training large models on vast swaths of the internet (scale!) with self-supervised objectives like autoregressive next-word prediction or Masked Language Modeling (a.k.a. the cloze task).
      • Despite the potentially dirty labels, the massive scale of the dataset enabled these models to learn deep patterns that empirically are quite transferable to new tasks. This is why GPT can do quite well zero-shot at tasks like machine translation, forms of classification, etc.
    • It's also super nice to be able to interact with a model through text. You don't have to craft/fine-tune specialized heads on top of the model for each new task; you can just write a prompt (see the zero-shot sketch after this list)
    • Finally, training a classifier the traditional way means fixing the set of classes up front, which constrains the amount of useful information the model can give you. Natural language is a medium that can encode rich concepts and allows CLIP to learn "a much wider set of visual concepts"
  • The makers of CLIP wanted to try this type of scale in computer vision.
  • So they made a 400M sample dataset of (image, text) pairs and pre-trained a model (with a contrastive objective I'll get to in a moment) on it
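
To make the "just write a prompt" point concrete, here's a minimal zero-shot classification sketch. It assumes the Hugging Face transformers CLIP wrappers and the openai/clip-vit-base-patch32 checkpoint; the labels and image path are placeholders.

```python
# Zero-shot classification sketch (assumes Hugging Face transformers and the
# openai/clip-vit-base-patch32 checkpoint; labels and image path are placeholders).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds the scaled image-text cosine similarities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

The prompt template matters, too: the paper reports that prompt engineering and ensembling multiple templates noticeably improve zero-shot accuracy.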

Results:

  • CLIP zero-shot generalizes quite well to all sorts of vision-language tasks such as OCR, action-recognition, classification, geo-localization, etc.
    • "competitive with prior task-specific supervised models"
  • They train 8 models spanning roughly 2 orders of magnitude of compute and show that "transfer performance is a smoothly predictable function of compute" (i.e. scaling laws seem to apply here too)
  • They perform a Linear-Probe Representation Learning Analysis: use the image encoder as a frozen feature extractor and train a linear classifier on top of its features with ImageNet labels (sketched after this list)
    • They find that this classifier outperforms the "best publicly available ImageNet model," suggesting that the representations learned are top-notch 👍
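
A rough sketch of that linear-probe setup, assuming the image features have already been extracted with the frozen encoder and using scikit-learn (the paper fits its linear probes with L-BFGS logistic regression; the file names and regularization strength below are placeholders):

```python
# Linear-probe sketch: fit a linear classifier on frozen CLIP image features.
# Assumes features/labels were precomputed and saved; hyperparameters are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train, y_train = np.load("train_feats.npy"), np.load("train_labels.npy")  # placeholder files
X_test, y_test = np.load("test_feats.npy"), np.load("test_labels.npy")      # placeholder files

clf = LogisticRegression(C=1.0, max_iter=1000, solver="lbfgs")  # the paper sweeps C
clf.fit(X_train, y_train)
print("linear-probe accuracy:", clf.score(X_test, y_test))
```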

Training:

  • Training efficiency (the gain in performance per unit of compute) is very important.
    • They initially trained a text transformer and an image CNN jointly to predict the caption of an image. This trains slowly because it tries to predict each caption's exact words.
    • They also looked at generative models that learn image representations, but those require an order of magnitude more compute than contrastive models for the same performance. So instead they use a contrastive learning objective
  • Contrastive Objective: Given a batch of N (image, text) pairs, run the images through the vision encoder and the texts through the text encoder. Push each matching pair's cross-modal encodings together (by maximizing their cosine similarity) and all mismatched pairs apart (by minimizing it).
    • Specifically they use a Symmetric Cross-Entropy Loss: cross entropy over the similarity matrix in both the image→text and text→image directions
  • They also directly optimize the softmax temperature parameter during training, which is neat (clipped so the logits are never scaled by more than 100, keeping training stable); see the sketch after this list
  • Architecture:
    • Image Encoder: ResNet-50 with modifications (they also train Vision Transformer variants)
    • Text Encoder: Transformer with minor modifications
    • Linear projections are learned to map the image & text encodings into the shared multi-modal embedding space
  • Augmentation: random square crop of images
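
A tiny sketch of that learnable temperature. The released CLIP code (an assumption here) parameterizes it in log space; the initialization to the equivalent of 0.07 and the clip at 100 are from the paper.

```python
# Learnable softmax temperature, kept in log space and clamped for stability.
# Initialization/clip values follow the paper; the helper function is illustrative.
import numpy as np
import torch
import torch.nn as nn

logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))  # learned with the rest of the model

def scale_logits(cosine_similarities: torch.Tensor) -> torch.Tensor:
    # exp() recovers the multiplier; clamp prevents scaling logits by more than 100
    return cosine_similarities * logit_scale.exp().clamp(max=100.0)
```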

Related:

  • ALIGN (2021) is a similar vision-language model from Google. They also train at scale with a contrastive learning objective, using a dual-encoder architecture (EfficientNet for images, BERT for text) and an even larger and noisier dataset of 1B (image, alt-text) pairs, achieving similarly great results.

Pseudo-code of CLIP given in the paper:

```python
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
```
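
And here's a runnable PyTorch rendering of the same symmetric loss, just as a sketch: random tensors stand in for the encoder outputs, and the temperature is fixed rather than learned to keep it short.

```python
# Runnable sketch of the symmetric contrastive loss; random tensors stand in
# for the projected, L2-normalized encoder outputs (shapes are illustrative).
import torch
import torch.nn.functional as F

n, d_e = 8, 512                                            # batch size, joint embedding dim
image_embeds = F.normalize(torch.randn(n, d_e), dim=-1)    # stands in for I_e
text_embeds = F.normalize(torch.randn(n, d_e), dim=-1)     # stands in for T_e
logit_scale = torch.tensor(1 / 0.07)                       # np.exp(t) above, fixed here

logits = logit_scale * image_embeds @ text_embeds.T        # [n, n] scaled cosine similarities
labels = torch.arange(n)                                   # matching pairs sit on the diagonal
loss_i = F.cross_entropy(logits, labels)                   # image -> text direction
loss_t = F.cross_entropy(logits.T, labels)                 # text -> image direction
loss = (loss_i + loss_t) / 2
print(loss.item())
```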
