
Understanding CLIP

The Breakthrough AI Model for Vision and Language Tasks


Authors: Sérgio Moisés Macarringue, Nilesh Barla

Published on: May 11, 2023


Introduction to Transferable Visual Models

Transferable Visual models are deep learning models that have been trained on one set of visual data and then transferred to another set of data to perform a specific task. These models learn transferable visual features that can be used to improve the performance of the target task, even if the target data is significantly different from the source data.


There are several types of Transferable Visual models, including:

  1. Convolutional Neural Networks (CNNs): CNNs are a type of deep learning model that is widely used for image classification tasks. They learn a hierarchy of increasingly abstract features that can be transferred to other tasks.

  2. Recurrent Neural Networks (RNNs): RNNs are a type of deep learning model that can be used for sequential data, such as video or text. They can be trained on a source domain and transferred to a target domain to perform tasks such as video classification or language translation.

  3. Generative Adversarial Networks (GANs): GANs are a type of deep learning model that can be used for generative tasks, such as image or video synthesis. They can be trained on a source domain and transferred to a target domain to generate new images or videos that are similar to the target domain.

  4. Siamese Networks: Siamese networks are a type of deep learning model that can be used for similarity or distance metric learning. They can be trained on a source domain and transferred to a target domain to perform face recognition or object tracking tasks.


By using Transferable Visual models, deep learning practitioners can benefit from the transferability of learned features, enabling them to improve performance and efficiency on a variety of visual tasks.


Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a type of deep learning model that are widely used for image and video processing tasks. They are inspired by the structure and function of the visual cortex in animals, which contains neurons that respond to specific features of visual stimuli.


Source: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00444-


The basic building block of a CNN is a convolutional layer, which applies a set of filters to an input image to extract features at different scales and orientations. The filters are learned through backpropagation, where the model adjusts the filter weights to minimize the error between the predicted and actual output.


In addition to convolutional layers, CNNs also typically include pooling layers, which downsample the feature maps to reduce their dimensionality, and fully connected layers, which combine the features extracted from the convolutional and pooling layers to make a prediction or classification.
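
To make this concrete, here is a minimal PyTorch sketch of that convolution, pooling, and fully connected pipeline. The channel counts, kernel sizes, input resolution, and the 10-class output are illustrative assumptions rather than any specific published architecture.

import torch
import torch.nn as nn

# Minimal CNN: convolution -> pooling -> fully connected.
# All sizes below are illustrative choices, not a real model.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn 16 filters over an RGB input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample feature maps by 2x
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)            # combine the spatial features into one vector
        return self.classifier(x)

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # e.g. a CIFAR-sized image
print(logits.shape)  # torch.Size([1, 10])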


CNNs are particularly effective for image and video processing tasks because they are able to learn hierarchical representations of the input data. The lower layers of the network learn low-level features such as edges and corners, while the higher layers learn more complex and abstract features such as object parts and textures.


CNNs have achieved state-of-the-art performance on a wide range of image and video processing tasks, including image classification, object detection, segmentation, and generation. They have many potential applications in areas such as healthcare, robotics, and autonomous vehicles.


Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a type of deep learning model that are widely used for sequential data processing tasks, such as language modelling, speech recognition, and text generation. Unlike feedforward neural networks, which process a fixed-length input, RNNs are able to process sequences of arbitrary length, making them particularly well-suited for tasks involving variable-length input.


Recurrent neural network. (2023, May 7). In Wikipedia. https://en.wikipedia.org/wiki/Recurrent_neural_network

The basic building block of an RNN is a recurrent layer, which takes as input a sequence of vectors and produces a sequence of hidden states. The hidden state at each time step is computed based on the input vector at that time step and the previous hidden state, using a set of learned weights.
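
As a rough illustration, the snippet below runs a single recurrent layer over a short sequence using PyTorch's nn.RNN; the input size, hidden size, and sequence length are arbitrary assumptions made for the example.

import torch
import torch.nn as nn

# One recurrent layer: the hidden state at each time step is computed from the
# current input vector and the previous hidden state, using shared weights.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(1, 5, 8)        # a sequence of 5 input vectors
h0 = torch.zeros(1, 1, 16)      # initial hidden state
outputs, h_last = rnn(x, h0)    # outputs: the hidden state at every time step

print(outputs.shape)  # torch.Size([1, 5, 16])
# Swapping nn.RNN for nn.LSTM or nn.GRU (discussed below) follows the same pattern,
# except that an LSTM also carries a separate cell state.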


RNNs are particularly effective for sequence modelling tasks because they are able to capture long-term dependencies in the input data. The hidden state at each time step contains information about all previous time steps, allowing the network to remember important features of the input sequence even when they occur far apart in time.


One of the challenges with RNNs is the vanishing gradient problem, which can occur when the network is trained using backpropagation through time. This problem arises because the gradient of the loss function with respect to the weights of the network can become very small as it is propagated through many time steps, making it difficult to update the weights effectively.


To address this problem, several variants of RNNs have been developed, including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which incorporate mechanisms to selectively retain and forget information in the hidden state. These variants have been shown to be more effective than traditional RNNs for many sequence modeling tasks.


RNNs have achieved state-of-the-art performance on a wide range of sequential data processing tasks, including language translation, sentiment analysis, and speech recognition. They have many potential applications in areas such as natural language processing, speech recognition, and video analysis.


Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) are a type of deep learning model that are widely used for generating synthetic data, such as images, music, and text. GANs consist of two neural networks: a generator network that generates synthetic data, and a discriminator network that evaluates the authenticity of the generated data.


Aggarwal, A., Mittal, M., & Battineni, G. (2021). Generative adversarial network: An overview of theory and applications. International Journal of Information Management Data Insights, 1(1), 100004. https://doi.org/10.1016/j.jjimei.2020.100004

The generator network takes as input a random noise vector and produces a synthetic output, such as an image. The discriminator network takes as input the synthetic output and a real input, such as an image from a dataset, and predicts which input is real and which is fake. During training, the generator network is trained to generate synthetic outputs that are able to fool the discriminator network into thinking that they are real, while the discriminator network is trained to correctly distinguish between real and fake inputs.
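
The following is a minimal sketch of one training step of that adversarial game in PyTorch. The network sizes, the flattened 28x28 image shape, and the random "real" batch are placeholders for illustration only.

import torch
import torch.nn as nn

# Generator: noise -> fake image; Discriminator: image -> real/fake logit.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)      # stand-in for a batch of real images
noise = torch.randn(32, 64)

# Discriminator step: label real images 1 and generated images 0.
fake = G(noise).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
g_loss = bce(D(G(noise)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()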


One of the key benefits of GANs is their ability to generate highly realistic and diverse synthetic data. By training the generator network to mimic the distribution of the real data, GANs can produce synthetic outputs that are indistinguishable from real data to a human observer.


GANs have many potential applications in areas such as image and video generation, data augmentation, and data privacy. They have been used to generate photorealistic images of faces, animals, and landscapes, and to generate music and text that closely mimic human-created content.


One of the challenges with GANs is that they can be difficult to train, as the generator and discriminator networks can become stuck in a "cat-and-mouse" game where the generator network is unable to produce realistic outputs, and the discriminator network becomes too good at identifying fake data. To address this problem, several variants of GANs have been developed, including Conditional GANs (CGANs), Wasserstein GANs (WGANs), and StyleGANs, which incorporate additional constraints or modifications to the original GAN architecture.


Siamese Networks

Siamese Networks are a type of neural network architecture that are commonly used for tasks such as similarity matching and verification. Siamese Networks consist of two or more identical neural networks, which share the same weights and are trained to learn representations of input data.


Siamese Networks take as input two or more data points and produce a similarity score between them. For example, in the case of image similarity, the network takes as input two images and produces a similarity score between them, indicating how similar the two images are. The network is trained using a loss function that encourages the network to produce high similarity scores for pairs of similar data points, and low similarity scores for pairs of dissimilar data points.
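
Below is a small illustrative sketch of this idea in PyTorch: a shared encoder processes both inputs, and a contrastive loss pulls matching pairs together while pushing non-matching pairs apart. The encoder, feature sizes, and margin value are assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Both inputs pass through the *same* encoder (shared weights); the distance
# between the two embeddings is the similarity signal. The encoder is a placeholder MLP.
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

def contrastive_loss(x1, x2, same_label, margin=1.0):
    """Pull embeddings of matching pairs together, push non-matching pairs apart."""
    d = F.pairwise_distance(encoder(x1), encoder(x2))
    return (same_label * d.pow(2) +
            (1 - same_label) * torch.clamp(margin - d, min=0).pow(2)).mean()

a, b = torch.randn(16, 128), torch.randn(16, 128)
labels = torch.randint(0, 2, (16,)).float()   # 1 = same identity, 0 = different
print(contrastive_loss(a, b, labels))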


The key advantage of Siamese Networks is their ability to learn discriminative representations of input data, which can be used to measure the similarity between pairs of data points. By sharing the weights between the two or more networks, Siamese Networks are able to learn representations that are invariant to changes in input data, such as changes in lighting or background.


Siamese Networks have been applied to a wide range of tasks, including image matching, face verification, text similarity, and speech recognition. They have also been used in areas such as information retrieval, recommendation systems, and fraud detection.


One of the challenges with Siamese Networks is that they can be computationally expensive to train, as the network must be trained on a large number of pairs of data points to learn effective representations. Additionally, the performance of Siamese Networks is highly dependent on the quality and diversity of the training data, as well as the choice of network architecture and hyperparameters.



Natural language supervision

Natural language supervision refers to the use of natural language text as a form of supervision for machine learning models. This type of supervision is typically used in tasks that require understanding and processing natural languages, such as text classification, sentiment analysis, and language translation.


Natural language supervision can take many forms, such as:

  1. Human-labeled data: In this form of supervision, humans manually label text data with the correct class or sentiment, providing a labeled dataset for machine learning models to learn from.

  2. Rule-based systems: Rule-based systems use pre-defined rules to classify text based on certain patterns or keywords. These systems can be used to generate labeled datasets for machine learning models to learn from.

  3. Active learning: Active learning involves using human feedback to guide the training of a machine learning model. In this approach, the model selects the most informative examples for humans to label and uses this labeled data to improve its performance.

  4. Weak supervision: Weak supervision involves using noisy or incomplete labels to train a machine learning model. This approach is useful when labeled data is scarce or expensive, and can involve using heuristics or external knowledge sources to generate labels (see the short sketch after this list).
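
As a tiny illustration of rule-based weak supervision, the sketch below uses keyword heuristics to assign noisy sentiment labels to unlabeled sentences; the keyword lists and example sentences are invented purely for demonstration.

# Keyword heuristics acting as noisy, "weak" labels for sentiment.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"terrible", "hate", "awful"}

def weak_label(text: str):
    words = set(text.lower().split())
    if words & POSITIVE and not words & NEGATIVE:
        return "positive"
    if words & NEGATIVE and not words & POSITIVE:
        return "negative"
    return None  # abstain when the heuristic is unsure

unlabeled = ["I love this phone", "The battery is terrible", "It arrived on Tuesday"]
dataset = [(t, weak_label(t)) for t in unlabeled if weak_label(t) is not None]
print(dataset)  # noisy labels a downstream classifier could be trained on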

Natural language supervision is a powerful technique for training machine learning models to understand and process natural language. It allows models to learn from large amounts of naturally occurring text without requiring extensive manual annotation, and can improve performance on a wide range of natural language tasks.


CLIP: Learning Transferable Visual Models from Natural Language Supervision


CLIP is an image classification model created by OpenAI (the company behind GPT-3) and released in February 2021. Traditional classification models identify objects from a predefined list of categories; in the ImageNet challenge, for example, there are approximately 1,000 categories. CLIP, by contrast, was trained on roughly 400 million web-sourced images and their associated text, and can identify objects from arbitrary categories without the need for additional training.


CLIP stands for "Contrastive Language-Image Pre-training". It is a state-of-the-art deep learning model that combines natural language processing and computer vision to understand and reason about images and text together.


CLIP is a transformer-based model pre-trained on a large amount of paired text and image data. It is trained with a contrastive learning approach: it learns to associate images and text that are semantically related while contrasting them with unrelated image-text pairs. Although this training method is expensive, it allows the model to learn the relationships between concepts in images and text and to use that understanding for a wide range of tasks, such as image classification, object detection, and image generation.


Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ArXiv.
CLIP pre-trains an image encoder and a text encoder to determine which images were paired with which captions in its dataset. This behavior is then used to turn CLIP into a zero-shot classifier: each class in a dataset is converted into a caption, such as "a picture of a dog," and CLIP predicts which caption best matches a given image.

One of the key features of CLIP is its ability to perform zero-shot learning, meaning that it can recognize and classify objects or concepts that it has never seen before, simply by being given a natural language prompt describing the object or concept. This is made possible by the model's ability to understand and reason about natural language descriptions and connect them to visual features in images.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ArXiv.

CLIP has achieved strong zero-shot performance on several benchmark image classification tasks, and its ability to understand and reason about images and text together has many potential applications in areas such as image search, medicine, recommendation systems, and natural-language image editing.


Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ArXiv.

From the graph above, we can see that in terms of zero-shot ImageNet accuracy, CLIP's contrastive objective outperforms image-to-caption language models, and that it is roughly 10x more efficient to train than those models.

As mentioned above, the effectiveness of CLIP comes from its contrastive learning approach, in which a neural network is jointly trained on both images and text.


Specifically, CLIP is trained to associate images and their corresponding captions, by learning to maximize the similarity score between each image and its associated text description, while minimizing the similarity score between the image and all other captions, and between the caption and all other images.
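
The snippet below sketches this symmetric contrastive objective in PyTorch, in the spirit of the pseudocode in the CLIP paper. The random 512-dimensional embeddings stand in for real encoder outputs, and the fixed temperature value is an assumption (in CLIP it is a learned parameter).

import torch
import torch.nn.functional as F

# Symmetric contrastive loss over a batch of N image-text pairs.
N, d = 8, 512
image_emb = F.normalize(torch.randn(N, d), dim=-1)   # placeholder image-encoder output, L2-normalized
text_emb = F.normalize(torch.randn(N, d), dim=-1)    # placeholder text-encoder output, L2-normalized

logit_scale = torch.tensor(100.0)                    # fixed temperature here; learned in the real model
logits = logit_scale * image_emb @ text_emb.T        # (N, N) pairwise similarity scores

targets = torch.arange(N)                            # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +           # image -> text direction
        F.cross_entropy(logits.T, targets)) / 2      # text -> image direction
print(loss)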

CLIP uses a multi-modal architecture with two encoders that can process text and image inputs: a transformer-based text encoder and an image encoder (a ResNet or a Vision Transformer). The model is pre-trained on a large corpus of image-text pairs using a self-supervised learning approach, in which it learns to match each image with its associated caption, and each caption with its associated image, without any manually annotated labels.


Once the model has been pre-trained, it can be fine-tuned on a wide range of downstream tasks, such as image classification, object detection, image captioning, natural language inference, and question answering.


During fine-tuning, the model is adapted to a specific task by adding a small task-specific layer on top of the pre-trained model and training the entire network on a labeled dataset for the target task.
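
A minimal sketch of this pattern is shown below: a frozen pre-trained encoder (here a stand-in linear layer in place of, say, CLIP's image encoder) with a small trainable classification head on top. The feature dimensions, class count, and batch are placeholders.

import torch
import torch.nn as nn

# Freeze a pre-trained encoder and train only a small task-specific head on top.
pretrained_encoder = nn.Linear(2048, 512)            # stand-in for a real pre-trained encoder
for p in pretrained_encoder.parameters():
    p.requires_grad = False                          # keep the pre-trained weights frozen

head = nn.Linear(512, 5)                             # new task-specific layer
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x, y = torch.randn(32, 2048), torch.randint(0, 5, (32,))   # placeholder labeled batch
with torch.no_grad():
    features = pretrained_encoder(x)
loss = nn.functional.cross_entropy(head(features), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()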


CLIP has shown strong performance across a wide range of benchmarks; notably, its zero-shot classifier matches the accuracy of the original ResNet-50 on ImageNet without using any of ImageNet's 1.28 million labeled training examples. Its success is attributed to its ability to learn rich multi-modal representations of images and text, which can be used to perform a wide range of complex tasks.


Multi-modal transformer architectures have been used in a wide range of tasks, such as image captioning, visual question answering, and image classification. They have been shown to be effective at capturing the complex interactions between different modalities and to learn rich representations that can be used to perform a wide range of natural language understanding and computer vision tasks.


Akbari, H., Yuan, L., Qian, R., Chuang, W., Chang, S., Cui, Y., & Gong, B. (2021). VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. ArXiv. /abs/2104.11178

The image above is an example of how a transformer architecture can process multiple input modalities together. The representations (features) produced by the different transformer modules are pulled close together by minimizing a common loss.


open-CLIP: Is the code open-sourced?

OpenAI has released the pre-trained model weights and inference code, but not the training code or the 400-million-pair dataset used to train CLIP. The released package lets us load the model and experiment with it.


To use CLIP, install it with the following commands:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git


Once installed, you can start exploring it with the following code.

import torch 
import clip

from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"


OpenAI provides two families of image-encoder architectures, ResNet and ViT. The table below gives details of the models and the hyperparameters used to train them.


Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ArXiv. /abs/2103.00020

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ArXiv. /abs/2103.00020

This example uses "ViT-B/32"

model, preprocess = clip.load("ViT-B/32", device=device)  

Once the model is downloaded, you need to preprocess the image and tokenize the candidate text prompts, then encode both. Essentially, you provide one prompt that actually describes the image (the true positive) while the rest act as distractors. After computing the similarities, the model should assign the highest probability to the correct sentence.


# Preprocess the image and tokenize the candidate captions
# (replace "CLIP.png" with the path to your own image).
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
# prints: [[0.9927937  0.00421068 0.00299572]]

But the question remains: how can we evaluate the authenticity of the technology at hand?


To answer that, a group of researchers and engineers rebuilt CLIP from scratch based on the published literature. They named it open-CLIP. They replicated the architecture and training procedure and, interestingly, obtained similar results, which they published in the paper "Reproducible scaling laws for contrastive language-image learning".


open-CLIP produced the following zero-shot classification results.

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., & Jitsev, J. (2022). Reproducible scaling laws for contrastive language-image learning. ArXiv. /abs/2212.07143

Here are some important links pertaining to open-CLIP:

  1. Different models are available here.

  2. Open-sourced open-CLIP is available here.

Similar to CLIP, the pre-trained open-CLIP models can be installed and used as follows:

!pip install open_clip_torch

Usage:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

The encoding step is similar to what we saw with OpenAI's CLIP: preprocess the image and tokenize the candidate captions.

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

The authors' example makes the feature-extraction steps explicit, unlike the original snippet: the image and the text are encoded separately, both embeddings are L2-normalized, their similarity is computed with a matrix multiplication, and the softmax function turns the similarities into probability scores.

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]

Applications of CLIP

CLIP, although still at an experimental stage, has a lot to offer. In medicine, for instance, it could be trained on a radiology dataset in which every scan is paired with its report. At inference time, the system could then retrieve the report text that best matches a new scan. It could also be used in telemedicine to support the diagnostic process.


In art and design, CLIP-style models have been explored for generating high-quality images from an input text prompt. Text-to-image applications such as DALL·E, Midjourney, and Adobe Firefly are a few examples of this direction.


The same idea has been extended to music generation. At the recent Google I/O event, Sundar Pichai announced MusicLM, a model that can generate a wide variety of music from a text prompt, building on a CLIP-style joint embedding of music and text.



Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., & Frank, C. (2023). MusicLM: Generating Music From Text. ArXiv. /abs/2301.11325

CLIP can be used for content moderation as well. It can leverage text descriptions to classify the nature of an image, and any image that does not comply with community guidelines can be flagged.


In e-commerce and image search engines, CLIP can be used to improve product recommendations and search quality by better matching images to a given text query.


The possibilities are endless; it is only a matter of exploring and understanding the relationship between images and text.


Ethical Concerns

With a complex and sophisticated system, concerns about authenticity and ethics inevitably arise. Why? Because a system like this requires huge amounts of data, and that data can contain content that is inappropriate, private, or sensitive to a community. Sometimes the work of authors or artists is scraped without their knowledge and used to train these systems. To summarize, here are some of the main ethical concerns:

  1. Bias: If the training data contains biases, i.e. an inclination toward particular kinds of information, the AI model will likely learn and reproduce those biases. Such biases can include imbalances across gender, racial, or cultural samples. For instance, if a model was trained largely on images from Western countries, it might not perform as well when presented with images that reflect the customs, people, or languages of non-Western countries.

  2. Fairness: Related to bias is the issue of fairness. If the AI model doesn't perform equally well for all groups of people, it could unfairly disadvantage certain groups. This is a significant concern in applications like hiring or loan approval, where AI models are sometimes used to screen candidates or applicants.

  3. Privacy: Systems such as CLIP can potentially be used to infer sensitive information from images or text, which raises privacy concerns. For instance, a model might be able to infer someone's race, gender, or age from an image, even if that information wasn't explicitly provided.

  4. Transparency: AI models, particularly deep learning models like CLIP, are often seen as "black boxes" because their internal workings are hard to understand. This lack of transparency can make it difficult to understand why a model made a particular decision, which can be a problem in contexts where explanations are required or desirable.

  5. Misuse: Like any technology, AI models can be misused. For instance, a model like CLIP could potentially be used to generate misleading or harmful content, such as deepfakes or propaganda.

To mitigate these and other ethical concerns, it's important to use best practices in AI ethics when developing and deploying models like CLIP. This could include things like conducting bias audits, being transparent about the limitations of the model, and implementing robust privacy protections. Moreover, organizations should foster a culture of ethical awareness and responsibility, and should engage with external stakeholders, including the public, to ensure a broad range of perspectives are considered.


Conclusion

Systems like CLIP show how two different domains can be brought together in a common embedding space, combining natural language understanding with image interpretation. The main innovation of CLIP is the use of a contrastive learning strategy to jointly train a neural network on both images and text. Specifically, CLIP is trained to associate images with their corresponding captions by learning to maximize the similarity score between each image and its associated text description, and to minimize the similarity score between the image and all other captions as well as between the caption and all other images.


CLIP uses a multi-modal architecture with separate text and image encoders, allowing it to handle both kinds of input. Using a self-supervised learning strategy, the model is pre-trained on a sizable corpus of image-text pairs, which allows it to learn which caption goes with each image and which image goes with each caption without manually annotated labels. After pre-training, the model can be applied or fine-tuned for a variety of downstream tasks, including question answering, object detection, image captioning, image classification, and natural language inference.


Development is still ongoing, and new methods and approaches are continually being introduced to enhance the quality and accuracy of the results it yields.


Reference

  • CLIP: Connecting text and images [Blog]

  • Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., & Jitsev, J. (2022). Reproducible scaling laws for contrastive language-image learning. ArXiv. /abs/2212.07143

  • Akbari, H., Yuan, L., Qian, R., Chuang, W., Chang, S., Cui, Y., & Gong, B. (2021). VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. ArXiv. /abs/2104.11178

  • Bechmann, A., & Bowker, G. C. (2019). Unsupervised by any other name: Hidden layers of knowledge production in artificial intelligence on social media. Big Data & Society. https://doi.org/10.1177/2053951718819569

  • Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., & Frank, C. (2023). MusicLM: Generating Music From Text. ArXiv. /abs/2301.11325

