.. _sec_automm_clip_zeroshot_imgcls:
CLIP in AutoMM - Zero-Shot Image Classification
===============================================
When you want to classify an image into different classes, the standard
approach is to train an image classifier on labeled examples of those
classes. However, collecting training data is tedious, and if the
collected data is too scarce or too imbalanced, you may not end up with
a decent image classifier. So you may wonder: is there a model strong
enough to handle this situation without any training effort?
Actually there is! OpenAI has introduced a model named
`CLIP <https://openai.com/blog/clip/>`__, which can be applied to any
visual classification benchmark by simply providing the names of the
visual categories to be recognized. And its accuracy is high, e.g., CLIP
achieves 76.2% top-1 accuracy on ImageNet without using any of its
1.28M training samples. This matches the performance of the original
supervised ResNet-50 on ImageNet, which is quite promising for a
classification task with 1000 classes!
So in this tutorial, let's dive deep into CLIP. We will show you how to
use the CLIP model to do zero-shot image classification in AutoGluon.
Simple Demo
-----------
Here we provide a simple demo to classify what dog breed is in the
picture below.
.. code:: python

    from IPython.display import Image, display

    from autogluon.multimodal import download

    url = "https://farm4.staticflickr.com/3445/3262471985_ed886bf61a_z.jpg"
    dog_image = download(url)

    pil_img = Image(filename=dog_image)
    display(pil_img)

.. parsed-literal::
    :class: output

    Downloading 3262471985_ed886bf61a_z.jpg from https://farm4.staticflickr.com/3445/3262471985_ed886bf61a_z.jpg...

.. figure:: output_clip_zeroshot_aa275c_1_2.jpg
Normally, to solve this task you would need to collect some training
data (e.g., `the Stanford Dogs
dataset <http://vision.stanford.edu/aditya86/ImageNetDogs/>`__) and
train a dog breed classifier. But with CLIP, all you need to do is
provide a list of potential visual categories, and CLIP will handle the
rest for you.
.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(hyperparameters={"model.names": ["clip"]}, problem_type="zero_shot")
    prob = predictor.predict_proba({"image": [dog_image]},
                                   {"text": ['This is a Husky', 'This is a Golden Retriever',
                                             'This is a German Shepherd', 'This is a Samoyed']})
    print("Label probs:", prob)

.. parsed-literal::
    :class: output

    Label probs: [[5.6800199e-01 3.4310840e-04 4.1589692e-01 1.5758043e-02]]
Clearly, according to the probabilities, we know there is a Husky in the
photo (which I think is correct)!
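If you only want the winning class name rather than the full probability
vector, a little post-processing is enough. The snippet below is a minimal
sketch that assumes ``prob`` is the NumPy array printed above and simply
takes the argmax over the candidate texts.

.. code:: python

    import numpy as np

    candidates = ['This is a Husky', 'This is a Golden Retriever',
                  'This is a German Shepherd', 'This is a Samoyed']
    # ``prob`` has shape (num_images, num_texts); pick the best-matching text per image.
    best_idx = int(np.argmax(prob[0]))
    print("Predicted:", candidates[best_idx])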
Let’s try a harder example. Below is a photo of two Segways. This object
class is not common in most existing vision datasets.
.. code:: python

    url = "https://live.staticflickr.com/7236/7114602897_9cf00b2820_b.jpg"
    segway_image = download(url)

    pil_img = Image(filename=segway_image)
    display(pil_img)

.. parsed-literal::
    :class: output

    Downloading 7114602897_9cf00b2820_b.jpg from https://live.staticflickr.com/7236/7114602897_9cf00b2820_b.jpg...

.. figure:: output_clip_zeroshot_aa275c_5_2.jpg
Given several text queries, CLIP can still predict the segway class
correctly with high confidence.
.. code:: python

    prob = predictor.predict_proba({"image": [segway_image]},
                                   {"text": ['segway', 'bicycle', 'wheel', 'car']})
    print("Label probs:", prob)

.. parsed-literal::
    :class: output

    Label probs: [[9.9997151e-01 5.8744063e-06 2.0352767e-05 2.2921422e-06]]
This is amazing, right? Now let's look briefly at why and how CLIP works.
CLIP stands for Contrastive Language-Image Pre-training. It is trained on
a massive amount of data (400M image-text pairs). With a simple contrastive
objective, CLIP learns to predict which text, out of a set of randomly
sampled candidates, is actually paired with a given image in the training
dataset. As a result, CLIP models can be applied to nearly arbitrary visual
classification tasks, just like the examples we have shown above.
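To make this concrete, below is a minimal sketch of CLIP's zero-shot scoring
outside of AutoGluon, using OpenAI's reference ``clip`` package (an extra
dependency that this tutorial does not require; AutoMM runs the equivalent
steps internally). The image and the candidate texts are embedded into a
shared space, their cosine similarities are computed, and a softmax over the
candidates turns the similarities into probabilities.

.. code:: python

    # Illustrative only: requires ``pip install git+https://github.com/openai/CLIP.git``.
    import clip
    import torch
    from PIL import Image as PILImage  # aliased to avoid clashing with IPython's Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Embed the image and the candidate class prompts into the same space.
    image = preprocess(PILImage.open(dog_image)).unsqueeze(0).to(device)
    texts = clip.tokenize(['This is a Husky', 'This is a Golden Retriever',
                           'This is a German Shepherd', 'This is a Samoyed']).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(texts)
        # Cosine similarity = dot product of L2-normalized embeddings; softmax over texts.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    print("Label probs:", probs.cpu().numpy())

Note that the exact numbers may differ from the AutoGluon output above,
since the underlying CLIP checkpoint is not necessarily the same.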
More about CLIP
---------------
CLIP is powerful, and it was designed to mitigate several major
problems in the standard deep learning approach to computer vision, such
as costly datasets, closed-set prediction and poor generalization
performance. CLIP is a good solution to many problems; however, it is
not the ultimate one, and it has its own limitations. For example,
CLIP is vulnerable to typographic attacks: if you add some text to
an image, CLIP's predictions are easily swayed by that text. Let's
see one example from OpenAI's blog post on `multimodal
neurons <https://openai.com/blog/multimodal-neurons/>`__.
Suppose we have a photo of a Granny Smith apple,
.. code:: python

    url = "https://cdn.openai.com/multimodal-neurons/assets/apple/apple-blank.jpg"
    image_path = download(url)

    pil_img = Image(filename=image_path)
    display(pil_img)

.. parsed-literal::
    :class: output

    Downloading apple-blank.jpg from https://cdn.openai.com/multimodal-neurons/assets/apple/apple-blank.jpg...

.. figure:: output_clip_zeroshot_aa275c_9_2.jpg
We then try to classify this image into several classes, such as Granny
Smith, iPod, library, pizza, toaster and dough.
.. code:: python

    prob = predictor.predict_proba({"image": [image_path]},
                                   {"text": ['Granny Smith', 'iPod', 'library', 'pizza', 'toaster', 'dough']})
    print("Label probs:", prob)

.. parsed-literal::
    :class: output

    Label probs: [[9.9851769e-01 1.2587471e-03 1.6506421e-05 4.3621585e-05 7.1467890e-05
      9.1955466e-05]]
We can see that zero-shot classification works great here; it predicts
Granny Smith with almost 100% confidence. But if we attach a piece of
text to the apple like this,
.. code:: python

    url = "https://cdn.openai.com/multimodal-neurons/assets/apple/apple-ipod.jpg"
    image_path = download(url)

    pil_img = Image(filename=image_path)
    display(pil_img)

.. parsed-literal::
    :class: output

    Downloading apple-ipod.jpg from https://cdn.openai.com/multimodal-neurons/assets/apple/apple-ipod.jpg...

.. figure:: output_clip_zeroshot_aa275c_13_2.jpg
Then we use the same class names to perform zero-shot classification,
.. code:: python

    prob = predictor.predict_proba({"image": [image_path]},
                                   {"text": ['Granny Smith', 'iPod', 'library', 'pizza', 'toaster', 'dough']})
    print("Label probs:", prob)

.. parsed-literal::
    :class: output

    Label probs: [[2.4376834e-02 9.7554350e-01 2.7068463e-06 8.9303450e-07 2.8151650e-05
      4.7902555e-05]]
Suddenly, the apple becomes an iPod.
CLIP also has other limitations. If you are interested, you can read the
`CLIP paper <https://arxiv.org/abs/2103.00020>`__ for more details. Or
you can stay here and play with your own examples!
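For instance, the template below reuses the predictor on your own data; the
image path and the candidate labels are placeholders for you to replace.

.. code:: python

    # Replace the placeholder path and labels below with your own image and candidate classes.
    my_image = "my_image.jpg"  # hypothetical local file path
    my_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

    prob = predictor.predict_proba({"image": [my_image]}, {"text": my_labels})
    print("Label probs:", prob)

Prompts phrased as natural sentences (e.g., "a photo of a cat") often work
well with CLIP.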
Other Examples
--------------
You may go to `AutoMM
Examples <https://github.com/autogluon/autogluon/tree/master/examples/automm>`__
to explore other examples about AutoMM.
Customization
-------------
To learn how to customize AutoMM, please refer to
:ref:`sec_automm_customization`.