.. _sec_automm_clip_embedding:

Image-Text Semantic Matching with AutoMM - Zero-Shot
=======================================================

The task of image-text semantic matching refers to measuring the
visual-semantic similarity between an image and a sentence. AutoMM
supports zero-shot image-text matching by leveraging the powerful
`CLIP `__. Thanks to its contrastive training objective and the
millions of image-text pairs it was trained on, CLIP learns good
embeddings for both vision and language, as well as the connections
between them. Hence, we can use it to extract embeddings for retrieval
and matching.

CLIP has a two-tower architecture, which means it has two encoders:
one for images, the other for text. An overview of the CLIP model is
shown in the diagram below. The left part shows its pre-training stage,
and the right part shows its zero-shot prediction stage. By computing
the cosine similarity scores between one image embedding and all the
text embeddings, we pick the text with the highest similarity as the
prediction. Given the two encoders, we can extract image embeddings
and text embeddings separately. Most importantly, embedding extraction
can be done offline; only the similarity computation needs to happen
online, so this approach scales well.

|CLIP|

In this tutorial, we will show how AutoMM's easy-to-use APIs bring the
powerful CLIP to you.

Prepare Demo Data
-----------------

First, let's get some texts and download some images. These images are
from `COCO datasets `__.

.. |CLIP| image:: https://github.com/openai/CLIP/raw/main/CLIP.png

.. code:: python

    from autogluon.multimodal import download

    texts = [
        "A cheetah chases prey on across a field.",
        "A man is eating a piece of bread.",
        "The girl is carrying a baby.",
        "There is an airplane over a car.",
        "A man is riding a horse.",
        "Two men pushed carts through the woods.",
        "There is a carriage in the image.",
        "A man is riding a white horse on an enclosed ground.",
        "A monkey is playing drums.",
    ]

    urls = ['http://farm4.staticflickr.com/3179/2872917634_f41e6987a8_z.jpg',
            'http://farm4.staticflickr.com/3629/3608371042_75f9618851_z.jpg',
            'https://farm4.staticflickr.com/3795/9591251800_9c9727e178_z.jpg',
            'http://farm8.staticflickr.com/7188/6848765123_252bfca33d_z.jpg',
            'https://farm6.staticflickr.com/5251/5548123650_1a69ce1e34_z.jpg']

    image_paths = [download(url) for url in urls]

.. parsed-literal::
    :class: output

    Downloading 2872917634_f41e6987a8_z.jpg from http://farm4.staticflickr.com/3179/2872917634_f41e6987a8_z.jpg...
    Downloading 3608371042_75f9618851_z.jpg from http://farm4.staticflickr.com/3629/3608371042_75f9618851_z.jpg...
    Downloading 9591251800_9c9727e178_z.jpg from https://farm4.staticflickr.com/3795/9591251800_9c9727e178_z.jpg...
    Downloading 6848765123_252bfca33d_z.jpg from http://farm8.staticflickr.com/7188/6848765123_252bfca33d_z.jpg...
    Downloading 5548123650_1a69ce1e34_z.jpg from https://farm6.staticflickr.com/5251/5548123650_1a69ce1e34_z.jpg...

Extract Embeddings
------------------

We need to use ``image_text_similarity`` as the problem type when
initializing the predictor.

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(problem_type="image_text_similarity")
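Before extracting real embeddings with the predictor, here is a minimal
sketch of the cosine-similarity idea described above: the same
normalize-then-dot-product pattern underlies every retrieval step
below. It uses plain PyTorch on random tensors, so the numbers are
meaningless and nothing here is part of the AutoMM API; only the shapes
and the pattern matter.

.. code:: python

    import torch
    import torch.nn.functional as F

    # Dummy stand-ins for CLIP embeddings: 5 images and 9 texts, 512-dim each.
    image_emb = torch.randn(5, 512)
    text_emb = torch.randn(9, 512)

    # Cosine similarity = dot product of L2-normalized vectors.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    similarity = text_emb @ image_emb.T  # shape: (9, 5)

    # For each text, the best-matching image is the column with the highest score.
    print(similarity.argmax(dim=-1))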
Let's extract image and text embeddings separately. The image data and
the text data go through their respective encoders.

.. code:: python

    image_embeddings = predictor.extract_embedding(image_paths, as_tensor=True)
    print(image_embeddings.shape)

.. parsed-literal::
    :class: output

    torch.Size([5, 512])

.. code:: python

    text_embeddings = predictor.extract_embedding(texts, as_tensor=True)
    print(text_embeddings.shape)

.. parsed-literal::
    :class: output

    torch.Size([9, 512])

You can then use the embeddings for a range of tasks such as image
retrieval and text retrieval.

Image Retrieval with Text Query
-------------------------------

Suppose we have a large image database (e.g., video footage) and want
to retrieve some images defined by a text query. How can we do this?
It is simple. First, extract all the image embeddings offline, as shown
above. Then, extract the text query's embedding. Finally, compute the
cosine similarities between the text embedding and all the image
embeddings and return the top candidates.

Suppose we use the text below as the query.

.. code:: python

    print(texts[6])

.. parsed-literal::
    :class: output

    There is a carriage in the image.

You can directly call our utility function ``semantic_search`` to
search for semantically similar images.

.. code:: python

    from autogluon.multimodal.utils import semantic_search

    hits = semantic_search(
        matcher=predictor,
        query_embeddings=text_embeddings[6][None,],
        response_embeddings=image_embeddings,
        top_k=5,
    )
    print(hits)

.. parsed-literal::
    :class: output

    [[{'response_id': 2, 'score': 0.27443650364875793}, {'response_id': 4, 'score': 0.22441968321800232}, {'response_id': 0, 'score': 0.2186582088470459}, {'response_id': 1, 'score': 0.2170213907957077}, {'response_id': 3, 'score': 0.20664750039577484}]]

We can see that we have successfully found the image with a carriage
in it.

.. code:: python

    from IPython.display import Image, display

    pil_img = Image(filename=image_paths[hits[0][0]["response_id"]])
    display(pil_img)

.. figure:: output_zero_shot_img_txt_matching_fe5797_12_0.jpg

Text Retrieval with Image Query
-------------------------------

Similarly, given a text database and an image query, we can search for
texts that match the image. For example, let's search for texts that
match the following image.

.. code:: python

    pil_img = Image(filename=image_paths[4])
    display(pil_img)

.. figure:: output_zero_shot_img_txt_matching_fe5797_14_0.jpg

We still use the ``semantic_search`` function, but swap the assignments
of ``query_embeddings`` and ``response_embeddings``.

.. code:: python

    hits = semantic_search(
        matcher=predictor,
        query_embeddings=image_embeddings[4][None,],
        response_embeddings=text_embeddings,
        top_k=5,
    )
    print(hits)

.. parsed-literal::
    :class: output

    [[{'response_id': 3, 'score': 0.2555188238620758}, {'response_id': 6, 'score': 0.2244196981191635}, {'response_id': 7, 'score': 0.18543538451194763}, {'response_id': 2, 'score': 0.1827915459871292}, {'response_id': 4, 'score': 0.17795433104038239}]]

We can observe that the top-1 text matches the query image.

.. code:: python

    texts[hits[0][0]["response_id"]]

.. parsed-literal::
    :class: output

    'There is an airplane over a car.'
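For reference, the two ``semantic_search`` calls above conceptually
boil down to the normalize / dot-product / top-k pattern sketched
below. This is only a rough illustration that reuses the
``image_embeddings``, ``text_embeddings``, and ``texts`` defined
earlier; its scores are not guaranteed to match ``semantic_search``
exactly.

.. code:: python

    import torch.nn.functional as F

    # Rank all candidate texts for every image at once (illustration only).
    img = F.normalize(image_embeddings, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    scores = img @ txt.T  # shape: (5, 9)

    top_scores, top_ids = scores.topk(k=3, dim=-1)
    for i in range(scores.shape[0]):
        print(f"Image {i}:")
        for score, j in zip(top_scores[i].tolist(), top_ids[i].tolist()):
            print(f"  {score:.3f}  {texts[j]}")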
Predict Whether Image-Text Pairs Match
--------------------------------------

In addition to retrieval, we can let the predictor tell us whether
image-text pairs match. To do so, we need to initialize the predictor
with the additional arguments ``query`` and ``response``, which hold
the column names of the two sides of a pair (one side contains images
and the other contains texts).

.. code:: python

    predictor = MultiModalPredictor(
        query="abc",
        response="xyz",
        problem_type="image_text_similarity",
    )

Given image-text pairs, we can make predictions.

.. code:: python

    pred = predictor.predict({"abc": [image_paths[4]], "xyz": [texts[3]]})
    print(pred)

.. parsed-literal::
    :class: output

    [1]

Predict Matching Probabilities
------------------------------

It is also easy to predict the matching probabilities. You can then
make predictions by applying customized thresholds to the
probabilities.

.. code:: python

    proba = predictor.predict_proba({"abc": [image_paths[4]], "xyz": [texts[3]]})
    print(proba)

.. parsed-literal::
    :class: output

    [[0.3722638 0.6277362]]

A minimal sketch of applying such a custom threshold is given at the
end of this tutorial.

Other Examples
--------------

You may go to `AutoMM Examples `__ to explore other examples about
AutoMM.

Customization
-------------

To learn how to customize AutoMM, please refer to
:ref:`sec_automm_customization`.
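Finally, as mentioned in the matching-probability section above, here
is a minimal sketch of turning probabilities into match decisions with
a custom threshold. The pair data and the ``0.8`` cutoff are arbitrary
illustrative choices, and the second probability column is assumed to
correspond to the "match" class, consistent with the output shown
earlier.

.. code:: python

    # Reuse the predictor initialized with query="abc" and response="xyz" above.
    pairs = {"abc": [image_paths[4], image_paths[0]], "xyz": [texts[3], texts[3]]}
    proba = predictor.predict_proba(pairs)

    threshold = 0.8  # arbitrary cutoff, chosen for illustration only
    matches = proba[:, 1] >= threshold  # column 1: probability of a match
    print(matches)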