{ "cells": [ { "cell_type": "markdown", "id": "f33ce061", "metadata": {}, "source": [ "# Image-Text Semantic Matching with AutoMM - Zero-Shot\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/semantic_matching/zero_shot_img_txt_matching.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/semantic_matching/zero_shot_img_txt_matching.ipynb)\n", "\n", "\n", "\n", "The task of image-text semantic matching is to measure the visual-semantic similarity between an image and a sentence. AutoMM supports zero-shot image-text matching by leveraging the powerful [CLIP](https://github.com/openai/CLIP). \n", "Trained with a contrastive loss on millions of image-text pairs, CLIP learns good embeddings for both vision and language, as well as the connections between them. Hence, we can use it to extract embeddings for retrieval and matching.\n", "\n", "CLIP has a two-tower architecture, which means it has two encoders: one for images and the other for text. An overview of the CLIP model can be seen in the diagram below. The left side shows its pre-training stage, and the right side shows its zero-shot prediction stage. By computing the cosine similarity scores between one image embedding and all the text embeddings, we pick the text with the highest similarity as the prediction.\n", "\n", "Given the two encoders, we can extract image embeddings and text embeddings independently. Most importantly, embedding extraction can be done offline, so only the similarity computation needs to be done online, which makes the approach scale well. 
\n", "![CLIP](https://github.com/openai/CLIP/raw/main/CLIP.png)\n", "\n", "\n", "In this tutorial, we will show how AutoMM's easy-to-use APIs make the powerful CLIP readily available.\n", "\n", "## Prepare Demo Data\n", "First, let's get some texts and download some images. These images are from the [COCO dataset](https://cocodataset.org/#home)." ] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon.multimodal\n" ] }, { "cell_type": "code", "execution_count": null, "id": "121f9995", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal import download\n", "\n", "# Candidate sentences to match against the images.\n", "texts = [\n", " \"A cheetah chases prey on across a field.\",\n", " \"A man is eating a piece of bread.\",\n", " \"The girl is carrying a baby.\",\n", " \"There is an airplane over a car.\",\n", " \"A man is riding a horse.\",\n", " \"Two men pushed carts through the woods.\",\n", " \"There is a carriage in the image.\",\n", " \"A man is riding a white horse on an enclosed ground.\",\n", " \"A monkey is playing drums.\",\n", "]\n", "\n", "urls = ['http://farm4.staticflickr.com/3179/2872917634_f41e6987a8_z.jpg',\n", " 'http://farm4.staticflickr.com/3629/3608371042_75f9618851_z.jpg',\n", " 'https://farm4.staticflickr.com/3795/9591251800_9c9727e178_z.jpg',\n", " 'http://farm8.staticflickr.com/7188/6848765123_252bfca33d_z.jpg',\n", " 'https://farm6.staticflickr.com/5251/5548123650_1a69ce1e34_z.jpg']\n", "\n", "# Download each image and keep its local file path.\n", "image_paths = [download(url) for url in urls]" ] }, { "cell_type": "markdown", "id": "c6c8181e", "metadata": {}, "source": [ "## Extract Embeddings\n", "\n", "We need to use `image_text_similarity` as the problem type when initializing the predictor." 
] }, { "cell_type": "code", "execution_count": null, "id": "80fffdb3", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal import MultiModalPredictor\n", "predictor = MultiModalPredictor(problem_type=\"image_text_similarity\")" ] }, { "cell_type": "markdown", "id": "1ed02a5c", "metadata": {}, "source": [ "Let's extract image and text embeddings separately. Each modality goes through its corresponding encoder." ] }, { "cell_type": "code", "execution_count": null, "id": "b29f7181", "metadata": {}, "outputs": [], "source": [ "image_embeddings = predictor.extract_embedding(image_paths, as_tensor=True)\n", "print(image_embeddings.shape)" ] }, { "cell_type": "code", "execution_count": null, "id": "d60ca2f7", "metadata": {}, "outputs": [], "source": [ "text_embeddings = predictor.extract_embedding(texts, as_tensor=True)\n", "print(text_embeddings.shape)" ] }, { "cell_type": "markdown", "id": "6dd78319", "metadata": {}, "source": [ "You can then use the embeddings for a range of tasks such as image retrieval and text retrieval. \n", "\n", "\n", "## Image Retrieval with Text Query\n", "\n", "Suppose we have a large image database (e.g., video footage) and want to retrieve the images that match a text query. How can we do this? \n", "\n", "It is simple. First, extract all the image embeddings offline as shown above. Then, extract the text query's embedding. Finally, compute the cosine similarities between the text embedding and all the image embeddings and return the top candidates. \n", "\n", "Suppose we use the text below as the query." ] }, { "cell_type": "code", "execution_count": null, "id": "e2bc98ca", "metadata": {}, "outputs": [], "source": [ "print(texts[6])" ] }, { "cell_type": "markdown", "id": "a9837b9e", "metadata": {}, "source": [ "You can directly call our utility function `semantic_search` to search for semantically similar images." 
] }, { "cell_type": "code", "execution_count": null, "id": "0c99d1a9", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal.utils import semantic_search\n", "hits = semantic_search(\n", " matcher=predictor,\n", " query_embeddings=text_embeddings[6][None,],\n", " response_embeddings=image_embeddings,\n", " top_k=5,\n", " )\n", "print(hits)" ] }, { "cell_type": "markdown", "id": "cb4cb6cf", "metadata": {}, "source": [ "We can see that the search successfully finds the image with a carriage in it." ] }, { "cell_type": "code", "execution_count": null, "id": "0ce76615", "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image, display\n", "pil_img = Image(filename=image_paths[hits[0][0][\"response_id\"]])\n", "display(pil_img)" ] }, { "cell_type": "markdown", "id": "2364123d", "metadata": {}, "source": [ "## Text Retrieval with Image Query\n", "\n", "Similarly, given a text database and an image query, we can search for texts that match the image. For example, let's search for texts that match the following image." ] }, { "cell_type": "code", "execution_count": null, "id": "bd7ce6f8", "metadata": {}, "outputs": [], "source": [ "pil_img = Image(filename=image_paths[4])\n", "display(pil_img)" ] }, { "cell_type": "markdown", "id": "6ad58d67", "metadata": {}, "source": [ "We still use the `semantic_search` function, but swap the roles of `query_embeddings` and `response_embeddings`." ] }, { "cell_type": "code", "execution_count": null, "id": "9fb74e9e", "metadata": {}, "outputs": [], "source": [ "hits = semantic_search(\n", " matcher=predictor,\n", " query_embeddings=image_embeddings[4][None,],\n", " response_embeddings=text_embeddings,\n", " top_k=5,\n", " )\n", "print(hits)" ] }, { "cell_type": "markdown", "id": "0c9d0a78", "metadata": {}, "source": [ "We can observe that the top-1 text matches the query image." 
] }, { "cell_type": "code", "execution_count": null, "id": "fe53bb67", "metadata": {}, "outputs": [], "source": [ "texts[hits[0][0][\"response_id\"]]" ] }, { "cell_type": "markdown", "id": "c0fcfd6b", "metadata": {}, "source": [ "## Predict Whether Image-Text Pairs Match\n", "In addition to retrieval, we can let the predictor tell us whether image-text pairs match. \n", "To do so, we need to initialize the predictor with the additional arguments `query` and `response`, which give the names under which the two sides of each pair are passed in. Either side can hold the images while the other holds the texts." ] }, { "cell_type": "code", "execution_count": null, "id": "0166cac5", "metadata": {}, "outputs": [], "source": [ "predictor = MultiModalPredictor(\n", " query=\"abc\",\n", " response=\"xyz\",\n", " problem_type=\"image_text_similarity\",\n", " )" ] }, { "cell_type": "markdown", "id": "e49d9e51", "metadata": {}, "source": [ "Given image-text pairs, we can make predictions." ] }, { "cell_type": "code", "execution_count": null, "id": "f138eae0", "metadata": {}, "outputs": [], "source": [ "pred = predictor.predict({\"abc\": [image_paths[4]], \"xyz\": [texts[3]]})\n", "print(pred)" ] }, { "cell_type": "markdown", "id": "c2853edd", "metadata": {}, "source": [ "## Predict Matching Probabilities\n", "It is also easy to predict the matching probabilities. You can then make your own predictions by applying custom thresholds to the probabilities." ] }, { "cell_type": "code", "execution_count": null, "id": "9c87b9e0", "metadata": {}, "outputs": [], "source": [ "proba = predictor.predict_proba({\"abc\": [image_paths[4]], \"xyz\": [texts[3]]})\n", "print(proba)" ] }, { "cell_type": "markdown", "id": "317ea665", "metadata": {}, "source": [ "## Other Examples\n", "\n", "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples of AutoMM.\n", "\n", "\n", "## Customization\n", "\n", "To learn how to customize AutoMM, please refer to [Customize AutoMM](../advanced_topics/customization.ipynb)." 
] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }