.. _sec_automm_clip_embedding:

CLIP in AutoMM - Extract Embeddings
===================================

We have shown CLIP's amazing capability in performing zero-shot image classification in our previous tutorial :ref:`sec_automm_clip_zeroshot_imgcls`. Thanks to the contrastive loss objective and training on millions of image-text pairs, CLIP learns good embeddings for both vision and language, as well as the connections between them. Hence, another important use case of CLIP is to extract embeddings for retrieval, matching, and ranking tasks. In this tutorial, we will show you how to use AutoGluon to extract embeddings from CLIP, and then use them for a retrieval problem.

Extract Embeddings
------------------

CLIP has a two-tower architecture, which means it has two encoders: one for images, the other for text. An overview of the CLIP model is shown in the diagram below: the left side shows its pre-training stage, and the right side shows its zero-shot prediction stage. By computing the cosine similarity scores between one image embedding and all the text embeddings, we pick the text with the highest similarity as the prediction.

.. figure:: https://github.com/openai/CLIP/raw/main/CLIP.png

    CLIP

Given the two encoders, we can extract image embeddings or text embeddings. Most importantly, embedding extraction can be done offline; only the similarity computation needs to be done online. This means good scalability.

First, let's download some images. These images are from `COCO datasets `__.

.. code:: python

    from autogluon.multimodal import download

    urls = ['http://farm4.staticflickr.com/3179/2872917634_f41e6987a8_z.jpg',
            'http://farm4.staticflickr.com/3629/3608371042_75f9618851_z.jpg',
            'https://farm4.staticflickr.com/3795/9591251800_9c9727e178_z.jpg',
            'http://farm8.staticflickr.com/7188/6848765123_252bfca33d_z.jpg',
            'https://farm6.staticflickr.com/5251/5548123650_1a69ce1e34_z.jpg']
    image_paths = [download(url) for url in urls]
    print(image_paths)

.. parsed-literal::
    :class: output

    Downloading 2872917634_f41e6987a8_z.jpg from http://farm4.staticflickr.com/3179/2872917634_f41e6987a8_z.jpg...
    Downloading 3608371042_75f9618851_z.jpg from http://farm4.staticflickr.com/3629/3608371042_75f9618851_z.jpg...
    Downloading 9591251800_9c9727e178_z.jpg from https://farm4.staticflickr.com/3795/9591251800_9c9727e178_z.jpg...
    Downloading 6848765123_252bfca33d_z.jpg from http://farm8.staticflickr.com/7188/6848765123_252bfca33d_z.jpg...
    Downloading 5548123650_1a69ce1e34_z.jpg from https://farm6.staticflickr.com/5251/5548123650_1a69ce1e34_z.jpg...
    ['2872917634_f41e6987a8_z.jpg', '3608371042_75f9618851_z.jpg', '9591251800_9c9727e178_z.jpg', '6848765123_252bfca33d_z.jpg', '5548123650_1a69ce1e34_z.jpg']

Let's extract image embeddings from the CLIP vision encoder:

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(hyperparameters={"model.names": ["clip"]}, problem_type="zero_shot")

    # extract image embeddings.
    image_embeddings = predictor.extract_embedding({"image": image_paths})
    print(image_embeddings['image'].shape)  # image (5, 768)
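To illustrate the retrieval use case mentioned above, here is a minimal sketch that extracts text embeddings in the same way and ranks the downloaded images against each text query by cosine similarity. The example captions are made up for illustration, and the assumption that ``extract_embedding`` returns a dictionary keyed by the input column name (``"text"``, mirroring ``"image"`` above) should be checked against your AutoGluon version.

.. code:: python

    import numpy as np

    from autogluon.multimodal import MultiModalPredictor

    # Reuse the zero-shot CLIP predictor and the `image_paths` downloaded above.
    predictor = MultiModalPredictor(hyperparameters={"model.names": ["clip"]}, problem_type="zero_shot")

    # Hypothetical text queries; replace them with your own captions.
    texts = [
        "A cheetah chasing its prey across a grassy plain.",
        "A man riding a wave on top of a surfboard.",
    ]

    # Extract embeddings from both towers. The returned dictionaries are assumed
    # to be keyed by the input column names ("text" / "image").
    text_embeddings = predictor.extract_embedding({"text": texts})
    image_embeddings = predictor.extract_embedding({"image": image_paths})


    def cosine_similarity(a, b):
        """Cosine similarity between every row of `a` and every row of `b`."""
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T


    # For each text query, rank the images and keep the best match.
    scores = cosine_similarity(text_embeddings["text"], image_embeddings["image"])
    for text, best_idx in zip(texts, scores.argmax(axis=1)):
        print(f"{text!r} -> {image_paths[best_idx]}")

Because the image embeddings do not depend on the queries, they can be computed and stored offline; at serving time only the text query needs to be embedded and compared against them, which is what makes this approach scale well.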
Other Examples
--------------

You may go to `AutoMM Examples `__ to explore other examples about AutoMM.

Customization
-------------

To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.