.. _image2image_matching:
Image-to-Image Semantic Matching with AutoMM
============================================
Computing the similarity between two images is a common task in computer
vision, with practical applications such as detecting whether two photos
show the same product. In general, an image similarity model takes two
images as input and transforms them into vectors, and similarity scores
computed with cosine similarity, dot product, or Euclidean distance are
then used to measure how alike or different the two images are.
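To make this concrete, here is a minimal sketch (not part of the tutorial
workflow) that scores two hypothetical image embeddings with cosine
similarity and Euclidean distance; the random vectors simply stand in for
real image features.

.. code:: python

    import numpy as np

    # Random stand-ins for the embeddings a vision model would produce.
    emb_a = np.random.rand(512)
    emb_b = np.random.rand(512)

    # Cosine similarity: dot product of the L2-normalized vectors (higher = more alike).
    cosine_sim = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))

    # Euclidean distance: smaller values mean the embeddings are closer.
    euclidean_dist = np.linalg.norm(emb_a - emb_b)

    print(f"cosine similarity: {cosine_sim:.4f}, euclidean distance: {euclidean_dist:.4f}")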
.. code:: python
import os
import pandas as pd
import warnings
from IPython.display import Image, display
warnings.filterwarnings('ignore')
Prepare your Data
-----------------
In this tutorial, we will demonstrate how to use AutoMM for
image-to-image semantic matching with a simplified version of the
Stanford Online Products (SOP) dataset.
The Stanford Online Products dataset was introduced for metric learning.
It contains 12 categories of products: bicycle, cabinet, chair, coffee
maker, fan, kettle, lamp, mug, sofa, stapler, table, and toaster. Each
category contains a number of products, and each product has several
images captured from different views. Here, we consider different views
of the same product as positive pairs (labeled as 1) and images from
different products as negative pairs (labeled as 0).
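For reference, such pair annotations form a simple table with two
image-path columns and a binary label. The toy example below sketches
that format; the file names are made up and are not part of the dataset.

.. code:: python

    import pandas as pd

    # Toy illustration of the pair-annotation format; file names are fictitious.
    toy_pairs = pd.DataFrame({
        "Image1": ["bicycle_001_view1.jpg", "bicycle_001_view2.jpg"],
        "Image2": ["bicycle_001_view2.jpg", "kettle_042_view1.jpg"],
        "Label": [1, 0],  # 1: views of the same product, 0: different products
    })
    print(toy_pairs)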
The following code downloads the dataset and unzips the images and
annotation files.
.. code:: python
download_dir = './ag_automm_tutorial_img2img'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/Stanford_Online_Products.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)
.. parsed-literal::
:class: output
Downloading ./ag_automm_tutorial_img2img/file.zip from https://automl-mm-bench.s3.amazonaws.com/Stanford_Online_Products.zip...
.. parsed-literal::
:class: output
100%|██████████| 3.08G/3.08G [01:21<00:00, 37.7MiB/s]
Then we can load the annotations into dataframes.
.. code:: python
dataset_path = os.path.join(download_dir, 'Stanford_Online_Products')
train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)
image_col_1 = "Image1"
image_col_2 = "Image2"
label_col = "Label"
match_label = 1
Here you need to specify ``match_label``, the label value indicating
that a pair is a semantic match. In this demo dataset, we use 1 since we
assigned 1 to image pairs from the same product. Choose ``match_label``
according to the labeling convention of your own task.
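As a hypothetical illustration (not this dataset), if matching pairs were
labeled with the string ``"duplicate"`` and non-matching pairs with
``"different"``, you would point ``match_label`` at the string that marks
a match.

.. code:: python

    # Hypothetical labeling scheme, shown only to illustrate match_label;
    # the demo data in this tutorial uses 1 / 0 instead.
    # label_col = "Label"
    # match_label = "duplicate"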
Next, we expand the image paths since the original paths are relative.
.. code:: python
def path_expander(path, base_folder):
path_l = path.split(';')
return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])
for image_col in [image_col_1, image_col_2]:
train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
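You can quickly check that the expansion worked; each entry should now be
an absolute path under ``dataset_path``.

.. code:: python

    # Inspect one expanded path (the exact value depends on your machine).
    print(train_data[image_col_1][0])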
The annotations are simply pairs of image paths with binary labels
(1 means the pair matches, 0 means it does not).
.. code:: python
train_data.head()
.. parsed-literal::
    :class: output

                                                  Image1                                             Image2  Label
    0  /home/ci/autogluon/docs/_build/eval/tutorials/...  /home/ci/autogluon/docs/_build/eval/tutorials/...      0
    1  /home/ci/autogluon/docs/_build/eval/tutorials/...  /home/ci/autogluon/docs/_build/eval/tutorials/...      1
    2  /home/ci/autogluon/docs/_build/eval/tutorials/...  /home/ci/autogluon/docs/_build/eval/tutorials/...      0
    3  /home/ci/autogluon/docs/_build/eval/tutorials/...  /home/ci/autogluon/docs/_build/eval/tutorials/...      1
    4  /home/ci/autogluon/docs/_build/eval/tutorials/...  /home/ci/autogluon/docs/_build/eval/tutorials/...      1
Let’s visualize a matching image pair.
.. code:: python
pil_img = Image(filename=train_data[image_col_1][5])
display(pil_img)
.. figure:: output_image2image_matching_53fb84_11_0.jpg
.. code:: python
pil_img = Image(filename=train_data[image_col_2][5])
display(pil_img)
.. figure:: output_image2image_matching_53fb84_12_0.jpg
Here are two images that do not match.
.. code:: python
pil_img = Image(filename=train_data[image_col_1][0])
display(pil_img)
.. figure:: output_image2image_matching_53fb84_14_0.jpg
.. code:: python
pil_img = Image(filename=train_data[image_col_2][0])
display(pil_img)
.. figure:: output_image2image_matching_53fb84_15_0.jpg
Train your Model
----------------
Ideally, we want to obtain a model that returns high scores for positive
image pairs and low scores for negative ones. With AutoMM, we can easily
train a model that captures the semantic relationship between images.
Basically, it uses a Swin Transformer to project each image into a
high-dimensional vector and computes the cosine similarity between the
feature vectors.

With AutoMM, you just need to specify the ``query``, ``response``, and
``label`` column names and fit the model on the training dataset without
worrying about the implementation details.
.. code:: python
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(
problem_type="image_similarity",
query=image_col_1, # the column name of the first image
response=image_col_2, # the column name of the second image
label=label_col, # the label column name
match_label=match_label, # the label indicating that query and response have the same semantic meanings.
eval_metric='auc', # the evaluation metric
)
# Fit the model
predictor.fit(
train_data=train_data,
time_limit=180,
)
.. parsed-literal::
:class: output
INFO:pytorch_lightning.utilities.seed:Global seed set to 123
INFO:pytorch_lightning.trainer.connectors.accelerator_connector:Auto select gpus: [0]
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit native Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
| Name | Type | Params
----------------------------------------------------------------------
0 | query_model | TimmAutoModelForImagePrediction | 86.7 M
1 | response_model | TimmAutoModelForImagePrediction | 86.7 M
2 | validation_metric | AUROC | 0
3 | loss_func | ContrastiveLoss | 0
4 | miner_func | PairMarginMiner | 0
----------------------------------------------------------------------
86.7 M Trainable params
0 Non-trainable params
86.7 M Total params
173.486 Total estimated model params size (MB)
INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 15: 'val_roc_auc' reached 0.90233 (best 0.90233), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/matching/AutogluonModels/ag-20221117_032940/epoch=0-step=15.ckpt' as top 3
INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 32: 'val_roc_auc' reached 0.93922 (best 0.93922), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/matching/AutogluonModels/ag-20221117_032940/epoch=0-step=32.ckpt' as top 3
INFO:pytorch_lightning.utilities.rank_zero:Time limit reached. Elapsed time is 0:03:00. Signaling Trainer to stop.
INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 39: 'val_roc_auc' reached 0.94325 (best 0.94325), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/matching/AutogluonModels/ag-20221117_032940/epoch=1-step=39.ckpt' as top 3
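If you want to experiment with a different image backbone, ``fit`` also
accepts a ``hyperparameters`` dictionary. The key and checkpoint name in
the sketch below follow AutoMM's timm-based image models but are
assumptions; verify them against your AutoGluon version before use.

.. code:: python

    # Sketch only: swap the image backbone via hyperparameters.
    # The hyperparameter key and checkpoint name are assumptions; check the
    # AutoMM customization docs for the exact names in your version.
    # predictor.fit(
    #     train_data=train_data,
    #     hyperparameters={
    #         "model.timm_image.checkpoint_name": "swin_small_patch4_window7_224",
    #     },
    #     time_limit=180,
    # )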
Evaluate on Test Dataset
------------------------
You can evaluate the predictor on the test dataset to see how it
performs with the roc_auc score:
.. code:: python
score = predictor.evaluate(test_data)
print("evaluation score: ", score)
.. parsed-literal::
:class: output
evaluation score: {'roc_auc': 0.946946477834744}
Predict on Image Pairs
----------------------
Given new image pairs, we can predict whether they match or not.
.. code:: python
pred = predictor.predict(test_data.head(3))
print(pred)
.. parsed-literal::
:class: output
0 1
1 1
2 1
Name: Label, dtype: int64
The predictions use a naive probability threshold of 0.5. That is, a
pair is predicted as matching when its matching probability is larger
than 0.5.
Predict Matching Probabilities
------------------------------
However, you can obtain the predicted probabilities and apply your own
threshold.
.. code:: python
proba = predictor.predict_proba(test_data.head(3))
print(proba)
.. parsed-literal::
:class: output
0 1
0 0.368261 0.631739
1 0.047328 0.952672
2 0.112161 0.887839
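For instance, you could require a stricter cutoff than 0.5 on the
probability of the matching class (the second column above, corresponding
to label 1). The 0.9 threshold below is only an illustration.

.. code:: python

    # Apply a custom threshold to the probability of the matching class.
    # The second column of proba holds the probability of label 1 (the match_label).
    custom_pred = (proba.iloc[:, 1] > 0.9).astype(int)
    print(custom_pred)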
Extract Embeddings
------------------
You can also extract embeddings for each image of a pair.
.. code:: python
embeddings_1 = predictor.extract_embedding({image_col_1: test_data[image_col_1][:5].tolist()})
print(embeddings_1.shape)
embeddings_2 = predictor.extract_embedding({image_col_2: test_data[image_col_2][:5].tolist()})
print(embeddings_2.shape)
.. parsed-literal::
:class: output
(5, 1024)
(5, 1024)
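As a sanity check, you can score these pairs yourself with cosine
similarity between the corresponding embedding rows. Note that this
manual score is only a sketch and does not necessarily reproduce the
exact matching probability computed by the predictor.

.. code:: python

    import numpy as np

    # Row-wise cosine similarity between the two embedding matrices.
    norm_1 = embeddings_1 / np.linalg.norm(embeddings_1, axis=1, keepdims=True)
    norm_2 = embeddings_2 / np.linalg.norm(embeddings_2, axis=1, keepdims=True)
    cosine_scores = (norm_1 * norm_2).sum(axis=1)
    print(cosine_scores)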
Other Examples
--------------
You may go to the AutoMM Examples in the AutoGluon GitHub repository to
explore more examples of AutoMM.
Customization
-------------
To learn how to customize AutoMM, please refer to
:ref:`sec_automm_customization`.