.. _sec_automm_distillation_multilingual:

Knowledge Distillation in AutoMM
================================


Pretrained foundation models keep growing in size, which makes them
difficult to deploy where compute and memory are limited. Knowledge
distillation addresses this constraint by transferring the knowledge of
a large teacher model to a smaller student model. The student is small
enough to deploy in real-world scenarios, and thanks to the teacher it
can perform better than the same student model trained on its own.

In this tutorial, we show how to use ``MultiModalPredictor`` for
knowledge distillation. For demonstration, we use the
`Question-answering NLI <https://paperswithcode.com/dataset/qnli>`__
(QNLI) dataset, whose training split comprises 104,743 question-sentence
pairs sampled from question answering datasets. We will demonstrate how
a large model can guide the training of a small model in AutoGluon and
improve its performance.

Load Dataset
------------

The `Question-answering NLI <https://paperswithcode.com/dataset/qnli>`__
dataset contains question-sentence pairs in English. In the label
column of the Hugging Face ``glue``/``qnli`` dataset, ``0``
(``entailment``) means that the sentence contains the answer to the
question and ``1`` (``not_entailment``) means that it does not.

.. code:: python

    import datasets
    from datasets import load_dataset
    
    datasets.logging.disable_progress_bar()
    
    dataset = load_dataset("glue", "qnli")


.. parsed-literal::
    :class: output

    Downloading and preparing dataset glue/qnli to /home/ci/.cache/huggingface/datasets/glue/qnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...
    Dataset glue downloaded and prepared to /home/ci/.cache/huggingface/datasets/glue/qnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


.. code:: python

    dataset['train']


.. parsed-literal::
    :class: output

    Dataset({
        features: ['question', 'sentence', 'label', 'idx'],
        num_rows: 104743
    })


.. code:: python

    from sklearn.model_selection import train_test_split
    
    columns = ["question", "sentence", "label"]
    # Subsample 1,000 training rows to keep the demo fast, then split them 800/200.
    train_valid_df = dataset["train"].to_pandas()[columns].sample(1000, random_state=123)
    train_df, valid_df = train_test_split(train_valid_df, test_size=0.2, random_state=123)
    # Use 1,000 rows of the labeled GLUE validation split as the test set.
    test_df = dataset["validation"].to_pandas()[columns].sample(1000, random_state=123)

Load the Teacher Model
----------------------

In our example, we will directly load a teacher model with the
`google/bert_uncased_L-12_H-768_A-12 <https://huggingface.co/google/bert_uncased_L-12_H-768_A-12>`__
backbone that has been trained on QNLI and distill it into a student
model with the
`google/bert_uncased_L-6_H-768_A-12 <https://huggingface.co/google/bert_uncased_L-6_H-768_A-12>`__
backbone.

.. code:: python

    !wget --quiet https://automl-mm-bench.s3.amazonaws.com/unit-tests/distillation_sample_teacher.zip -O distillation_sample_teacher.zip
    !unzip -q -o distillation_sample_teacher.zip -d .

.. code:: python

    from autogluon.multimodal import MultiModalPredictor
    
    teacher_predictor = MultiModalPredictor.load("ag_distillation_sample_teacher/")


.. parsed-literal::
    :class: output

    /home/ci/opt/venv/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator LabelEncoder from version 1.0.2 when using version 1.1.3. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
    https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
      warnings.warn(
    /home/ci/opt/venv/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator StandardScaler from version 1.0.2 when using version 1.1.3. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
    https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
      warnings.warn(


Distill to Student
------------------

Training the student model is straightforward: pass the
``teacher_predictor`` argument when calling ``.fit()``. Internally, the
student is trained to match the teacher's predictions and feature maps.
This can perform better than directly finetuning the student.

.. code:: python

    student_predictor = MultiModalPredictor(label="label")
    student_predictor.fit(
        train_df,
        tuning_data=valid_df,
        teacher_predictor=teacher_predictor,
        hyperparameters={
            "model.hf_text.checkpoint_name": "google/bert_uncased_L-6_H-768_A-12",
            "optimization.max_epochs": 2,
        }
    )


.. parsed-literal::
    :class: output

    Global seed set to 123
    /home/ci/opt/venv/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `AUROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
      warnings.warn(*args, **kwargs)
    /home/ci/opt/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/parsing.py:268: UserWarning: Attribute 'softmax_regression_loss_func' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['softmax_regression_loss_func'])`.
      rank_zero_warn(
    Auto select gpus: [0]
    Using 16bit native Automatic Mixed Precision (AMP)
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    
      | Name                         | Type                         | Params
    ------------------------------------------------------------------------------
    0 | student_model                | HFAutoModelForTextPrediction | 67.0 M
    1 | teacher_model                | HFAutoModelForTextPrediction | 109 M 
    2 | validation_metric            | AUROC                        | 0     
    3 | hard_label_loss_func         | CrossEntropyLoss             | 0     
    4 | soft_label_loss_func         | CrossEntropyLoss             | 0     
    5 | softmax_regression_loss_func | MSELoss                      | 0     
    6 | output_feature_loss_func     | MSELoss                      | 0     
    7 | output_feature_adaptor       | Identity                     | 0     
    8 | rkd_loss_func                | RKDLoss                      | 0     
    ------------------------------------------------------------------------------
    176 M     Trainable params
    0         Non-trainable params
    176 M     Total params
    352.881   Total estimated model params size (MB)
    Epoch 0, global step 3: 'val_roc_auc' reached 0.63572 (best 0.63572), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014305/epoch=0-step=3.ckpt' as top 3
    Epoch 0, global step 7: 'val_roc_auc' reached 0.69998 (best 0.69998), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014305/epoch=0-step=7.ckpt' as top 3
    Epoch 1, global step 10: 'val_roc_auc' reached 0.70933 (best 0.70933), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014305/epoch=1-step=10.ckpt' as top 3
    Epoch 1, global step 14: 'val_roc_auc' reached 0.71219 (best 0.71219), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014305/epoch=1-step=14.ckpt' as top 3
    `Trainer.fit` stopped: `max_epochs=2` reached.


.. parsed-literal::
    :class: output

    <autogluon.multimodal.predictor.MultiModalPredictor at 0x7fda035f8280>
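
The training log above lists separate loss modules for hard labels,
soft labels, and output features. As a rough illustration of the first
two terms only, the following plain-Python sketch combines a hard-label
cross-entropy with a soft-label cross-entropy against
temperature-softened teacher probabilities. It is not AutoMM's
implementation: the function names, the ``temperature`` and
``soft_weight`` values, and the example logits are all hypothetical.

.. code:: python

    import math

    def softmax(logits, temperature=1.0):
        """Convert logits to probabilities, softened by the temperature."""
        scaled = [z / temperature for z in logits]
        peak = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    def cross_entropy(target_probs, predicted_probs):
        """Cross-entropy between a target and a predicted distribution."""
        return -sum(
            t * math.log(p)
            for t, p in zip(target_probs, predicted_probs)
            if t > 0
        )

    def distillation_loss(student_logits, teacher_logits, label,
                          temperature=2.0, soft_weight=0.5):
        """Weighted sum of a hard-label and a soft-label term (assumed form)."""
        # Hard-label term: student prediction vs. the ground-truth class.
        one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
        hard = cross_entropy(one_hot, softmax(student_logits))
        # Soft-label term: student vs. the teacher's softened distribution.
        soft = cross_entropy(
            softmax(teacher_logits, temperature),
            softmax(student_logits, temperature),
        )
        return (1.0 - soft_weight) * hard + soft_weight * soft

    # Hypothetical logits for one two-class example whose true label is 1.
    teacher_logits = [0.2, 2.5]
    agreeing_student = [0.1, 2.0]
    disagreeing_student = [2.0, 0.1]
    print(distillation_loss(agreeing_student, teacher_logits, label=1))
    print(distillation_loss(disagreeing_student, teacher_logits, label=1))

A student whose logits agree with the teacher and the label receives a
lower loss than one that disagrees, which is the signal that pulls the
student toward the teacher during training.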


.. code:: python

    print(student_predictor.evaluate(data=test_df))


.. parsed-literal::
    :class: output

    {'roc_auc': 0.7905329444571136}
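
To help interpret the ``roc_auc`` score, here is a minimal plain-Python
sketch of what the metric measures: the fraction of (positive, negative)
example pairs in which the positive example receives the higher score,
with ties counted as half. The ``roc_auc`` helper and the toy labels and
scores are hypothetical and only illustrate the definition. They are not
how ``evaluate`` computes the metric internally.

.. code:: python

    def roc_auc(labels, scores):
        """Fraction of (positive, negative) pairs ranked correctly."""
        positives = [s for y, s in zip(labels, scores) if y == 1]
        negatives = [s for y, s in zip(labels, scores) if y == 0]
        if not positives or not negatives:
            raise ValueError("need at least one positive and one negative example")
        correct = 0.0
        for p in positives:
            for n in negatives:
                if p > n:
                    correct += 1.0   # positive ranked above negative
                elif p == n:
                    correct += 0.5   # ties count as half
        return correct / (len(positives) * len(negatives))

    # Toy data: three of the four positive/negative pairs are ordered correctly.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))  # 0.75

A score of 0.5 corresponds to random ranking and 1.0 to a perfect
ranking, so the student's score above lies between the two.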


More about Knowledge Distillation
---------------------------------

To learn how to customize distillation and how it compares with direct
finetuning, see the distillation examples and README in `AutoMM
Distillation
Examples <https://github.com/autogluon/autogluon/tree/master/examples/automm/distillation>`__.
In particular, the `multilingual distillation
example <https://github.com/autogluon/autogluon/tree/master/examples/automm/distillation/automm_distillation_pawsx.py>`__
provides more details and customization options.

Other Examples
--------------

See `AutoMM
Examples <https://github.com/autogluon/autogluon/tree/master/examples/automm>`__
for more examples of AutoMM.

Customization
-------------

To learn how to customize AutoMM, please refer to
:ref:`sec_automm_customization`.