Hyperparameter Optimization in AutoMM¶
Hyperparameter optimization (HPO) addresses the challenge of tuning the hyperparameters of machine learning models. ML algorithms have multiple complex hyperparameters that generate an enormous search space, and the search space in deep learning is even larger than in traditional ML algorithms. Tuning over a massive search space is a tough challenge, but AutoMM provides various options for guiding the fitting process based on your domain knowledge and your computing-resource constraints.
Create Image Dataset¶
In this tutorial, we again use the subset of the Shopee-IET dataset from Kaggle for demonstration purposes. Each image contains a clothing item, and the corresponding label specifies its clothing category. Our subset of the data contains the following possible labels: BabyPants, BabyShirt, womencasualshoes, womenchiffontop.
We can load the dataset by downloading it from a URL automatically:
import warnings
warnings.filterwarnings('ignore')

from datetime import datetime
from autogluon.multimodal.utils.misc import shopee_dataset

download_dir = './ag_automm_tutorial_hpo'
train_data, test_data = shopee_dataset(download_dir)
# Subsample half of the training data to shorten the demo's fitting time.
train_data = train_data.sample(frac=0.5)
print(train_data)
Downloading ./ag_automm_tutorial_hpo/file.zip from https://automl-mm-bench.s3.amazonaws.com/vision_datasets/shopee.zip...
100%|██████████| 41.9M/41.9M [00:00<00:00, 64.2MiB/s]
image label
703 /home/ci/autogluon/docs/_build/eval/tutorials/... 3
463 /home/ci/autogluon/docs/_build/eval/tutorials/... 2
455 /home/ci/autogluon/docs/_build/eval/tutorials/... 2
502 /home/ci/autogluon/docs/_build/eval/tutorials/... 2
119 /home/ci/autogluon/docs/_build/eval/tutorials/... 0
.. ... ...
187 /home/ci/autogluon/docs/_build/eval/tutorials/... 0
447 /home/ci/autogluon/docs/_build/eval/tutorials/... 2
217 /home/ci/autogluon/docs/_build/eval/tutorials/... 1
788 /home/ci/autogluon/docs/_build/eval/tutorials/... 3
630 /home/ci/autogluon/docs/_build/eval/tutorials/... 3
[400 rows x 2 columns]
There are 400 data points in total in this dataset. The image column stores the path to the actual image file, and the label column holds the label class.
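Since the image column holds file paths, you can inspect an example image directly; a minimal sketch using Pillow (assuming it is installed in your environment, as AutoMM's image models depend on it):
from PIL import Image

# Load and display the first training image via its stored file path.
example_path = train_data.iloc[0]["image"]
Image.open(example_path).show()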
The Regular Model Fitting¶
Recall that if we use the default settings predefined by AutoGluon, we can fit a model using MultiModalPredictor with just a few lines of code:
from autogluon.multimodal import MultiModalPredictor
predictor_regular = MultiModalPredictor(label="label")
start_time = datetime.now()
predictor_regular.fit(
    train_data=train_data,
    hyperparameters={"model.timm_image.checkpoint_name": "ghostnet_100"},
)
end_time = datetime.now()
elapsed_seconds = (end_time - start_time).total_seconds()
minutes, seconds = divmod(elapsed_seconds, 60)
print("Total fitting time: ", f"{int(minutes)}m{int(seconds)}s")
Global seed set to 123
Downloading: "https://github.com/huawei-noah/CV-backbones/releases/download/ghostnet_pth/ghostnet_1x.pth" to /home/ci/.cache/torch/hub/checkpoints/ghostnet_1x.pth
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                            | Params
----------------------------------------------------------------------
0 | model             | TimmAutoModelForImagePrediction | 3.9 M
1 | validation_metric | Accuracy                        | 0
2 | loss_func         | CrossEntropyLoss                | 0
----------------------------------------------------------------------
3.9 M     Trainable params
0         Non-trainable params
3.9 M     Total params
7.813     Total estimated model params size (MB)
Epoch 0, global step 1: 'val_accuracy' reached 0.30000 (best 0.30000), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=0-step=1.ckpt' as top 3
Epoch 0, global step 3: 'val_accuracy' reached 0.31250 (best 0.31250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=0-step=3.ckpt' as top 3
Epoch 1, global step 4: 'val_accuracy' reached 0.37500 (best 0.37500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=1-step=4.ckpt' as top 3
Epoch 1, global step 6: 'val_accuracy' reached 0.42500 (best 0.42500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=1-step=6.ckpt' as top 3
Epoch 2, global step 7: 'val_accuracy' reached 0.47500 (best 0.47500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=2-step=7.ckpt' as top 3
Epoch 2, global step 9: 'val_accuracy' reached 0.51250 (best 0.51250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=2-step=9.ckpt' as top 3
Epoch 3, global step 10: 'val_accuracy' reached 0.47500 (best 0.51250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=3-step=10.ckpt' as top 3
Epoch 3, global step 12: 'val_accuracy' reached 0.53750 (best 0.53750), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=3-step=12.ckpt' as top 3
Epoch 4, global step 13: 'val_accuracy' reached 0.56250 (best 0.56250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=4-step=13.ckpt' as top 3
Epoch 4, global step 15: 'val_accuracy' reached 0.56250 (best 0.56250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=4-step=15.ckpt' as top 3
Epoch 5, global step 16: 'val_accuracy' reached 0.55000 (best 0.56250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=5-step=16.ckpt' as top 3
Epoch 5, global step 18: 'val_accuracy' reached 0.56250 (best 0.56250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=5-step=18.ckpt' as top 3
Epoch 6, global step 19: 'val_accuracy' reached 0.57500 (best 0.57500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=6-step=19.ckpt' as top 3
Epoch 6, global step 21: 'val_accuracy' was not in top 3
Epoch 7, global step 22: 'val_accuracy' reached 0.57500 (best 0.57500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=7-step=22.ckpt' as top 3
Epoch 7, global step 24: 'val_accuracy' reached 0.58750 (best 0.58750), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=7-step=24.ckpt' as top 3
Epoch 8, global step 25: 'val_accuracy' reached 0.58750 (best 0.58750), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=8-step=25.ckpt' as top 3
Epoch 8, global step 27: 'val_accuracy' was not in top 3
Epoch 9, global step 28: 'val_accuracy' reached 0.60000 (best 0.60000), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014013/epoch=9-step=28.ckpt' as top 3
Epoch 9, global step 30: 'val_accuracy' was not in top 3
Trainer.fit stopped: max_epochs=10 reached.
Total fitting time: 0m46s
Let’s check out the test accuracy of the fitted model:
scores = predictor_regular.evaluate(test_data, metrics=["accuracy"])
print('Top-1 test acc: %.3f' % scores["accuracy"])
Top-1 test acc: 0.575
Use HPO During Model Fitting¶
If you would like more control over the fitting process, you can specify various options for hyperparameter optimization (HPO) in MultiModalPredictor by simply adding more options in hyperparameters and hyperparameter_tune_kwargs.
There are a few options we can set in MultiModalPredictor. We use the Ray Tune library in the backend, so we need to pass in either a Tune search space or an AutoGluon search space, which will be converted to a Tune search space (see the sketch after the following example).
Defining the search space of various hyperparameter values for the training of neural networks:
hyperparameters = {
    "optimization.learning_rate": tune.uniform(0.00005, 0.005),
    "optimization.optim_type": tune.choice(["adamw", "sgd"]),
    "optimization.max_epochs": tune.choice([10, 20]),
    "model.timm_image.checkpoint_name": tune.choice(["swin_base_patch4_window7_224", "convnext_base_in22ft1k"]),
}
This is an example but not an exhaustive list. You can find the full supported list in Customize AutoMM.
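If you prefer not to import Ray Tune directly, the same space can be written with AutoGluon's search-space classes, which AutoMM converts to Tune search spaces internally; a minimal sketch, assuming an AutoGluon version that ships autogluon.common.space:
from autogluon.common import space

# Equivalent search space expressed with AutoGluon's search-space classes.
hyperparameters = {
    "optimization.learning_rate": space.Real(0.00005, 0.005),
    "optimization.optim_type": space.Categorical("adamw", "sgd"),
    "optimization.max_epochs": space.Categorical(10, 20),
    "model.timm_image.checkpoint_name": space.Categorical(
        "swin_base_patch4_window7_224", "convnext_base_in22ft1k"
    ),
}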
Defining the search strategy for HPO with hyperparameter_tune_kwargs. You can pass in a string or initialize a ray.tune.schedulers.TrialScheduler object (see the sketch after this list).
Specifying how to search through your chosen hyperparameter space (supports "random" and "bayes"):
"searcher": "bayes"
Specifying how to schedule jobs to train a network under a particular hyperparameter configuration (supports "FIFO" and "ASHA"):
"scheduler": "ASHA"
Specifying the number of HPO trials you would like to carry out:
"num_trials": 20
Let’s work on HPO with combinations of different learning rates and backbone models:
from ray import tune
predictor_hpo = MultiModalPredictor(label="label")
hyperparameters = {
    "optimization.learning_rate": tune.uniform(0.00005, 0.001),
    "model.timm_image.checkpoint_name": tune.choice(["ghostnet_100", "mobilenetv3_large_100"]),
}
hyperparameter_tune_kwargs = {
    "searcher": "bayes",  # or "random"
    "scheduler": "ASHA",
    "num_trials": 2,
}
start_time_hpo = datetime.now()
predictor_hpo.fit(
    train_data=train_data,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
)
end_time_hpo = datetime.now()
elapsed_seconds_hpo = (end_time_hpo - start_time_hpo).total_seconds()
minutes_hpo, seconds_hpo = divmod(elapsed_seconds_hpo, 60)
print("Total fitting time: ", f"{int(minutes_hpo)}m{int(seconds_hpo)}s")
Global seed set to 123
/home/ci/opt/venv/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py:642: DeprecationWarning: checkpoint_dir in func(config, checkpoint_dir) is being deprecated. To save and load checkpoint in trainable functions, please use the ray.air.session API:

from ray.air import session

def train(config):
    # ...
    session.report({"metric": metric}, checkpoint=checkpoint)

For more information please see https://docs.ray.io/en/master/ray-air/key-concepts.html#session
  warnings.warn(

== Status ==
Current time: 2022-12-13 01:42:03 (running for 00:00:58.28)
Memory usage on this node: 7.0/31.0 GiB
Using AsyncHyperBand: num_stopped=1 Bracket: Iter 4096.000: None | Iter 1024.000: None | Iter 256.000: None | Iter 64.000: None | Iter 16.000: 0.824999988079071 | Iter 4.000: 0.6937499791383743 | Iter 1.000: 0.25
Resources requested: 0/8 CPUs, 0/1 GPUs, 0.0/13.64 GiB heap, 0.0/6.82 GiB objects (0.0/1.0 accelerator_type:T4)
Current best trial: 351ac716 with val_accuracy=0.824999988079071 and parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}
Result logdir: /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221213_014102
Number of trials: 2/2 (2 TERMINATED)
| Trial name | status     | loc             | model.timm_image....  | optimization.lear... | iter | total time (s) | val_accuracy |
|------------|------------|-----------------|-----------------------|----------------------|------|----------------|--------------|
| 351ac716   | TERMINATED | 10.0.0.161:2409 | ghostnet_100          | 0.000579447          | 19   | 37.6808        | 0.825        |
| 3d4a3908   | TERMINATED | 10.0.0.161:2409 | mobilenetv3_lar_9080  | 0.00029918           | 4    | 6.55206        | 0.6375       |
Trial 351ac716 reported val_accuracy=0.25 with parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
Trial 351ac716 reported val_accuracy=0.71 with parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
Trial 351ac716 reported val_accuracy=0.77 with parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
Trial 351ac716 reported val_accuracy=0.82 with parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
Trial 351ac716 reported val_accuracy=0.81 with parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
Trial 351ac716 reported val_accuracy=0.82 with parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
Trial 351ac716 reported val_accuracy=0.82 with parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
Trial 351ac716 completed. Last result: val_accuracy=0.824999988079071,should_checkpoint=True
Trial 3d4a3908 reported val_accuracy=0.25 with parameters={'optimization.learning_rate': 0.0002991795154254725, 'model.timm_image.checkpoint_name': 'mobilenetv3_large_100'}.
Trial 3d4a3908 reported val_accuracy=0.64 with parameters={'optimization.learning_rate': 0.0002991795154254725, 'model.timm_image.checkpoint_name': 'mobilenetv3_large_100'}. This trial completed.
Total fitting time: 1m6s
Let’s check out the test accuracy of the fitted model after HPO:
scores_hpo = predictor_hpo.evaluate(test_data, metrics=["accuracy"])
print('Top-1 test acc: %.3f' % scores_hpo["accuracy"])
Top-1 test acc: 0.850
From the training log, you should be able to see the current best trial, as below:
Current best trial: 351ac716 with val_accuracy=0.824999988079071 and parameters={'optimization.learning_rate': 0.000579447039503473, 'model.timm_image.checkpoint_name': 'ghostnet_100'}
After our simple 2-trial HPO run, we got a better test accuracy than the out-of-the-box solution in the previous section by searching over different learning rates and models. HPO helps select the combination of hyperparameters with the highest validation accuracy.
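The fitted HPO predictor keeps the model from the best trial, so inference works like any regular predictor; a minimal sketch:
# Predict clothing categories for the test images using the best
# model/learning-rate combination found during HPO.
predictions = predictor_hpo.predict(test_data.drop(columns=["label"]))
print(predictions.head())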
Other Examples¶
You may go to AutoMM Examples to explore other examples about AutoMM.
Customization¶
To learn how to customize AutoMM, please refer to Customize AutoMM.