AutoMM Detection - Fast Finetune on COCO Format Dataset

In this section, our goal is to fast finetune a pretrained model on the VOC2007 training set and evaluate it on the VOC2007 test set. Both the training and test sets are in COCO format. See AutoMM Detection - Prepare Pascal VOC Dataset for how to prepare the VOC dataset, and Convert Data to COCO Format for how to convert other datasets to COCO format.

To start, let’s import MultiModalPredictor:

from autogluon.multimodal import MultiModalPredictor

We select YOLOv3 with a MobileNetV2 backbone and a 320x320 input resolution, pretrained on the COCO dataset. With this setting, finetuning and inference are fast, and the model is easy to deploy. When using a COCO format dataset, the input is the JSON annotation file of the dataset split. In this example, train_cocoformat.json and test_cocoformat.json are the annotation files of the train and test splits of the VOC2007 dataset. We also use all GPUs (if any):

checkpoint_name = "yolov3_mobilenetv2_320_300e_coco"
num_gpus = -1  # use all GPUs

train_path = "./VOCdevkit/VOC2007/Annotations/train_cocoformat.json"
test_path = "./VOCdevkit/VOC2007/Annotations/test_cocoformat.json"

We create the MultiModalPredictor with the selected checkpoint name and number of GPUs. We need to specify problem_type="object_detection", and also provide a sample_data_path for the predictor to infer the categories of the dataset. Here we provide train_path, but any other split of this dataset works as well.

predictor = MultiModalPredictor(
    hyperparameters={
        "model.mmdet_image.checkpoint_name": checkpoint_name,
        "env.num_gpus": num_gpus,
    },
    problem_type="object_detection",
    sample_data_path=train_path,
)

If no data sample is available at this point, you can also create the MultiModalPredictor by providing the classes manually:

voc_classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
predictor = MultiModalPredictor(
    hyperparameters={
        "model.mmdet_image.checkpoint_name": checkpoint_name,
        "env.num_gpus": num_gpus,
    },
    problem_type="object_detection",
    classes=voc_classes,
)

We set the learning rate to 1e-4. Note that we use a two-stage learning rate option during finetuning by default, in which the model head gets a 100x learning rate. Using a two-stage learning rate with a high learning rate only on the head layers makes the model converge faster during finetuning. It usually gives better performance as well, especially on small datasets with hundreds or thousands of images. We also set the number of epochs to 5 for fast finetuning and the per-GPU batch size to 32. In addition, we time the fit process to get a better sense of the speed.

import time
start = time.time()
predictor.fit(
    train_path,
    hyperparameters={
        "optimization.learning_rate": 1e-4, # we use two stage and detection head has 100x lr
        "optimization.max_epochs": 5,
        "env.per_gpu_batch_size": 32,  # decrease it when model is large
    },
)
end = time.time()
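
If you want to adjust the two-stage behavior itself, the learning rate schedule and the head multiplier can also be overridden through hyperparameters. The sketch below is only illustrative: the "optimization.lr_choice" and "optimization.lr_mult" names follow the Customize AutoMM documentation and may differ across AutoGluon versions.

# Sketch: explicitly setting the two-stage learning rate.
# "optimization.lr_choice" and "optimization.lr_mult" are taken from the Customize AutoMM docs;
# verify the exact names for your AutoGluon version.
predictor.fit(
    train_path,
    hyperparameters={
        "optimization.learning_rate": 1e-4,
        "optimization.lr_choice": "two_stages",  # high lr only on the head layers
        "optimization.lr_mult": 100,             # head lr = 100 x base lr
        "optimization.max_epochs": 5,
        "env.per_gpu_batch_size": 32,
    },
)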

We run it on a g5dn.12xlarge EC2 machine on AWS; part of the command-line output is shown below:

Epoch 0:  98%|██████████████████████████████████████████████████████████████████████████████████████████▏ | 50/51 [00:15<00:00,  3.19it/s, loss=766, v_num=Epoch 0, global step 40: 'val_direct_loss' reached 555.37537 (best 555.37537), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_185342/epoch=0-step=40.ckpt' as top 1
Epoch 1:  49%|█████████████████████████████████████████████                                               | 25/51 [00:08<00:08,  3.01it/s, loss=588, v_num=Epoch 1, global step 61: 'val_direct_loss' reached 499.56232 (best 499.56232), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_185342/epoch=1-step=61.ckpt' as top 1
Epoch 1:  98%|██████████████████████████████████████████████████████████████████████████████████████████▏ | 50/51 [00:15<00:00,  3.17it/s, loss=554, v_num=Epoch 1, global step 81: 'val_direct_loss' reached 481.33121 (best 481.33121), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_185342/epoch=1-step=81.ckpt' as top 1
Epoch 2:  49%|█████████████████████████████████████████████                                               | 25/51 [00:08<00:08,  2.99it/s, loss=539, v_num=Epoch 2, global step 102: 'val_direct_loss' reached 460.25449 (best 460.25449), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_185342/epoch=2-step=102.ckpt' as top 1
Epoch 2:  98%|██████████████████████████████████████████████████████████████████████████████████████████▏ | 50/51 [00:15<00:00,  3.15it/s, loss=539, v_num=Epoch 2, global step 122: 'val_direct_loss' was not in top 1
Epoch 3:  49%|█████████████████████████████████████████████                                               | 25/51 [00:08<00:08,  2.96it/s, loss=533, v_num=Epoch 3, global step 143: 'val_direct_loss' was not in top 1
Epoch 3:  88%|█████████████████████████████████████████████████████████████████████████████████▏          | 45/51 [00:14<00:01,  3.17it/s, loss=508, v_num=]

Notice that at the end of each progress bar, if the checkpoint at the current stage is saved, it prints the model’s save path. In this example, it’s /media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_185342. You can also specify the save_path as shown below when creating the MultiModalPredictor.

predictor = MultiModalPredictor(
    save_path="./this_is_a_save_path",
    ...
)
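
Later, you can load the finetuned predictor back from its save path:

# Load a previously finetuned predictor from its save path.
predictor = MultiModalPredictor.load("./this_is_a_save_path")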

Printing out the elapsed time, we can see that it only takes 100.42 seconds!

print("This finetuning takes %.2f seconds." % (end - start))
This finetuning takes 100.42 seconds.

To evaluate the model we just trained, run:

predictor.evaluate(test_path)

The evaluation results are shown in the command line output. The first value, 0.375, is the mAP under the COCO standard, and the second value, 0.755, is the mAP under the VOC standard (also known as mAP50). For more details about these metrics, see COCO’s evaluation guideline.

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.375
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.755
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.311
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.111
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.230
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.431
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.355
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.505
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.515
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.258
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.415
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.556
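
The evaluate call also returns the metrics as a dictionary, so you can use them programmatically. This is only a minimal sketch; the exact keys of the returned dictionary may vary across AutoGluon versions.

# Sketch: capturing the evaluation results instead of only printing them.
# The exact metric keys in the returned dict depend on the AutoGluon version.
eval_results = predictor.evaluate(test_path)
print(eval_results)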

Under this fast finetune setting, we reached mAP50 = 0.755 on VOC in about 100 seconds! For how to finetune with higher performance, see AutoMM Detection - High Performance Finetune on COCO Format Dataset, where we finetuned a VFNet model for 5 hours and reached mAP50 = 0.932 on VOC.

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.

Citation

@misc{redmon2018yolov3,
    title={YOLOv3: An Incremental Improvement},
    author={Joseph Redmon and Ali Farhadi},
    year={2018},
    eprint={1804.02767},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}