AutoMM Detection - High Performance Finetune on COCO Format Dataset

In this section, our goal is to finetune a high performance model on the VOC2007 training set and evaluate it on the VOC2007 test set. Both the training and test sets are in COCO format. See AutoMM Detection - Prepare Pascal VOC Dataset for how to prepare the VOC dataset, and Convert Data to COCO Format for how to convert other datasets to COCO format.

To start, let’s import MultiModalPredictor:

from autogluon.multimodal import MultiModalPredictor

We select VFNet with a ResNeXt-101 backbone, a Feature Pyramid Network (FPN) neck, and an input resolution of 640x640, pretrained on the COCO dataset. (The neck of an object detector refers to the additional layers between the backbone and the head; their role is to collect feature maps from different stages of the backbone.) This setting sacrifices training and inference speed and requires much more GPU memory, but it delivers high performance.

We use val_metric = map, i.e., mean average precision in the COCO standard, as our validation metric. In the previous section, AutoMM Detection - Fast Finetune on COCO Format Dataset, we did not specify a validation metric, so the validation loss was used by default. Validating on loss is much faster, but validating on mean average precision gives the best performance.

When using a COCO format dataset, the input is the json annotation file of the dataset split. In this example, train_cocoformat.json and test_cocoformat.json are the annotation files of the train and test splits of the VOC2007 dataset. We use all the GPUs (if any):

checkpoint_name = "vfnet_x101_64x4d_fpn_mdconv_c3-c5_mstrain_2x_coco"
num_gpus = -1  # use all GPUs
val_metric = "map"

train_path = "./VOCdevkit/VOC2007/Annotations/train_cocoformat.json"
test_path = "./VOCdevkit/VOC2007/Annotations/test_cocoformat.json"
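
Optionally, you can sanity check that an annotation file really follows the COCO format by loading it and inspecting its top-level keys; a valid COCO annotation file contains "images", "annotations", and "categories" entries. The snippet below is only an illustrative check using the train_path defined above:

import json

# Load the training annotation file and peek at its COCO-format structure.
with open(train_path) as f:
    coco_annotations = json.load(f)

print(coco_annotations.keys())             # expect "images", "annotations", "categories", ...
print(len(coco_annotations["images"]))     # number of images in this split
print(coco_annotations["categories"][:3])  # first few category entries, e.g. {"id": ..., "name": ...}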

We create the MultiModalPredictor with the selected checkpoint name, validation metric, and number of GPUs. We need to set the problem_type to "object_detection", and also provide a sample_data_path for the predictor to infer the categories of the dataset. Here we provide the train_path, but any other split of this dataset would also work.

predictor = MultiModalPredictor(
    hyperparameters={
        "model.mmdet_image.checkpoint_name": checkpoint_name,
        "env.num_gpus": num_gpus,
        "optimization.val_metric": val_metric,
    },
    problem_type="object_detection",
    sample_data_path=train_path,
)

If no data sample is available at this point, you can also create the MultiModalPredictor by providing the classes manually:

voc_classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
               "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
               "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
predictor = MultiModalPredictor(
    hyperparameters={
        "model.mmdet_image.checkpoint_name": checkpoint_name,
        "env.num_gpus": num_gpus,
        "optimization.val_metric": val_metric,
    },
    problem_type="object_detection",
    classes=voc_classes,
)

We set the learning rate to 1e-5 and the maximum number of epochs to 20 for this finetuning run. Note that a two-stage learning rate is used during finetuning by default, in which the model head gets a 100x larger learning rate than the rest of the network. Using a two-stage learning rate with a high learning rate only on the head layers makes the model converge faster during finetuning. It usually gives better performance as well, especially on small datasets with hundreds or thousands of images. We also set the per-GPU batch size to 1, because this model is too large to run with a larger batch size. Finally, we time the fit call to better understand the speed.

import time
start = time.time()
predictor.fit(
    train_path,
    hyperparameters={
        "optimization.learning_rate": 1e-5, # we use two stage and detection head has 100x lr
        "optimization.max_epochs": 20,
        "env.per_gpu_batch_size": 1,  # decrease it when model is large
    },
)
end = time.time()
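
As mentioned above, the two-stage learning rate with a 100x head multiplier is applied by default. If you want to control this behavior explicitly, the schedule choice and multiplier are exposed as optimization hyperparameters; the names below (optimization.lr_choice and optimization.lr_mult) are assumptions based on recent AutoGluon versions, so verify them against your version's documentation before relying on them:

# A minimal sketch of overriding the two-stage learning rate explicitly.
# "optimization.lr_choice" and "optimization.lr_mult" are assumed hyperparameter names.
predictor.fit(
    train_path,
    hyperparameters={
        "optimization.learning_rate": 1e-5,
        "optimization.lr_choice": "two_stages",  # two-stage schedule: head vs. rest of the network
        "optimization.lr_mult": 100,             # multiplier applied to the head learning rate
        "optimization.max_epochs": 20,
        "env.per_gpu_batch_size": 1,
    },
)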

We run it on a g5dn.12xlarge EC2 machine on AWS, and part of the command output is shown below:

Epoch 0:  50%|███████████████████████████████████████████▌                                           | 394/788 [07:42<07:42,  1.17s/it, loss=1.52, v_num=Epoch 0, global step 20: 'val_map' reached 0.61814 (best 0.61814), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_051558/epoch=0-step=20.ckpt' as top 1
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████| 788/788 [15:29<00:00,  1.18s/it, loss=0.982, v_num=Epoch 0, global step 41: 'val_map' reached 0.68742 (best 0.68742), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_051558/epoch=0-step=41.ckpt' as top 1
Epoch 1:  50%|████████████████████████████████████████████                                            | 394/788 [07:54<07:54,  1.20s/it, loss=0.879, v_numEpoch 1, global step 61: 'val_map' reached 0.70111 (best 0.70111), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_051558/epoch=1-step=61.ckpt' as top 1
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████| 788/788 [15:49<00:00,  1.21s/it, loss=0.759, v_num=Epoch 1, global step 82: 'val_map' reached 0.70580 (best 0.70580), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_051558/epoch=1-step=82.ckpt' as top 1
Epoch 2:  50%|████████████████████████████████████████████▌                                            | 394/788 [07:47<07:47,  1.19s/it, loss=1.11, v_num=Epoch 2, global step 102: 'val_map' was not in top 1
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████| 788/788 [15:29<00:00,  1.18s/it, loss=0.712, v_num=Epoch 2, global step 123: 'val_map' reached 0.71277 (best 0.71277), saving model to '/media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_051558/epoch=2-step=123.ckpt' as top 1
Epoch 3:  50%|████████████████████████████████████████████▌                                            | 394/788 [07:38<07:38,  1.16s/it, loss=1.07, v_num=Epoch 3, global step 143: 'val_map' was not in top 1

Notice that at the end of each progress bar, if the checkpoint at the current stage is saved, the model's save path is printed. In this example, it is /media/code/autogluon/examples/automm/object_detection/AutogluonModels/ag-20221104_051558. You can also specify the save_path as below when creating the MultiModalPredictor.

predictor = MultiModalPredictor(
    save_path="./this_is_a_save_path",
    ...
)

Printing out the elapsed time, we can see that it takes almost 5 hours.

print("This finetuning takes %.2f seconds." % (end - start))
This finetuning takes 17779.09 seconds.

It does take a lot of time, but let's look at the performance. To evaluate the model we just trained, run:

predictor.evaluate(test_path)

The evaluation results are shown in the command line output. The first value, 0.740, is the mAP in the COCO standard, and the second one, 0.932, is the mAP in the VOC standard (also called mAP50). For more details about these metrics, see COCO's evaluation guideline.

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.740
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.932
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.819
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.483
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.792
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.569
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.811
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.827
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.603
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.754
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.866
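
Since the finetuned predictor is saved to disk (at the path printed during training, or the save_path you specified), you can reload it later without retraining. MultiModalPredictor.load, evaluate, and predict are standard AutoGluon APIs; the path below is just the placeholder save path from earlier, so substitute your own:

# Reload the finetuned predictor and run inference on the test split.
new_predictor = MultiModalPredictor.load("./this_is_a_save_path")  # replace with your actual save path
new_predictor.evaluate(test_path)        # should reproduce the metrics above
pred = new_predictor.predict(test_path)  # per-image predictions: detected classes, bounding boxes, and scores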

Under this high performance finetune setting, it took nearly 5 hours but reached mAP50 = 0.932 on VOC! For how to finetune faster, see AutoMM Detection - Fast Finetune on COCO Format Dataset, where we finetuned a YOLOv3 model in about 100 seconds and reached mAP50 = 0.755 on VOC.

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.

Citation

@article{DBLP:journals/corr/abs-2008-13367,
  author    = {Haoyang Zhang and
               Ying Wang and
               Feras Dayoub and
               Niko S{\"{u}}nderhauf},
  title     = {VarifocalNet: An IoU-aware Dense Object Detector},
  journal   = {CoRR},
  volume    = {abs/2008.13367},
  year      = {2020},
  url       = {https://arxiv.org/abs/2008.13367},
  eprinttype = {arXiv},
  eprint    = {2008.13367},
  timestamp = {Wed, 16 Sep 2020 11:20:03 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2008-13367.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}