Object Detection - Quick Start

Object detection is the process of identifying and localizing objects in an image and is an important task in computer vision. Follow this tutorial to learn how to use AutoGluon for object detection.

Tip: If you are new to AutoGluon, review Image Prediction - Quick Start first to learn the basics of the AutoGluon API.

Our goal is to detect motorbikes in images using a YOLOv3 model. A tiny dataset containing only the motorbike category is collected from the VOC dataset. A model pretrained on the COCO dataset is fine-tuned on this small dataset. With the help of AutoGluon, we can try many models with different hyperparameters automatically and return the best one as our final model.

To start, import ObjectDetector:

from autogluon.vision import ObjectDetector

Tiny_motorbike Dataset

We collect a toy dataset for detecting motorbikes in images. From the VOC dataset, 120 images are randomly selected for training, 50 for validation, and 50 for testing. This tiny dataset follows the same format as VOC.

Using the commands below, we can download the dataset, which is only 23 MB. The unzipped folder is named tiny_motorbike. In fact, the dataset helper performs the download and extraction automatically, and loads the dataset in the detection format.

url = 'https://autogluon.s3.amazonaws.com/datasets/tiny_motorbike.zip'
dataset_train = ObjectDetector.Dataset.from_voc(url, splits='trainval')
tiny_motorbike/
├── Annotations/
├── ImageSets/
└── JPEGImages/

Fit Models by AutoGluon

In this section, we demonstrate how to apply AutoGluon to fit our detection models. AutoGluon samples a model configuration for each trial (in the run below, an SSD model and a YOLOv3 model, as the logs show) and fine-tunes each candidate. The best model is the one that obtains the best performance on the validation dataset. You can also try more networks and hyperparameters to create a larger search space.

We fit a detector using AutoGluon as follows. In each experiment (one trial in our search space), we train the model for 5 epochs to keep the tutorial runtime short.

time_limit = 60*30  # at most 0.5 hour
detector = ObjectDetector()
hyperparameters = {'epochs': 5, 'batch_size': 8}
hyperparameter_tune_kwargs = {'num_trials': 2}
detector.fit(dataset_train, time_limit=time_limit, hyperparameters=hyperparameters, hyperparameter_tune_kwargs=hyperparameter_tune_kwargs)
The number of requested GPUs is greater than the number of available GPUs.Reduce the number to 1
Randomly split train_data into train[154]/validation[16] splits.
Starting HPO experiments
  0%|          | 0/2 [00:00<?, ?it/s]
modified configs(<old> != <new>): {
root.dataset_root    ~/.mxnet/datasets/ != auto
root.num_workers     4 != 8
root.ssd.data_shape  300 != 512
root.ssd.base_network vgg16_atrous != resnet50_v1
root.valid.batch_size 16 != 8
root.dataset         voc_tiny != auto
root.gpus            (0, 1, 2, 3) != (0,)
root.train.early_stop_patience -1 != 10
root.train.seed      233 != 160
root.train.batch_size 16 != 8
root.train.epochs    20 != 5
root.train.early_stop_baseline 0.0 != -inf
root.train.early_stop_max_value 1.0 != inf
}
Saved config to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/e3c2ce39/.trial_0/config.yaml
Using transfer learning from ssd_512_resnet50_v1_coco, the other network parameters are ignored.
Start training from [Epoch 0]
[Epoch 0] Training cost: 9.324861, CrossEntropy=3.524562, SmoothL1=1.037478
[Epoch 0] Validation:
boat=nan
motorbike=0.5953810894987366
pottedplant=nan
cow=nan
bus=nan
car=0.36363636363636365
chair=0.0
dog=nan
person=0.5108189351540859
bicycle=0.0
mAP=0.2939672776578372
[Epoch 0] Current best map: 0.293967 vs previous 0.000000, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/e3c2ce39/.trial_0/best_checkpoint.pkl
[Epoch 1] Training cost: 8.550749, CrossEntropy=2.753219, SmoothL1=1.178991
[Epoch 1] Validation:
boat=nan
motorbike=0.6883247383247383
pottedplant=nan
cow=nan
bus=nan
car=0.374331550802139
chair=0.0
dog=nan
person=0.6195196027128799
bicycle=0.0
mAP=0.33643517836795145
[Epoch 1] Current best map: 0.336435 vs previous 0.293967, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/e3c2ce39/.trial_0/best_checkpoint.pkl
[Epoch 2] Training cost: 8.122622, CrossEntropy=2.426636, SmoothL1=1.085962
[Epoch 2] Validation:
boat=nan
motorbike=0.6880695062513243
pottedplant=nan
cow=nan
bus=nan
car=1.0000000000000002
chair=0.0
dog=nan
person=0.5112314585998796
bicycle=0.0
mAP=0.43986019297024087
[Epoch 2] Current best map: 0.439860 vs previous 0.336435, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/e3c2ce39/.trial_0/best_checkpoint.pkl
[Epoch 3] Training cost: 8.737679, CrossEntropy=2.529397, SmoothL1=1.114671
[Epoch 3] Validation:
boat=nan
motorbike=0.7559922533606743
pottedplant=nan
cow=nan
bus=nan
car=1.0000000000000002
chair=0.0
dog=nan
person=0.5920553269973441
bicycle=0.0
mAP=0.4696095160716037
[Epoch 3] Current best map: 0.469610 vs previous 0.439860, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/e3c2ce39/.trial_0/best_checkpoint.pkl
[Epoch 4] Training cost: 8.337297, CrossEntropy=2.329255, SmoothL1=1.036121
[Epoch 4] Validation:
boat=nan
motorbike=0.6208010990619687
pottedplant=nan
cow=nan
bus=nan
car=1.0000000000000002
chair=0.0
dog=nan
person=0.46934081070444705
bicycle=0.0
mAP=0.4180283819532832
Applying the state from the best checkpoint...
modified configs(<old> != <new>): {
root.dataset_root    ~/.mxnet/datasets/ != auto
root.num_workers     4 != 8
root.valid.batch_size 16 != 8
root.dataset         voc_tiny != auto
root.gpus            (0, 1, 2, 3) != (0,)
root.train.batch_size 16 != 8
root.train.early_stop_baseline 0.0 != -inf
root.train.early_stop_max_value 1.0 != inf
root.train.epochs    20 != 5
root.train.early_stop_patience -1 != 10
root.train.seed      233 != 160
}
Saved config to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/e3c2ce39/.trial_1/config.yaml
Using transfer learning from yolo3_darknet53_coco, the other network parameters are ignored.
Start training from [Epoch 0]
[Epoch 0] Training cost: 11.758, ObjLoss=8.384, BoxCenterLoss=7.476, BoxScaleLoss=2.623, ClassLoss=4.206
[Epoch 0] Validation:
boat=nan
motorbike=0.3942629760811579
pottedplant=nan
cow=nan
bus=nan
car=0.7922077922077924
chair=0.0
dog=nan
person=0.7614583558856004
bicycle=0.33333333333333326
mAP=0.45625249150157676
[Epoch 0] Current best map: 0.456252 vs previous 0.000000, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/e3c2ce39/.trial_1/best_checkpoint.pkl
[Epoch 1] Training cost: 13.021, ObjLoss=9.645, BoxCenterLoss=7.930, BoxScaleLoss=2.936, ClassLoss=3.713
[Epoch 1] Validation:
boat=nan
motorbike=0.339339455357762
pottedplant=nan
cow=nan
bus=nan
car=1.0000000000000002
chair=0.0
dog=nan
person=0.2011157856228279
bicycle=0.0
mAP=0.308091048196118
[Epoch 2] Training cost: 16.960, ObjLoss=10.073, BoxCenterLoss=8.048, BoxScaleLoss=3.185, ClassLoss=3.433
[Epoch 2] Validation:
boat=nan
motorbike=0.5557914614657429
pottedplant=nan
cow=nan
bus=nan
car=1.0000000000000002
chair=0.0
dog=nan
person=0.5334890154408871
bicycle=0.0
mAP=0.41785609538132606
[Epoch 3] Training cost: 7.286, ObjLoss=9.740, BoxCenterLoss=7.960, BoxScaleLoss=3.221, ClassLoss=3.281
[Epoch 3] Validation:
boat=nan
motorbike=0.295438147710875
pottedplant=nan
cow=nan
bus=nan
car=0.6363636363636365
chair=0.0
dog=nan
person=0.754971039061948
bicycle=0.0
mAP=0.33735456462729185
[Epoch 4] Training cost: 14.808, ObjLoss=9.630, BoxCenterLoss=7.853, BoxScaleLoss=3.121, ClassLoss=3.040
[Epoch 4] Validation:
boat=nan
motorbike=0.6759823848238482
pottedplant=nan
cow=nan
bus=nan
car=0.8409090909090906
chair=0.0
dog=nan
person=0.5761616161616162
bicycle=0.0
mAP=0.418610618378911
Applying the state from the best checkpoint...
Finished, total runtime is 160.82 s
{ 'best_config': { 'dataset': 'auto',
                   'dataset_root': 'auto',
                   'estimator': <class 'gluoncv.auto.estimators.yolo.yolo.YOLOv3Estimator'>,
                   'gpus': [0],
                   'horovod': False,
                   'num_workers': 8,
                   'resume': '',
                   'save_interval': 10,
                   'save_prefix': '',
                   'train': { 'batch_size': 8,
                              'early_stop_baseline': -inf,
                              'early_stop_max_value': inf,
                              'early_stop_min_delta': 0.001,
                              'early_stop_patience': 10,
                              'epochs': 5,
                              'label_smooth': False,
                              'log_interval': 100,
                              'lr': 0.001,
                              'lr_decay': 0.1,
                              'lr_decay_epoch': (160, 180),
                              'lr_decay_period': 0,
                              'lr_mode': 'step',
                              'mixup': False,
                              'momentum': 0.9,
                              'no_mixup_epochs': 20,
                              'no_wd': False,
                              'num_samples': -1,
                              'seed': 160,
                              'start_epoch': 0,
                              'warmup_epochs': 0,
                              'warmup_lr': 0.0,
                              'wd': 0.0005},
                   'valid': { 'batch_size': 8,
                              'iou_thresh': 0.5,
                              'metric': 'voc07',
                              'val_interval': 1},
                   'yolo3': { 'amp': False,
                              'anchors': ( [10, 13, 16, 30, 33, 23],
                                           [30, 61, 62, 45, 59, 119],
                                           [116, 90, 156, 198, 373, 326]),
                              'base_network': 'darknet53',
                              'data_shape': 416,
                              'filters': (512, 256, 128),
                              'nms_thresh': 0.45,
                              'nms_topk': 400,
                              'no_random_shape': False,
                              'strides': (8, 16, 32),
                              'syncbn': False,
                              'transfer': 'yolo3_darknet53_coco'}},
  'total_time': 160.82177448272705,
  'train_map': 0.6521661560767187,
  'valid_map': 0.45625249150157676}
<autogluon.vision.detector.detector.ObjectDetector at 0x7f97b94e8650>

Note that num_trials=2 above is only used to speed up the tutorial. In normal practice, it is common to use only time_limit and drop num_trials. Also note that hyperparameter tuning defaults to random search; model-based variants, such as searcher='bayesopt' in hyperparameter_tune_kwargs, can be much more sample-efficient.
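As a sketch of that advice (the exact set of accepted keys depends on your AutoGluon version, so treat this as illustrative rather than authoritative), a longer search bounded only by time_limit might look like this:

```python
# Illustrative tuning setup -- the key names follow the ones used above;
# check your AutoGluon version's documentation for the supported searchers.
hyperparameters = {'epochs': 10, 'batch_size': 8}
hyperparameter_tune_kwargs = {
    'searcher': 'bayesopt',  # model-based search instead of the default random search
    # no 'num_trials' here: let time_limit alone bound the search
}
time_limit = 60 * 60  # at most 1 hour

# detector.fit(dataset_train, time_limit=time_limit,
#              hyperparameters=hyperparameters,
#              hyperparameter_tune_kwargs=hyperparameter_tune_kwargs)
```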

After fitting, AutoGluon automatically returns the best model among all models in the search space. From the output, we know the best model is the one from the second trial (YOLOv3 with a darknet53 backbone, transferred from yolo3_darknet53_coco). To see how well the returned model performs on the test dataset, call detector.evaluate().

dataset_test = ObjectDetector.Dataset.from_voc(url, splits='test')

test_map = detector.evaluate(dataset_test)
print("mAP on test dataset: {}".format(test_map[1][-1]))
tiny_motorbike/
├── Annotations/
├── ImageSets/
└── JPEGImages/
mAP on test dataset: 0.02356089856089856

Below, we randomly select an image from the test dataset and show the predicted class, box, and probability over the original image, stored in the predict_class, predict_rois, and predict_score columns, respectively. Each entry in predict_rois is a dict of (xmin, ymin, xmax, ymax) coordinates expressed as fractions of the original image size.

image_path = dataset_test.iloc[0]['image']
result = detector.predict(image_path)
print(result)
   predict_class  predict_score
0      motorbike       0.990474
1         person       0.950992
2            car       0.667868
3      motorbike       0.486178
4      motorbike       0.170904
..           ...            ...
75        person       0.028567
76           car       0.027829
77           car       0.027825
78        person       0.027437
79        person       0.026916

                                         predict_rois
0   {'xmin': 0.31737977266311646, 'ymin': 0.453904...
1   {'xmin': 0.40450236201286316, 'ymin': 0.321747...
2   {'xmin': 0.0030162357725203037, 'ymin': 0.6151...
3   {'xmin': 0.37501904368400574, 'ymin': 0.359075...
4   {'xmin': 0.027219824492931366, 'ymin': 0.03883...
..                                                ...
75  {'xmin': 0.43851685523986816, 'ymin': 0.056298...
76  {'xmin': 0.6897459030151367, 'ymin': 0.4357883...
77  {'xmin': 0.709216833114624, 'ymin': 0.43022984...
78  {'xmin': 0.8530325889587402, 'ymin': 0.0, 'xma...
79  {'xmin': 0.7920027375221252, 'ymin': 0.0103217...

[80 rows x 3 columns]
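Because the box coordinates are fractional, converting a prediction to pixel coordinates is a small arithmetic step. A minimal sketch (the roi values and image size below are illustrative, not taken from the run above):

```python
def roi_to_pixels(roi, img_width, img_height):
    """Convert a fractional (xmin, ymin, xmax, ymax) roi dict to integer pixels."""
    return {
        'xmin': int(roi['xmin'] * img_width),
        'ymin': int(roi['ymin'] * img_height),
        'xmax': int(roi['xmax'] * img_width),
        'ymax': int(roi['ymax'] * img_height),
    }

# Illustrative roi (same shape as a predict_rois entry) on a 500x375 image:
roi = {'xmin': 0.2, 'ymin': 0.4, 'xmax': 0.6, 'ymax': 0.9}
print(roi_to_pixels(roi, 500, 375))
# {'xmin': 100, 'ymin': 150, 'xmax': 300, 'ymax': 337}
```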

Prediction on multiple images is also supported:

bulk_result = detector.predict(dataset_test)
print(bulk_result)
     predict_class  predict_score
0        motorbike       0.990474
1           person       0.950992
2              car       0.667868
3        motorbike       0.486178
4        motorbike       0.170904
...            ...            ...
3302        person       0.040091
3303        person       0.039581
3304           car       0.038692
3305        person       0.038228
3306        person       0.038022

                                           predict_rois
0     {'xmin': 0.31737977266311646, 'ymin': 0.453904...
1     {'xmin': 0.40450236201286316, 'ymin': 0.321747...
2     {'xmin': 0.0030162357725203037, 'ymin': 0.6151...
3     {'xmin': 0.37501904368400574, 'ymin': 0.359075...
4     {'xmin': 0.027219824492931366, 'ymin': 0.03883...
...                                                 ...
3302  {'xmin': 0.6349342465400696, 'ymin': 0.1537989...
3303  {'xmin': 0.6123818159103394, 'ymin': 0.2333245...
3304  {'xmin': 0.7683401107788086, 'ymin': 0.6926075...
3305  {'xmin': 0.07734858244657516, 'ymin': 0.426136...
3306  {'xmin': 0.4701172709465027, 'ymin': 0.1686459...

                                                  image
0     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
1     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
2     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
3     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
4     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
...                                                 ...
3302  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
3303  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
3304  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
3305  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...
3306  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...

[3307 rows x 4 columns]
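The bulk result includes many low-confidence boxes. Since the result is a pandas DataFrame, filtering by a score threshold is a one-line operation. A sketch with a synthetic frame standing in for bulk_result (the 0.5 threshold is an arbitrary choice, not an AutoGluon default):

```python
import pandas as pd

# Synthetic stand-in with the same columns as bulk_result above.
bulk_result = pd.DataFrame({
    'predict_class': ['motorbike', 'person', 'car', 'person'],
    'predict_score': [0.99, 0.95, 0.04, 0.03],
    'predict_rois': [{'xmin': 0.3, 'ymin': 0.4, 'xmax': 0.7, 'ymax': 0.9}] * 4,
    'image': ['img0.jpg', 'img0.jpg', 'img1.jpg', 'img1.jpg'],
})

# Keep only confident detections.
confident = bulk_result[bulk_result['predict_score'] > 0.5]
print(confident['predict_class'].tolist())  # ['motorbike', 'person']
```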

We can also save the trained model and load it later:

savefile = 'detector.ag'
detector.save(savefile)
new_detector = ObjectDetector.load(savefile)