.. _sec_object_detection_quick:

Object Detection - Quick Start
==============================


Object detection is the process of identifying and localizing objects in
an image and is an important task in computer vision. Follow this
tutorial to learn how to use AutoGluon for object detection.

**Tip**: If you are new to AutoGluon, review :ref:`sec_imgquick` first
to learn the basics of the AutoGluon API.

Our goal is to detect motorbikes in images with a `YOLOv3
model <https://pjreddie.com/media/files/papers/YOLOv3.pdf>`__. We use a
tiny dataset collected from the VOC dataset and focused on the motorbike
category. A model pretrained on the COCO dataset is fine-tuned on our
small dataset. With the help of AutoGluon, we can try many models with
different hyperparameters automatically and return the best one as our
final model.

To start, import ``ObjectDetector``:

.. code:: python

    from autogluon.vision import ObjectDetector


.. parsed-literal::
    :class: output

    /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/venv/lib/python3.7/site-packages/gluoncv/__init__.py:40: UserWarning: Both `mxnet==1.7.0` and `torch==1.9.0+cu102` are installed. You might encounter increased GPU memory footprint if both framework are used at the same time.
      warnings.warn(f'Both `mxnet=={mx.__version__}` and `torch=={torch.__version__}` are installed. '


Tiny\_motorbike Dataset
-----------------------

We collect a toy dataset for detecting motorbikes in images. From the
VOC dataset, images are randomly selected for training, validation, and
testing - 120 images for training, 50 images for validation, and 50 for
testing. This tiny dataset follows the same format as VOC.
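
To make the VOC format concrete: each image has an XML annotation file
listing the image size and one ``object`` entry per labeled box. The
sketch below parses a hand-written annotation with the standard library.
The file name, sizes, and boxes are hypothetical values chosen for
illustration, not taken from ``tiny_motorbike``, and
``parse_voc_annotation`` is a helper defined here, not part of AutoGluon.

.. code:: python

    import xml.etree.ElementTree as ET

    # Hypothetical annotation written by hand in the Pascal VOC layout.
    VOC_XML = """
    <annotation>
      <filename>000001.jpg</filename>
      <size><width>500</width><height>375</height><depth>3</depth></size>
      <object>
        <name>motorbike</name>
        <bndbox><xmin>48</xmin><ymin>120</ymin><xmax>310</xmax><ymax>360</ymax></bndbox>
      </object>
      <object>
        <name>person</name>
        <bndbox><xmin>100</xmin><ymin>40</ymin><xmax>220</xmax><ymax>300</ymax></bndbox>
      </object>
    </annotation>
    """

    def parse_voc_annotation(xml_text):
        """Return the file name, image size, and labeled boxes of one annotation."""
        root = ET.fromstring(xml_text.strip())
        size = root.find('size')
        boxes = []
        for obj in root.iter('object'):
            bndbox = obj.find('bndbox')
            box = {'class': obj.findtext('name')}
            # Box corners are stored as pixel coordinates.
            for key in ('xmin', 'ymin', 'xmax', 'ymax'):
                box[key] = int(bndbox.findtext(key))
            boxes.append(box)
        return {
            'filename': root.findtext('filename'),
            'width': int(size.findtext('width')),
            'height': int(size.findtext('height')),
            'boxes': boxes,
        }

    annotation = parse_voc_annotation(VOC_XML)
    for box in annotation['boxes']:
        print(box['class'], box['xmin'], box['ymin'], box['xmax'], box['ymax'])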

Using the commands below, we can download this dataset, which is only
23 MB. The unzipped folder is named ``tiny_motorbike``. The dataset
helper performs the download and extraction automatically and loads the
dataset according to the detection format.

.. code:: python

    url = 'https://autogluon.s3.amazonaws.com/datasets/tiny_motorbike.zip'
    dataset_train = ObjectDetector.Dataset.from_voc(url, splits='trainval')


.. parsed-literal::
    :class: output

    tiny_motorbike/
    ├── Annotations/
    ├── ImageSets/
    └── JPEGImages/


Fit Models by AutoGluon
-----------------------

In this section, we demonstrate how to apply AutoGluon to fit our
detection models. The call below does not fix a network or a learning
rate, so both come from AutoGluon's defaults for each trial. In the
logged run, the two trials fine-tuned an SSD model with a ResNet-50
backbone and a YOLOv3 model with a Darknet-53 backbone. The best model
is the one that obtains the best performance on the validation dataset.
You can also try more networks and hyperparameters to create a larger
search space.

We ``fit`` a detector using AutoGluon as follows. In each experiment
(one trial in our search space), we train the model for only 5 epochs to
keep the tutorial runtime short.

.. code:: python

    time_limit = 60 * 30  # at most 0.5 hour
    detector = ObjectDetector()
    hyperparameters = {'epochs': 5, 'batch_size': 8}
    hyperparameter_tune_kwargs = {'num_trials': 2}
    detector.fit(dataset_train, time_limit=time_limit,
                 hyperparameters=hyperparameters,
                 hyperparameter_tune_kwargs=hyperparameter_tune_kwargs)


.. parsed-literal::
    :class: output

    The number of requested GPUs is greater than the number of available GPUs.Reduce the number to 1
    Randomly split train_data into train[152]/validation[18] splits.
    Starting HPO experiments


.. parsed-literal::
    :class: output

      0%|          | 0/2 [00:00<?, ?it/s]


.. parsed-literal::
    :class: output

    modified configs(<old> != <new>): {
    root.train.seed      233 != 188
    root.train.batch_size 16 != 8
    root.train.early_stop_patience -1 != 10
    root.train.epochs    20 != 5
    root.train.early_stop_baseline 0.0 != -inf
    root.train.early_stop_max_value 1.0 != inf
    root.dataset         voc_tiny != auto
    root.valid.batch_size 16 != 8
    root.ssd.data_shape  300 != 512
    root.ssd.base_network vgg16_atrous != resnet50_v1
    root.gpus            (0, 1, 2, 3) != (0,)
    root.dataset_root    ~/.mxnet/datasets/ != auto
    root.num_workers     4 != 8
    }
    Saved config to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/7110f2aa/.trial_0/config.yaml
    Using transfer learning from ssd_512_resnet50_v1_coco, the other network parameters are ignored.
    Start training from [Epoch 0]
    [Epoch 0] Training cost: 9.372814, CrossEntropy=3.422843, SmoothL1=0.970789
    [Epoch 0] Validation: 
    person=0.6336037361653125
    motorbike=0.7388429752066115
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.0
    bus=1.0000000000000002
    car=1.0000000000000002
    boat=nan
    dog=0.0
    mAP=0.48177810162456064
    [Epoch 0] Current best map: 0.481778 vs previous 0.000000, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/7110f2aa/.trial_0/best_checkpoint.pkl
    [Epoch 1] Training cost: 8.013312, CrossEntropy=2.728228, SmoothL1=1.148340
    [Epoch 1] Validation: 
    person=0.8179817081730957
    motorbike=0.8139361707430133
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.0
    bus=1.0000000000000002
    car=1.0000000000000002
    boat=nan
    dog=0.33333333333333326
    mAP=0.5664644588927775
    [Epoch 1] Current best map: 0.566464 vs previous 0.481778, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/7110f2aa/.trial_0/best_checkpoint.pkl
    [Epoch 2] Training cost: 8.277046, CrossEntropy=2.254046, SmoothL1=0.981910
    [Epoch 2] Validation: 
    person=0.700187969924812
    motorbike=0.912092957547503
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.0
    bus=1.0000000000000002
    car=1.0000000000000002
    boat=nan
    dog=0.0
    mAP=0.5160401324960451
    [Epoch 3] Training cost: 8.101155, CrossEntropy=2.234331, SmoothL1=0.996270
    [Epoch 3] Validation: 
    person=0.7145325078816583
    motorbike=0.8005809979494191
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.0
    bus=1.0000000000000002
    car=1.0000000000000002
    boat=nan
    dog=0.0
    mAP=0.5021590722615825
    [Epoch 4] Training cost: 8.140993, CrossEntropy=2.261475, SmoothL1=0.949625
    [Epoch 4] Validation: 
    person=0.7183485157793459
    motorbike=0.8556343837650554
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.0
    bus=1.0000000000000002
    car=1.0000000000000002
    boat=nan
    dog=0.0
    mAP=0.5105689856492003
    Applying the state from the best checkpoint...
    modified configs(<old> != <new>): {
    root.train.seed      233 != 188
    root.train.early_stop_patience -1 != 10
    root.train.epochs    20 != 5
    root.train.early_stop_baseline 0.0 != -inf
    root.train.early_stop_max_value 1.0 != inf
    root.train.batch_size 16 != 8
    root.dataset         voc_tiny != auto
    root.valid.batch_size 16 != 8
    root.gpus            (0, 1, 2, 3) != (0,)
    root.dataset_root    ~/.mxnet/datasets/ != auto
    root.num_workers     4 != 8
    }
    Saved config to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/7110f2aa/.trial_1/config.yaml
    Using transfer learning from yolo3_darknet53_coco, the other network parameters are ignored.
    Start training from [Epoch 0]
    [Epoch 0] Training cost: 15.599, ObjLoss=9.696, BoxCenterLoss=8.126, BoxScaleLoss=2.626, ClassLoss=4.826
    [Epoch 0] Validation: 
    person=0.643974227310219
    motorbike=0.7011628893981835
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.5000000000000001
    bus=1.0000000000000002
    car=0.32057416267942584
    boat=nan
    dog=0.5000000000000001
    mAP=0.5236730399125469
    [Epoch 0] Current best map: 0.523673 vs previous 0.000000, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/7110f2aa/.trial_1/best_checkpoint.pkl
    [Epoch 1] Training cost: 12.816, ObjLoss=9.781, BoxCenterLoss=7.802, BoxScaleLoss=2.691, ClassLoss=3.930
    [Epoch 1] Validation: 
    person=0.740512972865914
    motorbike=0.6893028024606972
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=1.0000000000000002
    bus=1.0000000000000002
    car=1.0000000000000002
    boat=nan
    dog=0.11111111111111108
    mAP=0.6487038409196747
    [Epoch 1] Current best map: 0.648704 vs previous 0.523673, saved to /var/lib/jenkins/workspace/workspace/autogluon-tutorial-object-detection-v3/docs/_build/eval/tutorials/object_detection/7110f2aa/.trial_1/best_checkpoint.pkl
    [Epoch 2] Training cost: 13.141, ObjLoss=9.983, BoxCenterLoss=7.779, BoxScaleLoss=2.864, ClassLoss=3.571
    [Epoch 2] Validation: 
    person=0.7642860422405876
    motorbike=0.5028801701976575
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.33333333333333326
    bus=0.25000000000000006
    car=1.0000000000000002
    boat=nan
    dog=0.5000000000000001
    mAP=0.47864279225308265
    [Epoch 3] Training cost: 10.478, ObjLoss=9.781, BoxCenterLoss=7.755, BoxScaleLoss=2.866, ClassLoss=3.347
    [Epoch 3] Validation: 
    person=0.7978907352480293
    motorbike=0.8500494071146245
    cow=nan
    chair=nan
    pottedplant=0.025000000000000005
    bicycle=0.5000000000000001
    bus=0.0
    car=0.7727272727272726
    boat=nan
    dog=0.0
    mAP=0.4208096307271324
    [Epoch 4] Training cost: 13.389, ObjLoss=9.742, BoxCenterLoss=7.780, BoxScaleLoss=2.925, ClassLoss=3.208
    [Epoch 4] Validation: 
    person=0.8209430919957236
    motorbike=0.8627934661371194
    cow=nan
    chair=nan
    pottedplant=0.0
    bicycle=0.14285714285714288
    bus=1.0000000000000002
    car=1.0000000000000002
    boat=nan
    dog=0.0
    mAP=0.5466562429985694
    Applying the state from the best checkpoint...
    Finished, total runtime is 163.77 s
    { 'best_config': { 'dataset': 'auto',
                       'dataset_root': 'auto',
                       'estimator': <class 'gluoncv.auto.estimators.yolo.yolo.YOLOv3Estimator'>,
                       'gpus': [0],
                       'horovod': False,
                       'num_workers': 8,
                       'resume': '',
                       'save_interval': 10,
                       'save_prefix': '',
                       'train': { 'batch_size': 8,
                                  'early_stop_baseline': -inf,
                                  'early_stop_max_value': inf,
                                  'early_stop_min_delta': 0.001,
                                  'early_stop_patience': 10,
                                  'epochs': 5,
                                  'label_smooth': False,
                                  'log_interval': 100,
                                  'lr': 0.001,
                                  'lr_decay': 0.1,
                                  'lr_decay_epoch': (160, 180),
                                  'lr_decay_period': 0,
                                  'lr_mode': 'step',
                                  'mixup': False,
                                  'momentum': 0.9,
                                  'no_mixup_epochs': 20,
                                  'no_wd': False,
                                  'num_samples': -1,
                                  'seed': 188,
                                  'start_epoch': 0,
                                  'warmup_epochs': 0,
                                  'warmup_lr': 0.0,
                                  'wd': 0.0005},
                       'valid': { 'batch_size': 8,
                                  'iou_thresh': 0.5,
                                  'metric': 'voc07',
                                  'val_interval': 1},
                       'yolo3': { 'amp': False,
                                  'anchors': ( [10, 13, 16, 30, 33, 23],
                                               [30, 61, 62, 45, 59, 119],
                                               [116, 90, 156, 198, 373, 326]),
                                  'base_network': 'darknet53',
                                  'data_shape': 416,
                                  'filters': (512, 256, 128),
                                  'nms_thresh': 0.45,
                                  'nms_topk': 400,
                                  'no_random_shape': False,
                                  'strides': (8, 16, 32),
                                  'syncbn': False,
                                  'transfer': 'yolo3_darknet53_coco'}},
      'total_time': 163.76751351356506,
      'train_map': 0.7782267804566688,
      'valid_map': 0.6487038409196747}


.. parsed-literal::
    :class: output

    <autogluon.vision.detector.detector.ObjectDetector at 0x7f6ee139a850>


Note that ``num_trials=2`` above is only used to speed up the tutorial.
In normal practice, it is common to set only ``time_limit`` and drop
``num_trials``. Also note that hyperparameter tuning defaults to random
search. Model-based variants, such as ``searcher='bayesopt'`` in
``hyperparameter_tune_kwargs``, can be a lot more sample-efficient.
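
As a minimal sketch of that advice, the arguments for such a run could be
assembled as below. Only the ``searcher`` value and the argument names
come from this tutorial; the ``fit_kwargs`` variable is introduced here
for illustration, and whether ``'bayesopt'`` is available depends on the
installed AutoGluon version, so treat this as an assumption to verify.

.. code:: python

    # Budget the search by wall-clock time only; no 'num_trials' cap.
    time_limit = 60 * 30  # seconds

    hyperparameters = {'epochs': 5, 'batch_size': 8}

    # Ask for a model-based searcher instead of the default random search.
    hyperparameter_tune_kwargs = {'searcher': 'bayesopt'}

    fit_kwargs = {
        'time_limit': time_limit,
        'hyperparameters': hyperparameters,
        'hyperparameter_tune_kwargs': hyperparameter_tune_kwargs,
    }

    # With AutoGluon installed and ``dataset_train`` loaded as shown earlier:
    # detector = ObjectDetector()
    # detector.fit(dataset_train, **fit_kwargs)
    print(fit_kwargs)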

After fitting, AutoGluon automatically returns the best model among all
models in the search space. From the output, we know the best model is
the one from the second trial (YOLOv3 with a Darknet-53 backbone), which
reached a validation mAP of about 0.649. To see how well the returned
model performs on the test dataset, call ``detector.evaluate()``.

.. code:: python

    dataset_test = ObjectDetector.Dataset.from_voc(url, splits='test')
    
    test_map = detector.evaluate(dataset_test)
    print("mAP on test dataset: {}".format(test_map[1][-1]))


.. parsed-literal::
    :class: output

    tiny_motorbike/
    ├── Annotations/
    ├── ImageSets/
    └── JPEGImages/
    mAP on test dataset: 0.3416276980042973
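
The configuration printed during training lists ``iou_thresh`` as 0.5
for validation. Roughly speaking, a predicted box is matched to a
ground-truth box of the same class when their intersection over union
(IoU) reaches that threshold. The function below is a minimal sketch of
IoU for two boxes given as ``(xmin, ymin, xmax, ymax)`` tuples; it is
written for illustration and is not the implementation used by AutoGluon
or GluonCV, which may treat pixel boundaries differently.

.. code:: python

    def box_iou(box_a, box_b):
        """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
        ax_min, ay_min, ax_max, ay_max = box_a
        bx_min, by_min, bx_max, by_max = box_b

        # Width and height of the overlap, clamped at zero for disjoint boxes.
        inter_w = max(0.0, min(ax_max, bx_max) - max(ax_min, bx_min))
        inter_h = max(0.0, min(ay_max, by_max) - max(ay_min, by_min))
        intersection = inter_w * inter_h

        area_a = (ax_max - ax_min) * (ay_max - ay_min)
        area_b = (bx_max - bx_min) * (by_max - by_min)
        union = area_a + area_b - intersection
        return intersection / union if union > 0 else 0.0

    # Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
    iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
    print(iou >= 0.5)  # False: this pair would not count as a match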


Below, we take the first image from the test dataset and show the
predicted class, box, and probability for the original image, stored in
the ``predict_class``, ``predict_rois``, and ``predict_score`` columns,
respectively. You can interpret ``predict_rois`` as a dict of (``xmin``,
``ymin``, ``xmax``, ``ymax``) values proportional to the original image
size.

.. code:: python

    image_path = dataset_test.iloc[0]['image']
    result = detector.predict(image_path)
    print(result)


.. parsed-literal::
    :class: output

       predict_class  predict_score  \
    0      motorbike       0.682645   
    1         person       0.577913   
    2            car       0.363752   
    3      motorbike       0.352576   
    4         person       0.258981   
    5         person       0.256710   
    6      motorbike       0.225929   
    7         person       0.200261   
    8      motorbike       0.120902   
    9        bicycle       0.109221   
    10   pottedplant       0.099113   
    11        person       0.099011   
    12        person       0.090721   
    13        person       0.088232   
    14        person       0.070633   
    15        person       0.062771   
    16        person       0.046365   
    17        person       0.043824   
    18       bicycle       0.042624   
    19        person       0.041354   
    20           cow       0.040686   
    21          boat       0.039857   
    22   pottedplant       0.039258   
    23           bus       0.038404   
    24     motorbike       0.038231   
    25         chair       0.036855   
    26     motorbike       0.032093   
    27   pottedplant       0.031982   
    28        person       0.031942   
    29        person       0.029937   
    30           car       0.028760   
    31           dog       0.028587   
    32     motorbike       0.024838   
    33           dog       0.023639   
    34     motorbike       0.023555   
    35         chair       0.023419   
    36        person       0.022975   
    37     motorbike       0.021419   
    38        person       0.020317   
    39        person       0.016776   
    40   pottedplant       0.015502   
    41           dog       0.015088   
    42   pottedplant       0.015062   
    43        person       0.013744   
    44     motorbike       0.011344   
    45     motorbike       0.011245   
    46        person       0.011236   
    47     motorbike       0.011236   
    48        person       0.011068   
    49        person       0.010974   
    50        person       0.010575   
    
                                             predict_rois  
    0   {'xmin': 0.3310595154762268, 'ymin': 0.4464629...  
    1   {'xmin': 0.34560394287109375, 'ymin': 0.347209...  
    2   {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...  
    3   {'xmin': 0.0, 'ymin': 0.6157286763191223, 'xma...  
    4   {'xmin': 0.6616300940513611, 'ymin': 0.0, 'xma...  
    5   {'xmin': 0.4548812210559845, 'ymin': 0.0031030...  
    6   {'xmin': 0.007165733724832535, 'ymin': 0.67869...  
    7   {'xmin': 0.057544589042663574, 'ymin': 0.02677...  
    8   {'xmin': 0.35936659574508667, 'ymin': 0.247161...  
    9   {'xmin': 0.3310595154762268, 'ymin': 0.4464629...  
    10  {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...  
    11  {'xmin': 0.7704325914382935, 'ymin': 0.0, 'xma...  
    12  {'xmin': 0.6943906545639038, 'ymin': 0.0, 'xma...  
    13  {'xmin': 0.4034964144229889, 'ymin': 0.2719404...  
    14  {'xmin': 0.5255002975463867, 'ymin': 0.0012342...  
    15  {'xmin': 0.7239393591880798, 'ymin': 0.3926926...  
    16  {'xmin': 0.9029600620269775, 'ymin': 0.0302012...  
    17  {'xmin': 0.6395756602287292, 'ymin': 0.0419282...  
    18  {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...  
    19  {'xmin': 0.5328963398933411, 'ymin': 0.0, 'xma...  
    20  {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...  
    21  {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...  
    22  {'xmin': 0.35936659574508667, 'ymin': 0.247161...  
    23  {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...  
    24  {'xmin': 0.7239393591880798, 'ymin': 0.3926926...  
    25  {'xmin': 0.0, 'ymin': 0.6157286763191223, 'xma...  
    26  {'xmin': 0.4548812210559845, 'ymin': 0.0031030...  
    27  {'xmin': 0.3310595154762268, 'ymin': 0.4464629...  
    28  {'xmin': 0.6135271787643433, 'ymin': 0.0339585...  
    29  {'xmin': 0.8174579739570618, 'ymin': 0.0, 'xma...  
    30  {'xmin': 0.7729672193527222, 'ymin': 0.0, 'xma...  
    31  {'xmin': 0.3310595154762268, 'ymin': 0.4464629...  
    32  {'xmin': 0.7729672193527222, 'ymin': 0.0, 'xma...  
    33  {'xmin': 0.0, 'ymin': 0.6157286763191223, 'xma...  
    34  {'xmin': 0.056816305965185165, 'ymin': 0.03956...  
    35  {'xmin': 0.007165733724832535, 'ymin': 0.67869...  
    36  {'xmin': 0.9146621227264404, 'ymin': 0.0, 'xma...  
    37  {'xmin': 0.6616300940513611, 'ymin': 0.0, 'xma...  
    38  {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...  
    39  {'xmin': 0.5964206457138062, 'ymin': 0.0, 'xma...  
    40  {'xmin': 0.7239393591880798, 'ymin': 0.3926926...  
    41  {'xmin': 0.007165733724832535, 'ymin': 0.67869...  
    42  {'xmin': 0.4548812210559845, 'ymin': 0.0031030...  
    43  {'xmin': 0.2948954403400421, 'ymin': 0.2013196...  
    44  {'xmin': 0.6943906545639038, 'ymin': 0.0, 'xma...  
    45  {'xmin': 0.8174579739570618, 'ymin': 0.0, 'xma...  
    46  {'xmin': 0.03064700961112976, 'ymin': 0.0, 'xm...  
    47  {'xmin': 0.797978937625885, 'ymin': 0.08672408...  
    48  {'xmin': 0.9054626822471619, 'ymin': 0.0, 'xma...  
    49  {'xmin': 0.6799211502075195, 'ymin': 0.0312307...  
    50  {'xmin': 0.9095916748046875, 'ymin': 0.0010530...  
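
Many of the rows above have very low scores, and the boxes are expressed
as fractions of the image size. A minimal post-processing sketch,
assuming each row is available as a plain dict (for a pandas DataFrame,
``result.to_dict('records')`` produces such a list): keep detections
above a score threshold and convert their boxes to pixel coordinates.
The rows, the 0.5 threshold, and the 400x300 image size are hypothetical
values, and both helper functions are defined here for illustration.

.. code:: python

    def keep_confident(rows, min_score=0.5):
        """Keep only detections whose score reaches ``min_score``."""
        return [row for row in rows if row['predict_score'] >= min_score]

    def roi_to_pixels(roi, image_width, image_height):
        """Scale a proportional box to integer pixel coordinates."""
        return (
            int(round(roi['xmin'] * image_width)),
            int(round(roi['ymin'] * image_height)),
            int(round(roi['xmax'] * image_width)),
            int(round(roi['ymax'] * image_height)),
        )

    # Hypothetical rows using the same column names as the output above.
    rows = [
        {'predict_class': 'motorbike', 'predict_score': 0.68,
         'predict_rois': {'xmin': 0.25, 'ymin': 0.5, 'xmax': 0.75, 'ymax': 1.0}},
        {'predict_class': 'person', 'predict_score': 0.10,
         'predict_rois': {'xmin': 0.0, 'ymin': 0.0, 'xmax': 0.5, 'ymax': 0.5}},
    ]

    confident = keep_confident(rows, min_score=0.5)
    for row in confident:
        pixel_box = roi_to_pixels(row['predict_rois'], image_width=400, image_height=300)
        print(row['predict_class'], row['predict_score'], pixel_box)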


You can also predict on multiple images at once. The result gains an
``image`` column identifying the source image of each detection:

.. code:: python

    bulk_result = detector.predict(dataset_test)
    print(bulk_result)


.. parsed-literal::
    :class: output

         predict_class  predict_score  \
    0        motorbike       0.682645   
    1           person       0.577913   
    2              car       0.363752   
    3        motorbike       0.352576   
    4           person       0.258981   
    ...            ...            ...   
    1857     motorbike       0.011603   
    1858        person       0.011169   
    1859     motorbike       0.010710   
    1860     motorbike       0.010163   
    1861   pottedplant       0.010114   
    
                                               predict_rois  \
    0     {'xmin': 0.3310595154762268, 'ymin': 0.4464629...   
    1     {'xmin': 0.34560394287109375, 'ymin': 0.347209...   
    2     {'xmin': 0.0, 'ymin': 0.6688785552978516, 'xma...   
    3     {'xmin': 0.0, 'ymin': 0.6157286763191223, 'xma...   
    4     {'xmin': 0.6616300940513611, 'ymin': 0.0, 'xma...   
    ...                                                 ...   
    1857  {'xmin': 0.10874426364898682, 'ymin': 0.025177...   
    1858  {'xmin': 0.3966425359249115, 'ymin': 0.3692439...   
    1859  {'xmin': 0.25758716464042664, 'ymin': 0.019422...   
    1860  {'xmin': 0.3919074833393097, 'ymin': 0.0, 'xma...   
    1861  {'xmin': 0.3911811411380768, 'ymin': 0.0282618...   
    
                                                      image  
    0     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    1     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    2     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    3     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    4     /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    ...                                                 ...  
    1857  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    1858  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    1859  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    1860  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    1861  /var/lib/jenkins/.gluoncv/datasets/tiny_motorb...  
    
    [1862 rows x 4 columns]


We can also save the trained model to disk and load it again later:

.. code:: python

    savefile = 'detector.ag'
    detector.save(savefile)
    new_detector = ObjectDetector.load(savefile)