Image Classification - How to Use Your Own Datasets¶

This tutorial demonstrates how to use AutoGluon with your own custom datasets. As an example, we use a dataset from Kaggle to show the required steps to format image data properly for AutoGluon.

Step 1: Organizing the dataset into proper directories¶

After completing this step, you will have the following directory structure on your machine:

data/
├── class1/
├── class2/
├── class3/
├── ...

Here data is a folder containing the raw images categorized into classes. For example, subfolder class1 contains all images that belong to the first class, class2 contains all images belonging to the second class, etc. We generally recommend at least 100 training images per class for reasonable classification performance, but this might depend on the type of images in your specific use-case.

Under each class, the following image formats are supported when training your model:

JPG
JPEG
PNG

In the same dataset, all the images should be in the same format. Note that in image classification, we do not require that all images have the same resolution.

You will need to organize your dataset into the above directory structure before using AutoGluon. Below, we demonstrate how to construct this organization for a Kaggle dataset.

Example: Kaggle dataset¶

Kaggle is a popular machine learning competition platform and contains lots of datasets for different machine learning tasks including image classification. If you don’t have Kaggle account, please register one at Kaggle. Then, please follow the Kaggle installation to obtain access to Kaggle’s data downloading API.

To find image classification datasets in Kaggle, let’s go to Kaggle and search using keyword image classification either under Datasets or Competitions.

For example, we find the Shopee-IET Machine Learning Competition under the InClass tab in Competitions.

We then navigate to Data to download the dataset using the Kaggle API. Please make sure to click the button of “I Understand and Accept” before downloading the data.

An example shell script to download the dataset to ./data/shopeeiet/ can be found here: download_shopeeiet.sh.

After downloading this script to your machine, run it with:

import autogluon as ag
ag.download('https://raw.githubusercontent.com/zhanghang1989/AutoGluonWebdata/master/docs/tutorial/download_shopeeiet.sh')
!sh download_shopeeiet.sh

Now we have the desired directory structure under ./data/shopeeiet/train/, which in this case looks as follows:

shopeeiet/train
    ├── BabyBibs
    ├── BabyHat
    ├── BabyPants
    ├── ...
shopeeiet/test
    ├── ...

Here are some example images from this data:

Step 2: Split data into training/validation sets¶

A fundamental step in machine learning is to split the data into disjoint sets used for different purposes.

Training Set: The majority of your data should be in the training set. This is the data use to train your model: data is used to learn the parameters of the model, namely the weights of the neural network classifier.

Validation Set: A separate validation set (sometimes also called the dev set) is also used during AutoGluon’s training process. While neural network weights are updated based on the training data, each neural network requires the user to specify many hyperparameters (e.g., learning rates, etc.). The choice of hyperparameters greatly impacts the training process and resulting model. AutoGluon automatically tries many different values of these hyperparameters and evaluates each hyperparameter setting by measuring the performance of the resulting network on the validation set.

Test Set: A separate set of images, possibly without available labels. These data are never used during any part of the model construction or learning process. If unlabeled, these may correspond to images whose labels we would like to predict. If labeled, these images may correspond to images we reserve for estimating the performance of our final model.

Automatic training/validation split¶

AutoGluon automatically does Training/Validation split:

from autogluon import ImageClassification as task
dataset = task.Dataset('./data/shopeeiet/train')

AutoGluon automatically infers how many classes there are based on the directory structure. By default, AutoGluon automatically constructs the training/validation set split:

Training Set: 80% of images.
Validation Set: 20% of images.

where the images that fall into the validation set are randomly chosen from the training data based on the class.

Step 3: Use AutoGluon fit to generate a classification model¶

Now that we have a Dataset object, we can use AutoGluon’s default configuration to obtain an image classification model using the `fit </api/autogluon.task.html#autogluon.task.ImageClassification.fit>`__ function.

Due to the large size of the Kaggle dataset, calling fit without specifying a time limit may result in long waiting times. Run the following commands to run fit using a time limit:

time_limits = 10 * 60 # 10mins
classifier = task.fit(dataset, time_limits=time_limits,
                      ngpus_per_trial=1)

The top-1 accuracy of the best model on the validation set is:

print('Top-1 val acc: %.3f' % classifier.results['best_reward'])

Using AutoGluon to Generate Predictions on Test Images¶

We can ask our final model to generate predictions on the provided test images. We first load the test data as a Dataset object and then call predict:

test_dataset = task.Dataset('./data/shopeeiet/test', train=False)
inds, probs, probs_all = classifier.predict(test_dataset)

inds above contains the indices of the predicted class for each test image. probs contains the confidence in these class predictions. probs_all contains the confidence in the whole of classes predictions.

Here are the results of AutoGluon’s default fit and predict under different time_limits when executed on a p3.16xlarge EC2 instance:

The validation top-1 accuracy within 5h is 0.842, and ranks 14th place in Kaggle competition.
The validation top-1 accuracy within 24h is 0.846, and ranks 12th place in Kaggle competition.
The validation top-1 accuracy within 72h is 0.852, and ranks 9th place in Kaggle competition.

Step 4: Submit test predictions to Kaggle¶

If you wish to upload the model’s predictions to Kaggle, here is how to convert them into a format suitable for a submission into the Kaggle competition:

import autogluon as ag
ag.utils.generate_csv(inds, './data/shopeeiet/submission.csv')

This produces a submission file located at: ./data/shopeeiet/submission.csv.

To see an example submission, check out sample submission.csv at this link: Data.

To make your own submission, click Submission and then follow the steps in the submission page (upload submission file, describe the submission, and click the Make Submission button). Let’s see how your model fares in this competition!