Tune PyTorch Model on MNIST

In this tutorial, we demonstrate how to do hyperparameter optimization (HPO) using AutoGluon with PyTorch. AutoGluon is a framework-agnostic HPO toolkit that is compatible with any training code written in Python. The PyTorch code used in this tutorial is adapted from this git repo. In your applications, this code can be replaced with your own PyTorch code.

Import the packages:

import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision
import torchvision.transforms as transforms
from tqdm.auto import tqdm

Start with an MNIST Example

Data Transforms

We first apply standard image transforms to our training and validation data:

transform = transforms.Compose([
   transforms.ToTensor(),  # convert PIL images to float tensors in [0, 1]
   transforms.Normalize((0.1307,), (0.3081,))
])

# the datasets
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
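
To make the effect of the normalization constants concrete, the following sketch reproduces the per-pixel arithmetic in plain Python. It assumes the raw 8-bit pixel is first scaled to [0, 1] (as torchvision's ToTensor does) and then shifted and scaled by the mean 0.1307 and standard deviation 0.3081 used above. The helper name normalize_pixel is hypothetical and only serves this illustration.

```python
# Constants taken from the Normalize call in the transform above.
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081


def normalize_pixel(raw_value):
    """Map a raw 8-bit pixel (0..255) to its normalized value.

    Hypothetical helper for illustration; the tutorial itself relies on
    torchvision transforms to do this for whole images.
    """
    scaled = raw_value / 255.0                 # scaling to [0, 1]
    return (scaled - MNIST_MEAN) / MNIST_STD   # subtract mean, divide by std


for raw in (0, 128, 255):
    print(raw, round(normalize_pixel(raw), 4))
```

Background pixels (raw value 0) end up slightly negative and bright strokes end up well above 1, so the inputs are roughly centered around zero.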

Main Training Loop

The following train_mnist function represents ordinary training code that a user would write for the MNIST dataset. Python users typically use an argument parser to conveniently override default values. The only additional argument you need to add to your existing Python function is a reporter object, which is used to record the performance achieved under different hyperparameter settings.

def train_mnist(args, reporter):
    # get variables from args
    lr = args.lr
    wd = args.wd
    epochs = args.epochs
    net = args.net
    print('lr: {}, wd: {}'.format(lr, wd))

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Model
    net = net.to(device)

    if device == 'cuda':
        net = nn.DataParallel(net)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=args.lr, momentum=0.9, weight_decay=wd)

    # datasets and dataloaders
    trainset = torchvision.datasets.MNIST(root='./data', train=True, download=False, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

    testset = torchvision.datasets.MNIST(root='./data', train=False, download=False, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)

    # Training
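    def train(epoch):
        net.train()  # enable training mode
        train_loss, correct, total = 0, 0, 0
        for batch_idx, (inputs, targets) in enumerate(trainloader):
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()  # clear gradients from the previous step
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            loss.backward()        # backpropagate
            optimizer.step()       # update the weights

    def test(epoch):
        net.eval()  # switch to evaluation mode
        test_loss, correct, total = 0, 0, 0
        with torch.no_grad():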
            for batch_idx, (inputs, targets) in enumerate(testloader):
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = net(inputs)
                loss = criterion(outputs, targets)

                test_loss += loss.item()
                _, predicted = outputs.max(1)
                total += targets.size(0)
                correct += predicted.eq(targets).sum().item()

        acc = 100.*correct/total
        # 'epoch' reports the number of epochs done
        reporter(epoch=epoch+1, accuracy=acc)

    for epoch in tqdm(range(0, epochs)):
        train(epoch)
        test(epoch)
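
To see what the (args, reporter) contract amounts to without running AutoGluon, here is a minimal sketch with a stand-in reporter. RecordingReporter and toy_train are hypothetical names invented for this illustration, and the accuracy values are fabricated by a formula so that the sketch runs without data or a GPU. The real reporter object is supplied by the AutoGluon scheduler.

```python
from argparse import Namespace


class RecordingReporter:
    """Hypothetical stand-in for the reporter: stores every report it gets."""

    def __init__(self):
        self.records = []

    def __call__(self, **kwargs):
        # Called as reporter(epoch=..., accuracy=...), like in train_mnist.
        self.records.append(dict(kwargs))


def toy_train(args, reporter):
    """Placeholder with the same (args, reporter) signature as train_mnist."""
    for epoch in range(args.epochs):
        # Fake metric for illustration only, not a real MNIST accuracy.
        accuracy = 100.0 * (1.0 - 0.5 ** (epoch + 1))
        reporter(epoch=epoch + 1, accuracy=accuracy)


reporter = RecordingReporter()
toy_train(Namespace(lr=0.05, wd=2e-4, epochs=3), reporter)

best = max(reporter.records, key=lambda record: record["accuracy"])
print(best)  # {'epoch': 3, 'accuracy': 87.5}
```

The same pattern can be handy for smoke-testing a training function with fixed hyperparameters before handing it to a scheduler.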

AutoGluon HPO

In this section, we cover how to define a searchable network architecture, convert the training function to be searchable, create the scheduler, and then launch the experiment.

Define a Searchable Network Architecture

Let’s define a ‘dynamic’ network with searchable configurations by simply adding the decorator autogluon.obj(). In this example, we search over only two arguments, hidden_conv and hidden_fc, which set the number of hidden channels in the first convolutional layer and the number of hidden units in the first fully connected layer. More information about search spaces is available at autogluon.core.space().

import autogluon.core as ag

@ag.obj(
    hidden_conv=ag.space.Int(6, 12),
    hidden_fc=ag.space.Categorical(80, 120, 160),
)
class Net(nn.Module):
    def __init__(self, hidden_conv, hidden_fc):
        super().__init__()  # required before registering submodules
        self.conv1 = nn.Conv2d(1, hidden_conv, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(hidden_conv, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, hidden_fc)
        self.fc2 = nn.Linear(hidden_fc, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
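
The constant 16 * 4 * 4 in fc1 and in the view call follows from the layer arithmetic on 28x28 MNIST images. The sketch below traces the spatial size through the two convolution and pooling stages using the standard output-size formula for unpadded, stride-1 convolutions. The helper names conv_out and pool_out are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max-pooling layer along one dimension."""
    return (size - kernel) // stride + 1


size = 28                              # MNIST images are 28x28
size = pool_out(conv_out(size, 5))     # conv1 (5x5): 28 -> 24, pool: 24 -> 12
size = pool_out(conv_out(size, 5))     # conv2 (5x5): 12 -> 8,  pool: 8 -> 4

flat_features = 16 * size * size       # conv2 has 16 output channels
print(size, flat_features)             # 4 256
```

Note that hidden_conv changes only the number of channels between conv1 and conv2, so the flattened size stays at 256 for every value in the search space.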

Convert the Training Function to Be Searchable

We can simply add the decorator autogluon.args() so that the argument values of the train_mnist function are tuned by AutoGluon’s hyperparameter optimizer. In the example below, we specify that the lr argument is a real value that should be searched on a log scale in the range 0.01 to 0.2. Before passing lr to your train function, AutoGluon always selects a concrete floating-point value to assign to lr, so you do not need to make any special modifications to your existing code to accommodate the hyperparameter search.

@ag.args(
    lr = ag.space.Real(0.01, 0.2, log=True),
    wd = ag.space.Real(1e-4, 5e-4, log=True),
    net = Net(),
    epochs = 5,  # read as args.epochs inside train_mnist
)
def ag_train_mnist(args, reporter):
    return train_mnist(args, reporter)
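
As a rough illustration of what searching on a log scale means, the sketch below draws values uniformly in log space and computes the midpoint of a range in log space (the geometric mean of its bounds). This is a plain-Python approximation under stated assumptions, not AutoGluon's sampling code, and the function names are hypothetical.

```python
import math
import random


def sample_log_uniform(lower, upper, rng):
    """Draw a value whose logarithm is uniform between log(lower) and log(upper)."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def log_midpoint(lower, upper):
    """Midpoint of the range in log space, i.e. the geometric mean."""
    return math.sqrt(lower * upper)


rng = random.Random(0)
samples = [sample_log_uniform(0.01, 0.2, rng) for _ in range(1000)]

# About half of the samples fall below the geometric midpoint (about 0.0447),
# whereas a linear-scale search would place far fewer samples there.
below = sum(sample < log_midpoint(0.01, 0.2) for sample in samples)
print(below / len(samples))

print(log_midpoint(0.01, 0.2))    # lr range
print(log_midpoint(1e-4, 5e-4))   # wd range
```

The two midpoints are approximately 0.0447213595 and 0.0002236068, which appear to coincide with the lr and wd values logged for the first trial in the runs below.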

Create the Scheduler and Launch the Experiment

For hyperparameter tuning, AutoGluon provides a number of different schedulers:

  • FIFOScheduler: Each training job runs for the full number of epochs

  • HyperbandScheduler: Uses successive halving and Hyperband scheduling in order to stop unpromising jobs early, so that the available budget is allocated more efficiently

Each scheduler is internally configured by a searcher, which determines the choice of hyperparameter configurations to be run. The default searcher is random: configurations are drawn uniformly at random from the search space.

myscheduler = ag.scheduler.FIFOScheduler(
    ag_train_mnist,
    resource={'num_cpus': 4, 'num_gpus': 1},
    num_trials=2,
    time_attr='epoch',
    reward_attr='accuracy')
myscheduler.run()
myscheduler.join_jobs()
lr: 0.0447213595, wd: 0.0002236068
lr: 0.028245913732173278, wd: 0.00017160776862349292

We plot the test accuracy achieved over the course of training under each hyperparameter configuration that AutoGluon tried out (represented as different colors).

myscheduler.get_training_curves(plot=True, use_legend=False)
print('The Best Configuration and Accuracy are: {}, {}'.format(myscheduler.get_best_config(),
                                                               myscheduler.get_best_reward()))
The Best Configuration and Accuracy are: {'lr': 0.028245913732173278, 'net▁hidden_conv': 11, 'net▁hidden_fc▁choice': 0, 'wd': 0.00017160776862349292}, 98.96
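
Conceptually, the default random searcher draws each configuration independently from the search space and keeps track of the best result. The sketch below mimics that loop in plain Python over the same four hyperparameters. The objective toy_objective is a made-up score, not MNIST accuracy, and both the inclusive integer bounds and the log-uniform draws are assumptions of this sketch rather than a description of AutoGluon internals.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from a search space shaped like the tutorial's."""
    return {
        "lr": math.exp(rng.uniform(math.log(0.01), math.log(0.2))),
        "wd": math.exp(rng.uniform(math.log(1e-4), math.log(5e-4))),
        "hidden_conv": rng.randint(6, 12),        # assumed inclusive bounds
        "hidden_fc": rng.choice([80, 120, 160]),
    }


def toy_objective(config):
    """Hypothetical score that peaks at lr = 0.03; higher is better."""
    return -abs(math.log(config["lr"]) - math.log(0.03))


def random_search(num_trials, seed=0):
    """Evaluate num_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(num_trials)]
    return max(trials, key=toy_objective)


best = random_search(num_trials=20)
print(best)
```

Because every draw ignores the results of earlier trials, the only way random search improves is by spending more trials, which motivates the model-based searchers discussed next.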

Search by Bayesian Optimization

While simple to implement, random search is usually not an efficient way to propose configurations for evaluation. AutoGluon provides a number of model-based searchers:

  • Gaussian process based Bayesian optimization (bayesopt)

  • SkOpt Bayesian optimization (skopt; only with FIFO scheduler)

Here, skopt maps to scikit-optimize, whereas bayesopt is AutoGluon's own implementation. While skopt is currently somewhat more versatile (choice of acquisition function and surrogate model), bayesopt is optimized directly for asynchronous parallel scheduling. Importantly, bayesopt runs with both the FIFO and Hyperband schedulers, while skopt is restricted to the FIFO scheduler.

When running the following examples, comparing the different schedulers and searchers, you need to increase num_trials (or use time_out instead, which specifies the search budget in terms of wall-clock time) in order to see differences in performance.

myscheduler = ag.scheduler.FIFOScheduler(
    ag_train_mnist,
    resource={'num_cpus': 4, 'num_gpus': 1},
    searcher='bayesopt',
    num_trials=2,
    time_attr='epoch',
    reward_attr='accuracy')
myscheduler.run()
myscheduler.join_jobs()
lr: 0.0447213595, wd: 0.0002236068
lr: 0.02949596812667022, wd: 0.0002726933304854186

Search by Asynchronous BOHB

When training neural networks, it is often more efficient to use early stopping, and in particular Hyperband scheduling can save a lot of wall-clock time. AutoGluon provides a combination of Hyperband scheduling with asynchronous Bayesian optimization (more details can be found here):

myscheduler = ag.scheduler.HyperbandScheduler(
    ag_train_mnist,
    resource={'num_cpus': 4, 'num_gpus': 1},
    searcher='bayesopt',
    num_trials=2,
    time_attr='epoch',
    reward_attr='accuracy',
    grace_period=1,
    reduction_factor=3)
print(myscheduler)
myscheduler.run()
myscheduler.join_jobs()
HyperbandScheduler(terminator: HyperbandBracketManager(reward_attr: accuracy, time_attr: epoch, rung_levels: [1, 3], max_t: 5, rung_systems: [Rung system: Iter 3.000: None | Iter 1.000: None])
lr: 0.0447213595, wd: 0.0002236068
lr: 0.059324429346096504, wd: 0.0003720939425561732
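
The rung_levels: [1, 3] and max_t: 5 values in the scheduler printout above are consistent with a geometric progression that starts at 1 epoch and multiplies by 3 until it reaches the maximum number of epochs. The sketch below reproduces that arithmetic and shows a simplified, synchronous version of the successive-halving decision at one rung. The parameter names grace_period and reduction_factor, the helper names, and the synchronous formulation are all assumptions of this illustration. AutoGluon's scheduler makes these decisions asynchronously.

```python
def rung_levels(grace_period, reduction_factor, max_t):
    """Epoch counts at which trials are compared (all strictly below max_t)."""
    levels = []
    level = grace_period
    while level < max_t:
        levels.append(level)
        level *= reduction_factor
    return levels


def survivors(scores, reduction_factor):
    """Keep the top 1/reduction_factor of trials at a rung (at least one)."""
    keep = max(1, len(scores) // reduction_factor)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


print(rung_levels(grace_period=1, reduction_factor=3, max_t=5))   # [1, 3]

# Hypothetical accuracies of three trials after the first rung.
print(survivors({"a": 91.0, "b": 97.5, "c": 94.2}, reduction_factor=3))  # ['b']
```

With only 5 epochs per trial there are just two decision points, so the savings from early stopping become more visible with a larger epoch budget and more trials.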

Tip: If you would like to learn more about HPO algorithms in AutoGluon, please have a look at Getting started with Advanced HPO Algorithms.