.. _sec_customstorch:

MNIST Training in PyTorch
=========================

In this tutorial, we demonstrate how to do Hyperparameter Optimization (HPO)
using AutoGluon with PyTorch. AutoGluon is a framework-agnostic HPO toolkit
that is compatible with any training code written in Python. The PyTorch code
used in this tutorial is adapted from this `git repo `__. In your
applications, this code can be replaced with your own PyTorch code.

Import the packages:

.. code:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision
    import torchvision.transforms as transforms
    from tqdm.auto import tqdm

Start with an MNIST Example
---------------------------

Data Transforms
~~~~~~~~~~~~~~~

We first apply standard image transforms to our training and validation data:

.. code:: python

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    # the datasets
    trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                          download=True, transform=transform)
    testset = torchvision.datasets.MNIST(root='./data', train=False,
                                         download=True, transform=transform)

.. parsed-literal::
    :class: output

    Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
    Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw
    Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz
    Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw
    Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz
    Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw
    Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
    Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw
    Processing...
    Done!
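Before moving on to the training loop, it can help to confirm that the
transforms behave as expected. The quick check below is not part of the
original tutorial; it simply inspects one transformed sample:

.. code:: python

    # Quick sanity check on the transformed data (illustrative only)
    image, label = trainset[0]
    print(image.shape)                   # torch.Size([1, 28, 28]): one channel, 28x28 pixels
    print(label)                         # integer class label in 0..9
    print(len(trainset), len(testset))   # 60000 training / 10000 test examples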
Main Training Loop
~~~~~~~~~~~~~~~~~~

The following ``train_mnist`` function represents the normal training code a
user would write for training on the MNIST dataset. Python users typically use
an argument parser (``argparse``) to conveniently change default values. The
only additional argument you need to add to your existing Python function is a
``reporter`` object that is used to store the performance achieved under
different hyperparameter settings.

.. code:: python

    def train_mnist(args, reporter):
        # get variables from args
        lr = args.lr
        wd = args.wd
        epochs = args.epochs
        net = args.net
        print('lr: {}, wd: {}'.format(lr, wd))

        device = 'cuda' if torch.cuda.is_available() else 'cpu'

        # Model
        net = net.to(device)

        if device == 'cuda':
            net = nn.DataParallel(net)
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(net.parameters(), lr=args.lr, momentum=0.9,
                                    weight_decay=wd)

        # datasets and dataloaders
        trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                              download=False, transform=transform)
        trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                                  shuffle=True, num_workers=2)

        testset = torchvision.datasets.MNIST(root='./data', train=False,
                                             download=False, transform=transform)
        testloader = torch.utils.data.DataLoader(testset, batch_size=128,
                                                 shuffle=False, num_workers=2)

        # Training
        def train(epoch):
            net.train()
            train_loss, correct, total = 0, 0, 0
            for batch_idx, (inputs, targets) in enumerate(trainloader):
                inputs, targets = inputs.to(device), targets.to(device)
                optimizer.zero_grad()
                outputs = net(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()

        def test(epoch):
            net.eval()
            test_loss, correct, total = 0, 0, 0
            with torch.no_grad():
                for batch_idx, (inputs, targets) in enumerate(testloader):
                    inputs, targets = inputs.to(device), targets.to(device)
                    outputs = net(inputs)
                    loss = criterion(outputs, targets)

                    test_loss += loss.item()
                    _, predicted = outputs.max(1)
                    total += targets.size(0)
                    correct += predicted.eq(targets).sum().item()

            acc = 100.*correct/total
            # 'epoch' reports the number of epochs done
            reporter(epoch=epoch+1, accuracy=acc)

        for epoch in tqdm(range(0, epochs)):
            train(epoch)
            test(epoch)

AutoGluon HPO
-------------

In this section, we cover how to define a searchable network architecture,
convert the training function to be searchable, create the scheduler, and then
launch the experiment.

Define a Searchable Network Architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's define a 'dynamic' network with searchable configurations by simply
adding the decorator :func:`autogluon.obj`. In this example, we only search
two arguments, ``hidden_conv`` and ``hidden_fc``, which represent the number
of hidden channels in the convolutional layer and the width of the fully
connected layer. More information about searchable spaces is available at
:meth:`autogluon.core.space`.

.. code:: python

    import autogluon.core as ag

    @ag.obj(
        hidden_conv=ag.space.Int(6, 12),
        hidden_fc=ag.space.Categorical(80, 120, 160),
    )
    class Net(nn.Module):
        def __init__(self, hidden_conv, hidden_fc):
            super().__init__()
            self.conv1 = nn.Conv2d(1, hidden_conv, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(hidden_conv, 16, 5)
            self.fc1 = nn.Linear(16 * 4 * 4, hidden_fc)
            self.fc2 = nn.Linear(hidden_fc, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 4 * 4)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
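To see why ``fc1`` expects ``16 * 4 * 4`` input features for 28x28 MNIST
images, it can help to trace the spatial sizes through the network. The plain
(undecorated) copy below is a hypothetical helper added for illustration only,
not part of the original tutorial; it lets us run a dummy forward pass with
fixed layer sizes:

.. code:: python

    # Hypothetical plain copy of Net with fixed sizes, just to verify shapes.
    class PlainNet(nn.Module):
        def __init__(self, hidden_conv=8, hidden_fc=120):
            super().__init__()
            self.conv1 = nn.Conv2d(1, hidden_conv, 5)    # 28x28 -> 24x24
            self.pool = nn.MaxPool2d(2, 2)               # halves the spatial size
            self.conv2 = nn.Conv2d(hidden_conv, 16, 5)   # 12x12 -> 8x8
            self.fc1 = nn.Linear(16 * 4 * 4, hidden_fc)  # 16 channels of 4x4 after pooling
            self.fc2 = nn.Linear(hidden_fc, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 4 * 4)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            return self.fc3(x)

    print(PlainNet()(torch.randn(2, 1, 28, 28)).shape)  # torch.Size([2, 10])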
Convert the Training Function to Be Searchable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can simply add the decorator :func:`autogluon.args` to convert the
``train_mnist`` function argument values to be tuned by AutoGluon's
hyperparameter optimizer. In the example below, we specify that the ``lr``
argument is a real value that should be searched on a log scale in the range
0.01 - 0.2. Before passing ``lr`` to your training function, AutoGluon always
selects an actual floating point value to assign to ``lr``, so you do not need
to make any special modifications to your existing code to accommodate the
hyperparameter search.

.. code:: python

    @ag.args(
        lr = ag.space.Real(0.01, 0.2, log=True),
        wd = ag.space.Real(1e-4, 5e-4, log=True),
        net = Net(),
        epochs=5,
    )
    def ag_train_mnist(args, reporter):
        return train_mnist(args, reporter)

Create the Scheduler and Launch the Experiment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For hyperparameter tuning, AutoGluon provides a number of different
schedulers:

- ``FIFOScheduler``: Each training job runs for the full number of epochs
- ``HyperbandScheduler``: Uses successive halving and Hyperband scheduling in
  order to stop unpromising jobs early, so that the available budget is
  allocated more efficiently

Each scheduler is internally configured by a searcher, which determines the
choice of hyperparameter configurations to be run. The default searcher is
``random``: configurations are drawn uniformly at random from the search
space.

.. code:: python

    myscheduler = ag.scheduler.FIFOScheduler(
        ag_train_mnist,
        resource={'num_cpus': 4, 'num_gpus': 1},
        num_trials=2,
        time_attr='epoch',
        reward_attr='accuracy')
    print(myscheduler)

.. parsed-literal::
    :class: output

    FIFOScheduler(
    DistributedResourceManager{
    (Remote: Remote REMOTE_ID: 0, , Resource: NodeResourceManager(8 CPUs, 1 GPUs))
    })

.. code:: python

    myscheduler.run()
    myscheduler.join_jobs()

.. parsed-literal::
    :class: output

    lr: 0.0447213595, wd: 0.0002236068
    lr: 0.028245913732173278, wd: 0.00017160776862349322

We plot the test accuracy achieved over the course of training under each
hyperparameter configuration that AutoGluon tried out (represented by
different colors).

.. code:: python

    myscheduler.get_training_curves(plot=True, use_legend=False)
    print('The Best Configuration and Accuracy are: {}, {}'.format(myscheduler.get_best_config(),
                                                                   myscheduler.get_best_reward()))

.. figure:: output_hpo_148be2_14_0.png

.. parsed-literal::
    :class: output

    The Best Configuration and Accuracy are: {'lr': 0.0447213595, 'net▁hidden_conv': 9, 'net▁hidden_fc▁choice': 0, 'wd': 0.0002236068}, 98.98
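The best configuration is returned as a flat dictionary whose keys follow the
pattern printed above. As a sketch (assuming those key names, and assuming the
``▁choice`` entry indexes into the ``Categorical`` values), one could map it
back to concrete hyperparameters like this:

.. code:: python

    # Sketch: recover concrete values from the best configuration found above.
    # The key names and the meaning of the '...choice' index are taken from the
    # printed output; treat them as an assumption, not a stable API.
    best_config = myscheduler.get_best_config()

    fc_choices = [80, 120, 160]                     # values of ag.space.Categorical above
    hidden_conv = best_config['net▁hidden_conv']
    hidden_fc = fc_choices[best_config['net▁hidden_fc▁choice']]
    lr, wd = best_config['lr'], best_config['wd']
    print(hidden_conv, hidden_fc, lr, wd)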
Search by Bayesian Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While simple to implement, random search is usually not an efficient way to
propose configurations for evaluation. AutoGluon provides a number of
model-based searchers:

- Gaussian process based Bayesian optimization (``bayesopt``)
- SkOpt Bayesian optimization (``skopt``; only with the FIFO scheduler)

Here, ``skopt`` maps to `scikit.optimize `__, whereas ``bayesopt`` is
AutoGluon's own implementation. While ``skopt`` is currently somewhat more
versatile (choice of acquisition function, surrogate model), ``bayesopt`` is
directly optimized for asynchronous parallel scheduling. Importantly,
``bayesopt`` runs with both the FIFO and Hyperband schedulers (while ``skopt``
is restricted to the FIFO scheduler).

When running the following examples and comparing the different schedulers and
searchers, you need to increase ``num_trials`` (or use ``time_out`` instead,
which specifies the search budget in terms of wall-clock time) in order to see
differences in performance.

.. code:: python

    myscheduler = ag.scheduler.FIFOScheduler(
        ag_train_mnist,
        resource={'num_cpus': 4, 'num_gpus': 1},
        searcher='bayesopt',
        num_trials=2,
        time_attr='epoch',
        reward_attr='accuracy')
    print(myscheduler)

.. parsed-literal::
    :class: output

    FIFOScheduler(
    DistributedResourceManager{
    (Remote: Remote REMOTE_ID: 0, , Resource: NodeResourceManager(8 CPUs, 1 GPUs))
    })

.. code:: python

    myscheduler.run()
    myscheduler.join_jobs()

.. parsed-literal::
    :class: output

    lr: 0.0447213595, wd: 0.0002236068
    lr: 0.012211534616190458, wd: 0.0003746004250883022

Search by Asynchronous BOHB
~~~~~~~~~~~~~~~~~~~~~~~~~~~

When training neural networks, it is often more efficient to use early
stopping, and in particular Hyperband scheduling can save a lot of wall-clock
time. AutoGluon provides a combination of Hyperband scheduling with
asynchronous Bayesian optimization (more details can be found `here `__):

.. code:: python

    myscheduler = ag.scheduler.HyperbandScheduler(
        ag_train_mnist,
        resource={'num_cpus': 4, 'num_gpus': 1},
        searcher='bayesopt',
        num_trials=2,
        time_attr='epoch',
        reward_attr='accuracy',
        grace_period=1,
        reduction_factor=3,
        brackets=1)
    print(myscheduler)

.. parsed-literal::
    :class: output

    HyperbandScheduler(terminator: HyperbandStopping_Manager(reward_attr: accuracy, time_attr: epoch, reduction_factor: 3, max_t: 5, brackets: [Bracket: Iter 3.000: None | Iter 1.000: None])

.. code:: python

    myscheduler.run()
    myscheduler.join_jobs()

.. parsed-literal::
    :class: output

    lr: 0.0447213595, wd: 0.0002236068
    lr: 0.02329669348266227, wd: 0.0001016321742732039

**Tip**: If you would like to learn more about the HPO algorithms in
AutoGluon, please have a look at :ref:`sec_custom_advancedhpo`.
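Finally, as mentioned above, the search budget can also be expressed in
wall-clock time rather than as a number of trials. A minimal sketch, assuming
``time_out`` is given in seconds and keeping the other arguments from the
examples above:

.. code:: python

    # Sketch: bound the search by wall-clock time instead of a fixed trial count.
    myscheduler = ag.scheduler.FIFOScheduler(
        ag_train_mnist,
        resource={'num_cpus': 4, 'num_gpus': 1},
        searcher='bayesopt',
        time_out=600,          # assumed to be a budget in seconds
        time_attr='epoch',
        reward_attr='accuracy')
    myscheduler.run()
    myscheduler.join_jobs()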