{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "487344de", "metadata": {}, "source": [ "# Hyperparameter Optimization with AutoGluon\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/advanced/tabular-hpo.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/advanced/tabular-hpo.ipynb)\n", "\n", "\n", "**Tip**: If you are new to AutoGluon, review [Predicting Columns in a Table - Quick Start](tabular-quick-start.ipynb) to learn the basics of the AutoGluon API.\n", "\n", "This tutorial describes how you can perform hyperparameter optimization (HPO) with AutoGluon-Tabular.\n", "\n", "Using the same census data table as in the [Predicting Columns in a Table - Quick Start](tabular-quick-start.ipynb) tutorial, we'll predict the `occupation` of an individual - a multiclass classification problem. Start by importing AutoGluon's TabularPredictor and TabularDataset, and loading the data." ] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon.tabular[all]\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fae7a5f3", "metadata": {}, "outputs": [], "source": [ "from autogluon.tabular import TabularDataset, TabularPredictor\n", "\n", "train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')\n", "test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')\n", "subsample_size = 1000 # subsample data for a faster demo\n", "train_data = train_data.sample(n=subsample_size, random_state=0)\n", "\n", "label = 'occupation'\n", "metric = 'accuracy'" ] }, { "cell_type": "markdown", "source": [ "## Specifying hyperparameters and tuning them\n", "\n", "**Note: We don't recommend doing hyperparameter-tuning with AutoGluon in most cases**. AutoGluon achieves its best performance without hyperparameter tuning and simply specifying one of the available presets, such as `presets=\"best_quality\"`.\n", "\n", "We demonstrate hyperparameter-tuning and how you can provide your own validation dataset that AutoGluon internally relies on to: tune hyperparameters, early-stop iterative training, and construct model ensembles. One reason you may specify validation data is when future test data will stem from a different distribution than training data (and your specified validation data is more representative of the future data that will likely be encountered).\n", "\n", " If you don't have a strong reason to provide your own validation dataset, we recommend you omit the `tuning_data` argument. This lets AutoGluon automatically select validation data from your provided training set (it uses smart strategies such as stratified sampling). For greater control, you can specify the `holdout_frac` argument to tell AutoGluon what fraction of the provided training data to hold out for validation.\n", "\n", "**Caution:** Since AutoGluon tunes internal knobs based on this validation data, performance estimates reported on this data may be over-optimistic. For unbiased performance estimates, you should always call `predict()` on a separate dataset (that was never passed to `fit()`), as we did in the previous **Quick-Start** tutorial. 
{ "cell_type": "markdown", "source": [ "We also emphasize that most options in this tutorial are chosen to minimize runtime for demonstration purposes; you should select more reasonable values in order to obtain high-quality models.\n", "\n", "`fit()` trains neural networks and various types of tree ensembles by default. You can specify various hyperparameter values for each type of model. For each hyperparameter, you can either specify a single fixed value, or a search space of values to consider during hyperparameter optimization. Hyperparameters that you do not specify are left at default settings chosen automatically by AutoGluon, which may be fixed values or search spaces.\n", "\n", "Refer to the [Search Space documentation](../../../api/autogluon.common.space.rst) to learn more about AutoGluon search spaces." ], "metadata": { "collapsed": false }, "id": "98733673" }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "from autogluon.common import space\n", "\n", "nn_options = { # specifies non-default hyperparameter values for neural network models\n", " 'num_epochs': 10, # number of training epochs (controls training time of NN models)\n", " 'learning_rate': space.Real(1e-4, 1e-2, default=5e-4, log=True), # learning rate used in training (real-valued hyperparameter searched on log-scale)\n", " 'activation': space.Categorical('relu', 'softrelu', 'tanh'), # activation function used in NN (categorical hyperparameter, default = first entry)\n", " 'dropout_prob': space.Real(0.0, 0.5, default=0.1), # dropout probability (real-valued hyperparameter)\n", "}\n", "\n", "gbm_options = { # specifies non-default hyperparameter values for LightGBM gradient-boosted trees\n", " 'num_boost_round': 100, # number of boosting rounds (controls training time of GBM models)\n", " 'num_leaves': space.Int(lower=26, upper=66, default=36), # number of leaves in trees (integer hyperparameter)\n", "}\n", "\n", "hyperparameters = { # hyperparameters of each model type\n", " 'GBM': gbm_options,\n", " 'NN_TORCH': nn_options, # NOTE: comment this line out if you get errors on Mac OSX\n", "} # When these keys are missing from the hyperparameters dict, no models of that type are trained\n", "\n", "time_limit = 2*60 # train various models for ~2 min\n", "num_trials = 5 # try at most 5 different hyperparameter configurations for each type of model\n", "search_strategy = 'auto' # tune hyperparameters using a random search routine with a local scheduler\n", "\n", "hyperparameter_tune_kwargs = { # HPO is not performed unless hyperparameter_tune_kwargs is specified\n", " 'num_trials': num_trials,\n", " 'scheduler': 'local',\n", " 'searcher': search_strategy,\n", "} # Refer to the TabularPredictor.fit docstring for all valid values\n", "\n", "predictor = TabularPredictor(label=label, eval_metric=metric).fit(\n", " train_data,\n", " time_limit=time_limit,\n", " hyperparameters=hyperparameters,\n", " hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,\n", ")" ], "metadata": { "collapsed": false }, "id": "87f28cf4" }, { "cell_type": "markdown", "source": "Use the trained models to predict on the test data:", "metadata": { "collapsed": false }, "id": "816e4beb" }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": "predictor.predict_proba(test_data)", "metadata": { "collapsed": false }, "id": "3bf2965a" },
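{ "cell_type": "markdown", "source": [ "You can also review what happened during the hyperparameter search itself. As a quick sketch, `fit_summary()` reports the models trained and their validation scores, including the hyperparameter configurations evaluated for each tuned model type (we pass `verbosity=1` here just to keep the printed output short)." ], "metadata": { "collapsed": false }, "id": "f3a4b5c6" }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "# Summarize the fit: models trained, validation scores, and HPO results\n", "results = predictor.fit_summary(verbosity=1)" ], "metadata": { "collapsed": false }, "id": "a4b5c6d7" },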
null, "outputs": [], "source": "predictor.leaderboard(test_data)", "metadata": { "collapsed": false }, "id": "1bfc4fe3" }, { "cell_type": "markdown", "source": "In the above example, the predictive performance may be poor because we specified very little training to ensure quick runtimes. You can call `fit()` multiple times while modifying the above settings to better understand how these choices affect performance outcomes. For example: you can increase `subsample_size` to train using a larger dataset, increase the `num_epochs` and `num_boost_round` hyperparameters, and increase the `time_limit` (which you should do for all code in these tutorials). To see more detailed output during the execution of `fit()`, you can also pass in the argument: `verbosity=3`.", "metadata": { "collapsed": false }, "id": "1d06b7ab" } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }