.. _sec_tabularcustommetric:

Adding a custom metric to AutoGluon
===================================

**Tip**: If you are new to AutoGluon, review :ref:`sec_tabularquick` to learn the basics of the AutoGluon API.

This tutorial describes how to add a custom evaluation metric to AutoGluon that is used to inform validation scores, model ensembling, hyperparameter tuning, and more.

In this example, we show a variety of evaluation metrics and how to convert them to an AutoGluon Scorer, which can then be passed to AutoGluon models and predictors.

First, we will randomly generate 10 ground truth labels and predictions, and show how to calculate metric scores from them.

.. code:: python

    import numpy as np

    y_true = np.random.randint(low=0, high=2, size=10)
    y_pred = np.random.randint(low=0, high=2, size=10)

    print(f'y_true: {y_true}')
    print(f'y_pred: {y_pred}')


.. parsed-literal::
    :class: output

    y_true: [0 1 1 1 0 1 1 1 1 0]
    y_pred: [0 0 0 1 1 0 1 1 1 1]


Custom Accuracy Metric
----------------------

We will start with calculating accuracy. A prediction is correct if the predicted value is the same as the true value; otherwise it is wrong.

.. code:: python

    import sklearn.metrics

    sklearn.metrics.accuracy_score(y_true, y_pred)


.. parsed-literal::
    :class: output

    0.5


Now, let's convert this evaluation metric to an AutoGluon Scorer.

We do this by calling ``autogluon.core.metrics.make_scorer``.

.. code:: python

    from autogluon.core.metrics import make_scorer

    ag_accuracy_scorer = make_scorer(name='accuracy',
                                     score_func=sklearn.metrics.accuracy_score,
                                     optimum=1,
                                     greater_is_better=True)

When creating the Scorer, we need to specify a name for the Scorer. The name can be any string; it is used when printing information about the Scorer during training.

Next, we specify the ``score_func``. This is the function we want to wrap, in this case sklearn's ``accuracy_score`` function.

We then need to specify the ``optimum`` value. This is necessary when calculating error as opposed to score. Error is calculated as ``optimum - score``. It is also useful to identify when a score is optimal and cannot be improved.

Finally, we need to specify ``greater_is_better``. In this case, ``greater_is_better=True`` because the best value returned is 1 and the worst value returned is 0. It is very important to set this value correctly; otherwise AutoGluon will try to optimize for the **worst** model instead of the best.

Once created, the AutoGluon Scorer can be called in the same fashion as the original metric.

.. code:: python

    ag_accuracy_scorer(y_true, y_pred)


.. parsed-literal::
    :class: output

    0.5


Custom Mean Squared Error Metric
--------------------------------

Next, let's show examples of how to convert regression metrics into Scorers.

First, we generate random ground truth labels and their predictions; this time, however, they are floats instead of integers.

.. code:: python

    y_true = np.random.rand(10)
    y_pred = np.random.rand(10)

    print(f'y_true: {y_true}')
    print(f'y_pred: {y_pred}')


.. parsed-literal::
    :class: output

    y_true: [0.50694008 0.23862178 0.00379843 0.61517489 0.90021045 0.8066451
     0.13151691 0.20783484 0.25537695 0.62505196]
    y_pred: [0.10423373 0.20812381 0.93108274 0.13547359 0.20938752 0.45620474
     0.76800387 0.4387486  0.15857253 0.8696109 ]


A common regression metric is Mean Squared Error:

.. code:: python

    sklearn.metrics.mean_squared_error(y_true, y_pred)


.. parsed-literal::
    :class: output

    0.23807337929909225


.. code:: python

    ag_mean_squared_error_scorer = make_scorer(name='mean_squared_error',
                                               score_func=sklearn.metrics.mean_squared_error,
                                               optimum=0,
                                               greater_is_better=False)

In this case, ``optimum`` is 0 because this is an error metric.

Additionally, ``greater_is_better=False`` because sklearn reports error as positive values, and the lower the value is, the better.

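
The roles of ``optimum`` and ``greater_is_better`` can be mimicked in a few lines of plain Python. The sketch below is a simplified illustration, not AutoGluon's implementation: ``ToyScorer`` and the two metric helpers are hypothetical names introduced here, and it assumes that a lower-is-better metric is negated so that a larger score is always better, with error then computed as ``optimum - score``.

.. code:: python

    from dataclasses import dataclass
    from typing import Callable, Sequence


    def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
        """Fraction of predictions equal to the true label (higher is better)."""
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


    def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
        """Average squared difference (lower is better)."""
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


    @dataclass
    class ToyScorer:
        """Hypothetical stand-in for a Scorer, for illustration only."""
        name: str
        score_func: Callable[[Sequence, Sequence], float]
        optimum: float
        greater_is_better: bool

        def __call__(self, y_true, y_pred) -> float:
            # Negate lower-is-better metrics so that larger always means better.
            sign = 1 if self.greater_is_better else -1
            return sign * self.score_func(y_true, y_pred)

        def error(self, y_true, y_pred) -> float:
            # Error as described in the text: optimum - score (0 means optimal).
            return self.optimum - self(y_true, y_pred)


    toy_accuracy = ToyScorer('accuracy', accuracy, optimum=1, greater_is_better=True)
    toy_mse = ToyScorer('mean_squared_error', mean_squared_error, optimum=0, greater_is_better=False)

    labels = [0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
    preds = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
    print(toy_accuracy(labels, preds), toy_accuracy.error(labels, preds))  # 0.5 0.5
    print(toy_mse([1.0, 2.0], [1.0, 4.0]), toy_mse.error([1.0, 2.0], [1.0, 4.0]))  # -2.0 2.0

If ``greater_is_better=True`` were passed for the MSE example in this sketch, the larger squared error would receive the higher score, which illustrates why the flag must match the metric.
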
A very important point about AutoGluon Scorers is that internally, they will always report scores in ``greater_is_better=True`` form. This means that if the original metric was ``greater_is_better=False``, AutoGluon's Scorer will flip the sign of the value. Therefore, error will be represented as negative values.

This is done to ensure consistency between different metrics.

.. code:: python

    ag_mean_squared_error_scorer(y_true, y_pred)


.. parsed-literal::
    :class: output

    -0.23807337929909225


We can also specify metrics outside of sklearn. For example, below is a minimal implementation of mean squared error:

.. code:: python

    def mse_func(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        return ((y_true - y_pred) ** 2).mean()

    mse_func(y_true, y_pred)


.. parsed-literal::
    :class: output

    0.23807337929909225


All that is required is that the function take two arguments, ``y_true`` and ``y_pred`` (or ``y_pred_proba``), as numpy arrays, and return a float value.

With the same code as before, we can create an AutoGluon Scorer.

.. code:: python

    ag_mean_squared_error_custom_scorer = make_scorer(name='mean_squared_error',
                                                      score_func=mse_func,
                                                      optimum=0,
                                                      greater_is_better=False)
    ag_mean_squared_error_custom_scorer(y_true, y_pred)


.. parsed-literal::
    :class: output

    -0.23807337929909225


Custom ROC AUC Metric
---------------------

Here we show an example of a thresholding metric, ``roc_auc``. A thresholding metric cares about the relative ordering of predictions, but not their absolute values.

.. code:: python

    y_true = np.random.randint(low=0, high=2, size=10)
    y_pred_proba = np.random.rand(10)

    print(f'y_true:       {y_true}')
    print(f'y_pred_proba: {y_pred_proba}')


.. parsed-literal::
    :class: output

    y_true:       [0 0 1 1 1 1 1 0 1 1]
    y_pred_proba: [0.95737253 0.17718424 0.39690819 0.53387794 0.98888768 0.09954354
     0.37815557 0.14345184 0.31584168 0.32894318]


.. code:: python

    sklearn.metrics.roc_auc_score(y_true, y_pred_proba)


.. parsed-literal::
    :class: output

    0.6190476190476191


We will need to specify ``needs_threshold=True`` in order for downstream models to properly use the metric.

.. code:: python

    # Score functions that need decision values
    ag_roc_auc_scorer = make_scorer(name='roc_auc',
                                    score_func=sklearn.metrics.roc_auc_score,
                                    optimum=1,
                                    greater_is_better=True,
                                    needs_threshold=True)
    ag_roc_auc_scorer(y_true, y_pred_proba)


.. parsed-literal::
    :class: output

    0.6190476190476191


Using Custom Metrics in TabularPredictor
----------------------------------------

Now that we have created several custom Scorers, let's use them for training and evaluating models.

For this tutorial, we will be using the Adult Income dataset.

.. code:: python

    from autogluon.tabular import TabularDataset

    train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
    test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
    label = 'class'  # name of the column we want to predict
    train_data = train_data.sample(n=1000, random_state=0)  # subsample for faster demo

    train_data.head(5)


.. raw:: html

    <table>
      <thead>
        <tr><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education-num</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital-gain</th><th>capital-loss</th><th>hours-per-week</th><th>native-country</th><th>class</th></tr>
      </thead>
      <tbody>
        <tr><th>6118</th><td>51</td><td>Private</td><td>39264</td><td>Some-college</td><td>10</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Wife</td><td>White</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&gt;50K</td></tr>
        <tr><th>23204</th><td>58</td><td>Private</td><td>51662</td><td>10th</td><td>6</td><td>Married-civ-spouse</td><td>Other-service</td><td>Wife</td><td>White</td><td>Female</td><td>0</td><td>0</td><td>8</td><td>United-States</td><td>&lt;=50K</td></tr>
        <tr><th>29590</th><td>40</td><td>Private</td><td>326310</td><td>Some-college</td><td>10</td><td>Married-civ-spouse</td><td>Craft-repair</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>44</td><td>United-States</td><td>&lt;=50K</td></tr>
        <tr><th>18116</th><td>37</td><td>Private</td><td>222450</td><td>HS-grad</td><td>9</td><td>Never-married</td><td>Sales</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>2339</td><td>40</td><td>El-Salvador</td><td>&lt;=50K</td></tr>
        <tr><th>33964</th><td>62</td><td>Private</td><td>109190</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Husband</td><td>White</td><td>Male</td><td>15024</td><td>0</td><td>40</td><td>United-States</td><td>&gt;50K</td></tr>
      </tbody>
    </table>

.. code:: python

    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label).fit(train_data, hyperparameters='toy')

    predictor.leaderboard(test_data, silent=True)


.. parsed-literal::
    :class: output

    No path specified. Models will be saved in: "AutogluonModels/ag-20210831_214705/"
    Beginning AutoGluon training ...
    AutoGluon will save models to "AutogluonModels/ag-20210831_214705/"
    AutoGluon Version: 0.3.1b20210831
    Train Data Rows: 1000
    Train Data Columns: 14
    Preprocessing data ...
    AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
        2 unique label values: [' >50K', ' <=50K']
        If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
    Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
        Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
        To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
    Using Feature Generators to preprocess the data ...
    Fitting AutoMLPipelineFeatureGenerator...
        Available Memory: 22188.73 MB
        Train Data (Original) Memory Usage: 0.59 MB (0.0% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
            Fitting AsTypeFeatureGenerator...
                Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
        Stage 2 Generators:
            Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
            Fitting IdentityFeatureGenerator...
            Fitting CategoryFeatureGenerator...
                Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
            Fitting DropUniqueFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
            ('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
            ('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
        Types of features in processed data (raw dtype, special dtypes):
            ('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
            ('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
            ('int', ['bool']) : 1 | ['sex']
        0.1s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
    Data preprocessing and feature engineering runtime = 0.09s ...
    AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
        To change this, specify the eval_metric argument of fit()
    Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
    Fitting 4 L1 models ...
    Fitting model: LightGBM ...
        0.77 = Validation score (accuracy)
        0.33s = Training runtime
        0.01s = Validation runtime
    Fitting model: CatBoost ...
        0.86 = Validation score (accuracy)
        0.1s = Training runtime
        0.01s = Validation runtime
    Fitting model: XGBoost ...
        0.84 = Validation score (accuracy)
        0.09s = Training runtime
        0.01s = Validation runtime
    Fitting model: NeuralNetMXNet ...
        0.76 = Validation score (accuracy)
        1.91s = Training runtime
        0.03s = Validation runtime
    Fitting model: WeightedEnsemble_L2 ...
        0.87 = Validation score (accuracy)
        0.11s = Training runtime
        0.0s = Validation runtime
    AutoGluon training complete, total runtime = 2.8s ...
    TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20210831_214705/")


.. raw:: html

    <table>
      <thead>
        <tr><th></th><th>model</th><th>score_test</th><th>score_val</th><th>pred_time_test</th><th>pred_time_val</th><th>fit_time</th><th>pred_time_test_marginal</th><th>pred_time_val_marginal</th><th>fit_time_marginal</th><th>stack_level</th><th>can_infer</th><th>fit_order</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>XGBoost</td><td>0.847784</td><td>0.84</td><td>0.026014</td><td>0.005666</td><td>0.094983</td><td>0.026014</td><td>0.005666</td><td>0.094983</td><td>1</td><td>True</td><td>3</td></tr>
        <tr><th>1</th><td>CatBoost</td><td>0.844406</td><td>0.86</td><td>0.013141</td><td>0.008546</td><td>0.098347</td><td>0.013141</td><td>0.008546</td><td>0.098347</td><td>1</td><td>True</td><td>2</td></tr>
        <tr><th>2</th><td>WeightedEnsemble_L2</td><td>0.842870</td><td>0.87</td><td>1.010880</td><td>0.039434</td><td>2.123772</td><td>0.002043</td><td>0.000626</td><td>0.110719</td><td>2</td><td>True</td><td>5</td></tr>
        <tr><th>3</th><td>NeuralNetMXNet</td><td>0.782885</td><td>0.76</td><td>0.995696</td><td>0.030262</td><td>1.914706</td><td>0.995696</td><td>0.030262</td><td>1.914706</td><td>1</td><td>True</td><td>4</td></tr>
        <tr><th>4</th><td>LightGBM</td><td>0.780940</td><td>0.77</td><td>0.008550</td><td>0.007176</td><td>0.334513</td><td>0.008550</td><td>0.007176</td><td>0.334513</td><td>1</td><td>True</td><td>1</td></tr>
      </tbody>
    </table>

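We can pass our custom metrics into ``predictor.leaderboard`` via the ``extra_metrics`` argument:

.. code:: python

    predictor.leaderboard(test_data, extra_metrics=[ag_roc_auc_scorer, ag_accuracy_scorer], silent=True)


.. raw:: html

    <table>
      <thead>
        <tr><th></th><th>model</th><th>score_test</th><th>roc_auc</th><th>accuracy</th><th>score_val</th><th>pred_time_test</th><th>pred_time_val</th><th>fit_time</th><th>pred_time_test_marginal</th><th>pred_time_val_marginal</th><th>fit_time_marginal</th><th>stack_level</th><th>can_infer</th><th>fit_order</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>XGBoost</td><td>0.847784</td><td>0.894112</td><td>0.847784</td><td>0.84</td><td>0.024795</td><td>0.005666</td><td>0.094983</td><td>0.024795</td><td>0.005666</td><td>0.094983</td><td>1</td><td>True</td><td>3</td></tr>
        <tr><th>1</th><td>CatBoost</td><td>0.844406</td><td>0.863760</td><td>0.844406</td><td>0.86</td><td>0.013006</td><td>0.008546</td><td>0.098347</td><td>0.013006</td><td>0.008546</td><td>0.098347</td><td>1</td><td>True</td><td>2</td></tr>
        <tr><th>2</th><td>WeightedEnsemble_L2</td><td>0.842870</td><td>0.878875</td><td>0.842870</td><td>0.87</td><td>0.956714</td><td>0.039434</td><td>2.123772</td><td>0.001939</td><td>0.000626</td><td>0.110719</td><td>2</td><td>True</td><td>5</td></tr>
        <tr><th>3</th><td>NeuralNetMXNet</td><td>0.782885</td><td>0.786681</td><td>0.782885</td><td>0.76</td><td>0.941769</td><td>0.030262</td><td>1.914706</td><td>0.941769</td><td>0.030262</td><td>1.914706</td><td>1</td><td>True</td><td>4</td></tr>
        <tr><th>4</th><td>LightGBM</td><td>0.780940</td><td>0.861131</td><td>0.780940</td><td>0.77</td><td>0.008325</td><td>0.007176</td><td>0.334513</td><td>0.008325</td><td>0.007176</td><td>0.334513</td><td>1</td><td>True</td><td>1</td></tr>
      </tbody>
    </table>
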
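
The ``roc_auc`` column above is computed from predicted probabilities rather than class labels. As described in the ROC AUC section, a thresholding metric depends only on the ordering of the predictions. The sketch below checks this in plain Python with a pairwise-comparison form of ROC AUC. ``pairwise_roc_auc`` is a hypothetical helper written for this illustration (it is not part of sklearn or AutoGluon), and the inputs are the ``y_true`` and ``y_pred_proba`` values printed in that section.

.. code:: python

    from itertools import product
    from typing import Sequence


    def pairwise_roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
        """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
        pos = [s for t, s in zip(y_true, y_score) if t == 1]
        neg = [s for t, s in zip(y_true, y_score) if t == 0]
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
        return wins / (len(pos) * len(neg))


    y_true = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
    y_pred_proba = [0.95737253, 0.17718424, 0.39690819, 0.53387794, 0.98888768,
                    0.09954354, 0.37815557, 0.14345184, 0.31584168, 0.32894318]

    auc = pairwise_roc_auc(y_true, y_pred_proba)
    # Squaring is monotonic on [0, 1], so the ordering (and the AUC) is unchanged.
    auc_squared = pairwise_roc_auc(y_true, [p ** 2 for p in y_pred_proba])
    print(auc, auc_squared)

Both values equal 13/21 (about 0.619), which matches the ``sklearn.metrics.roc_auc_score`` output shown earlier, even though the squared scores have different absolute values.
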
We can also pass our custom metric into the Predictor itself by specifying it during initialization via the ``eval_metric`` parameter:

.. code:: python

    predictor_custom = TabularPredictor(label=label, eval_metric=ag_roc_auc_scorer).fit(train_data, hyperparameters='toy')

    predictor_custom.leaderboard(test_data, silent=True)


.. parsed-literal::
    :class: output

    No path specified. Models will be saved in: "AutogluonModels/ag-20210831_214710/"
    Beginning AutoGluon training ...
    AutoGluon will save models to "AutogluonModels/ag-20210831_214710/"
    AutoGluon Version: 0.3.1b20210831
    Train Data Rows: 1000
    Train Data Columns: 14
    Preprocessing data ...
    AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
        2 unique label values: [' >50K', ' <=50K']
        If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
    Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
        Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
        To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
    Using Feature Generators to preprocess the data ...
    Fitting AutoMLPipelineFeatureGenerator...
        Available Memory: 22052.75 MB
        Train Data (Original) Memory Usage: 0.59 MB (0.0% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
            Fitting AsTypeFeatureGenerator...
                Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
        Stage 2 Generators:
            Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
            Fitting IdentityFeatureGenerator...
            Fitting CategoryFeatureGenerator...
                Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
            Fitting DropUniqueFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
            ('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
            ('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
        Types of features in processed data (raw dtype, special dtypes):
            ('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
            ('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
            ('int', ['bool']) : 1 | ['sex']
        0.1s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
    Data preprocessing and feature engineering runtime = 0.09s ...
    AutoGluon will gauge predictive performance using evaluation metric: 'roc_auc'
        This metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()
        To change this, specify the eval_metric argument of fit()
    Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
    Fitting 4 L1 models ...
    Fitting model: LightGBM ...
        0.85 = Validation score (roc_auc)
        0.1s = Training runtime
        0.01s = Validation runtime
    Fitting model: CatBoost ...
        0.8693 = Validation score (roc_auc)
        0.04s = Training runtime
        0.01s = Validation runtime
    Fitting model: XGBoost ...
        0.8585 = Validation score (roc_auc)
        0.03s = Training runtime
        0.01s = Validation runtime
    Fitting model: NeuralNetMXNet ...
        0.8096 = Validation score (roc_auc)
        1.07s = Training runtime
        0.03s = Validation runtime
    Fitting model: WeightedEnsemble_L2 ...
        0.878 = Validation score (roc_auc)
        0.4s = Training runtime
        0.0s = Validation runtime
    AutoGluon training complete, total runtime = 1.87s ...
    TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20210831_214710/")


.. raw:: html

    <table>
      <thead>
        <tr><th></th><th>model</th><th>score_test</th><th>score_val</th><th>pred_time_test</th><th>pred_time_val</th><th>fit_time</th><th>pred_time_test_marginal</th><th>pred_time_val_marginal</th><th>fit_time_marginal</th><th>stack_level</th><th>can_infer</th><th>fit_order</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>WeightedEnsemble_L2</td><td>0.897874</td><td>0.878010</td><td>1.113530</td><td>0.051806</td><td>1.624670</td><td>0.002897</td><td>0.001139</td><td>0.395167</td><td>2</td><td>True</td><td>5</td></tr>
        <tr><th>1</th><td>XGBoost</td><td>0.894331</td><td>0.858534</td><td>0.025804</td><td>0.005950</td><td>0.027178</td><td>0.025804</td><td>0.005950</td><td>0.027178</td><td>1</td><td>True</td><td>3</td></tr>
        <tr><th>2</th><td>CatBoost</td><td>0.887425</td><td>0.869325</td><td>0.013491</td><td>0.008617</td><td>0.036230</td><td>0.013491</td><td>0.008617</td><td>0.036230</td><td>1</td><td>True</td><td>2</td></tr>
        <tr><th>3</th><td>LightGBM</td><td>0.870968</td><td>0.849980</td><td>0.008809</td><td>0.006791</td><td>0.095563</td><td>0.008809</td><td>0.006791</td><td>0.095563</td><td>1</td><td>True</td><td>1</td></tr>
        <tr><th>4</th><td>NeuralNetMXNet</td><td>0.833452</td><td>0.809580</td><td>1.062528</td><td>0.029309</td><td>1.070532</td><td>1.062528</td><td>0.029309</td><td>1.070532</td><td>1</td><td>True</td><td>4</td></tr>
      </tbody>
    </table>

That's all it takes to create and use custom metrics in AutoGluon!

If you create a custom metric, consider `submitting a PR `__ so that we can add it officially to AutoGluon!

For a tutorial on implementing custom models in AutoGluon, refer to :ref:`sec_tabularcustommodel`.

For more tutorials, refer to :ref:`sec_tabularquick` and :ref:`sec_tabularadvanced`.